Convert MS Word bullet list to HTML links

DomBett · April 26, 2019, 3:38pm

Here’s a bit of a challenge. I regularly get Word docs that include a bulleted list in the following format:
• Some Explanatory Text: https://www.someurl.com

There’s always a bullet, followed by plain text, a colon and then the URL. I would like to automate a conversion of each line to HTML:

<li><a href="https://www.someurl.com">Some Explanatory Text</a></li>

Any ideas? If I had to do it on iOS, I’d take that too. (Asking for them to be formatted differently at creation is not on the table, unfortunately.) Thanks.

AAALLL · April 26, 2019, 6:15pm

You can do this with basically any programming language. Do you have a language you’re most comfortable with?

DomBett · April 26, 2019, 7:23pm

I would say I’m most comfortable with AppleScript. I recognize that there’s a pattern to match, but I was hoping someone could get me started in the right direction.

AAALLL · April 27, 2019, 12:50am

Yes, I can easily get you started.

By the way, if you are the one who creates these documents and want a less complicated script, you could reverse the order (URL first, then explanation); it would be MUCH easier. But I will work with what we’ve got for now.

Here’s what I would recommend:

First export the Word file as a txt.
Read the text file line by line.
Save the line as a variable; make 2 other copies of that variable (for redundancy purposes)
Split the first by • and : {I’ll call this TEXT}.
Split the second by “https” and “.com” (if you need .org or any others, you could instead just split until the end of the line) {I’ll call this LINK}.

I’m not sure where you want to output these, so I’ll just show you how to create the HTML.

Finally you’ll need to concatenate the strings.

"<li><a href=\"https" & LINK & "\">" & TEXT & "</a></li>" -- Depending on how you split the LINK, what you put after the href= may change.
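
The steps above can be sketched in Python, the language used later in this thread. This is only a rough illustration, not a finished tool: the function name build_li is made up for the example, it assumes every URL starts with "https", and it uses the "split until the end of the line" variant for LINK.

```python
def build_li(line):
    """Build one HTML list item from a line like '• Text: https://...'."""
    # Skip lines that do not have the expected bullet, colon and URL.
    if "•" not in line or ":" not in line or "https" not in line:
        return None
    # TEXT: everything between the bullet and the first colon.
    text = line.split("•", 1)[1].split(":", 1)[0].strip()
    # LINK: everything after the first "https", up to the end of the line.
    link = line.split("https", 1)[1].strip()
    # Concatenate, putting the "https" that the split removed back in.
    return '<li><a href="https' + link + '">' + text + "</a></li>"


sample = "• Some Explanatory Text: https://www.someurl.com"
print(build_li(sample))
# prints: <li><a href="https://www.someurl.com">Some Explanatory Text</a></li>
```

To handle a whole exported text file, you would call the function once per line and skip any result that is None.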

Something like this should get you close. Let me know if there is any other help I can give.

DomBett · April 27, 2019, 2:09pm

Thank you, that’s a big help. I am running into a snag. I know how to split the line based on a delimiter, and I can see why we can’t just split based on the colon (there being two colons in there), but I just don’t know how to split based on a beginning and ending delimiter. Thanks for your help. I’ve been an AppleScript tinkerer for years but haven’t really progressed beyond beginner/intermediate level.

AAALLL · April 27, 2019, 4:52pm

I don’t know if I understand your entire question. I do see that you are trying to split it, and splitting it by the colon will create 3 strings. I don’t think there is a way in AppleScript to split a string by only the first occurrence of a colon. So what I would recommend is splitting it up into three strings (“Some Explanatory Text”, “https” and “//www.someurl.com”).

Then when you are creating the final concatenated string at the end, putting those together in the rearranged order and adding the https: as a string is probably your best bet.
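
As a rough sketch of that idea, here it is in Python rather than AppleScript (the function name rearrange is just for illustration). It splits on every colon, keeps the first piece as the text, and puts the colon back between the scheme and the rest of the URL. Rejoining everything after the second piece is an extra precaution I am assuming is wanted, in case a URL contains further colons such as a port number.

```python
def rearrange(line):
    """Split on every colon, then rebuild the URL from the leftover pieces."""
    # Remove the leading bullet and spaces, then split on every colon.
    parts = line.strip().lstrip("• ").split(":")
    if len(parts) < 3:
        return None  # not in the "text: scheme://rest" shape
    text = parts[0].strip()      # "Some Explanatory Text"
    scheme = parts[1].strip()    # "https"
    # Rejoin the remaining pieces in case the URL has more colons (e.g. a port).
    rest = ":".join(parts[2:])   # "//www.someurl.com"
    return '<li><a href="' + scheme + ":" + rest + '">' + text + "</a></li>"


print(rearrange("• Some Explanatory Text: https://www.someurl.com"))
# prints: <li><a href="https://www.someurl.com">Some Explanatory Text</a></li>
```

The same rearranging should carry over to AppleScript text item delimiters, though I have not tested an AppleScript version.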

Let me know if that answered your questions.

This may also help https://stackoverflow.com/questions/1716440/applescript-index-of-substring-in-string

dustinknopoff · April 27, 2019, 6:50pm

If you have python3, the below script will run on macOS and in Pythonista. It uses regular expressions to extract the two parts you need.

Python

import re
import sys


def find_and_format(contents):
    """find_and_format
    File must be a series of text in the format:
        • Some Explanatory Text: https://www.someurl.com
    separated by newlines
    Arguments:
        contents {[str]} -- [contents of a file]
    """
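    # Capture the text after the bullet as "name" and everything after ": " as "url".
    r = re.compile(r'(?<=• )(?P<name>.*(?=: )): (?P<url>.*)')
    # findall returns one (name, url) tuple per matching line.
    matches = re.findall(r, contents)
    html_out = ""
    for name, url in matches:
        # End each item with a newline so the output has one <li> per line.
        html_out += "<li><a href=\"" + url + "\">" + name + "</a></li>\n"
    print(html_out)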


if __name__ == '__main__':
    with open(sys.argv[1], 'r') as f:
        contents = f.read()
        find_and_format(contents)

Otherwise, you can try using pandoc, a command-line tool that converts between many document formats (docx, HTML, PDF, Markdown, and a lot more).

DomBett · April 30, 2019, 10:05pm

Thanks for your help, everyone. I decided to go in a different direction and work it out in Keyboard Maestro. It’s a bit more manual than I’d had it before, but I will continue to refine it to make it more automated. So now, I paste the bulleted list into BBEdit, strip out the bullets, select the URL in the first line, invoke the KM macro, select the next URL and repeat until done, and then have BBEdit add the HTML for the list items back in. Perhaps not as quick as it can be yet, but I got 90% of what I wanted with minimal effort and that’s good enough for now.

Thanks again for helping me think through the structure of the problem.