A Kind of Challenge to help out a historian - Crawling a list of HTML links and grabbing text from each link?

Mark_Robertson · November 3, 2019, 12:07am

Is this possible??

I have some experience with Keyboard Maestro and iOS Shortcuts, but I don’t think that knowledge will help me here.

Here is the page: http://www.digitalhistory.uh.edu/references/landmark.cfm

It’s just a simple list of links, and on the other end a simple text document. I have been grabbing the text via Shortcuts using regex and markdown. But this requires me to visit each page and execute the Shortcut: rinse, repeat.

Can this be done in some other fashion?

Tjluoma · November 3, 2019, 12:29am

This would be fairly easy to script on a Mac, probably using wget and maybe a few other tools.

To clarify: you’re looking for just the text (no images) of each of those linked articles, and you would like each one saved in its own text file?

Mark_Robertson · November 3, 2019, 1:08am

Precisely. I am unfortunately not familiar with AppleScript. Is there something I could look at? Some kind of documentation?

Mark_Robertson · November 3, 2019, 1:13am

Well, I just took a look at that application. It seems easy enough, perhaps. I just don’t know all the syntax. Would it be able to distinguish between navigation links and content? What I mean is that the website has a whole sidebar with all kinds of navigation links, but the main page has all the links to the content I need. Is it still possible?

Tjluoma · November 3, 2019, 1:27am

OK, so I wrote a shell script. Because, well, it’s what I do. You can find it here:

https://files.luo.ma/automators-talk/5900/links-to-markdown.sh

The script doesn’t use wget but instead uses lynx and (optionally) html2text.py.

I’ve commented on the shell script so hopefully it might be a useful tool for learning how to do things like this in the future.

I also downloaded all of the links to text files…

… once using html2text.py: https://files.luo.ma/automators-talk/5900/html2text.zip

…and once using lynx: https://files.luo.ma/automators-talk/5900/lynx.zip

Check out the files in those zips and see if either/both of them are what you want.

Updated to fix link. See below.
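
On the earlier question of separating navigation links from content links: one possible approach is to filter the link list by URL pattern. The following is only a minimal sketch, not the linked script. The function name `filter_content_links` is hypothetical, and it assumes that every document page matches `disp_textbook.cfm?...psid=<digits>`, as the example URLs in this thread do.

```shell
#!/bin/sh
# Hypothetical helper: read text containing URLs on stdin and print only
# those that look like document pages, one per line, without duplicates.
# Assumption: document pages all match disp_textbook.cfm?...psid=<digits>.
filter_content_links() {
    grep -Eo 'https?://[^[:space:]]*disp_textbook\.cfm\?[^[:space:]]*psid=[0-9]+' |
        sort -u
}

# Example use (needs lynx and network access, so it is left commented out):
# lynx -dump -listonly -nonumbers \
#     'http://www.digitalhistory.uh.edu/references/landmark.cfm' |
#     filter_content_links > urls.txt
```

Sidebar links that point elsewhere on the site would simply fail to match the pattern and be dropped.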

Mark_Robertson · November 3, 2019, 1:41am

Wow. That was effective. Thank you. The first link to the files gave me a forbidden error. But the second (lynx) worked perfectly. The script is also well commented. Thanks. I have not had time yet to dig that deep into these things. Too busy teaching! But Shortcuts and Keyboard Maestro were easy entries.

Now I just need to clean up the markdown and remove a few elements (which I should be able to do using regex and replace) and place the elements into Markdown metadata fields.

Tjluoma · November 3, 2019, 2:17am

oops, try https://files.luo.ma/automators-talk/5900/html2text.zip for the other one. Fixed permissions error.

You might like the formatting of the html2text ones better.

If you let me know what elements you want in the fields, I might be able to add that to the script. Let me know, I’ll be around tomorrow too.

Mark_Robertson · November 3, 2019, 3:06am

I will be removing all links (students tend to click on links regardless) and moving the “Digital History ID#” into the metadata. I will also be putting the title into an H1, and the author and date into separate H2s. I will italicize the “Annotation:” paragraphs and put a divider above the “Document:” paragraph.

You have just saved me hours of work. Thank you again.

Tjluoma · November 3, 2019, 4:10am

Ah… well, a couple of those are easy. For example, if I change this:

lynx -dump -nomargins -width='10000' -assume_charset=UTF-8 -pseudo_inlines "$line" >>| "$FILENAME"

to this

lynx -nonumbers -nolist -dump -nomargins -width='10000' -assume_charset=UTF-8 -pseudo_inlines "$line" >>| "$FILENAME"

It will not include the links.

Let me play with it a bit and see what else I can come up with.

Tjluoma · November 3, 2019, 4:17am

When you say ‘author’ do you mean the “source” such as

“Source: From Revolution to Reconstruction”

from http://www.digitalhistory.uh.edu/disp_textbook.cfm?smtID=3&psid=3993 ?

Mark_Robertson · November 3, 2019, 4:41am

No. This would probably not work, but some sources have “Title”, “Author:” and “Date”. But others don’t have an author listed. Thanks again. More than I was looking for but I am sure glad it is going to save me that much time.

Tjluoma · November 3, 2019, 5:42am

Oh! So they do. Ok, that’s no problem to deal with.

I love it when scripting can save someone time, especially when it’s time spent doing something fairly unexciting like cleaning up text.

Take a look at the text files in https://files.luo.ma/automators-talk/5900/ and see if that’s what you’re looking for.

If yes, https://files.luo.ma/automators-talk/5900/Archive-01.zip is a zip with all of the text files in it.

If no, let me know what you’d like to see changed. Pretty much any of those elements can be moved, removed, or reformatted.

(Some of the formatting in the original documents is lacking, and we can’t really fix that but most other stuff is possible.)

I’m headed to bed, but I’ll check back tomorrow.

ps - the biggest issue that I see is that if the “Annotation” is more than one paragraph, I don’t think the italics will work. Now, if we were making HTML, we could wrap the whole thing in a <div id=annotation> and then use CSS to tell it to italicize the whole thing. There may be a way around that, though. I’ll have a think on it in the morning.

pps - I also put the ID number in the filename because I thought that might be useful for you later on if you need to go back and find the original, but I can easily remove that too if it’s not helpful.

Mark_Robertson · November 3, 2019, 1:13pm

Excellent. The files look great.

What changes did you make to the script, so that I can run it myself in the future? There are actually dozens of pages that have court cases, documents, etc., so I am going to capture all of that too at some point.

Regarding the annotation section, it’s fine. I just think that a couple of paragraphs of text, indicated only by “Annotation:”, is not quite clear enough for students. So I suppose bold is enough. That is also why I wanted to have a divider above the “Document” declaration. It was not quite clear enough where the annotation ends and the document begins.
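
The bold label and the divider described above could be added with a small awk filter. This is only a sketch under stated assumptions: the function name `mark_sections` is hypothetical, and it assumes the labels appear at the very start of a line, spelled exactly “Annotation:” and “Document:”.

```shell
#!/bin/sh
# Hypothetical post-processing filter for one downloaded text file.
# Assumption: the labels start a line, exactly as "Annotation:" and "Document:".
mark_sections() {
    awk '
        # Make the Annotation label bold in Markdown.
        /^Annotation:/ { sub(/^Annotation:/, "**Annotation:**") }
        # Put a horizontal rule, surrounded by blank lines, above Document.
        /^Document:/   { print ""; print "---"; print "" }
        { print }
    '
}

# Example use, with a made-up file name:
# mark_sections < 'Some Document (4054).txt' > 'Some Document (4054).md'
```

The blank line before `---` matters in Markdown, because a `---` directly under a line of text would turn that line into a heading instead of producing a divider.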

The id in the filename is fine.

Just so you can get an idea of what I am doing overall here: I have a database (in Airtable at the moment) that has hundreds of sources. In the database I have the title, historical era, keywords, date, etc. in different fields. I want students to be able to filter and search for sources.

On my end I will be putting all these files into my DEVONthink database, and converting them to PDF through a custom CSS. As a historian, I am also hoping to use DEVONthink’s concordance and textual analysis features for some quantitative historical analysis.

Thanks again.

dfay · November 3, 2019, 4:01pm

Not to take the wind from your sails but you’re almost certainly going to be disappointed. See recent threads in the DT forums re its shortcomings for QDA.

Also, you should be aware that what you’re doing would appear to violate the website’s terms of service: Digital History Copyright. I was going to suggest you contact the site owners and see if you can get a copy of the data directly, and came across this notice.

Tjluoma · November 3, 2019, 4:16pm

I wrote a different script that is more specific to this site. You can find it here:

https://files.luo.ma/automators-talk/5900/digitalhistory.sh

You can run it either like this:

digitalhistory.sh 'http://www.digitalhistory.uh.edu/disp_textbook.cfm?smtID=3&psid=4082' \
'http://www.digitalhistory.uh.edu/disp_textbook.cfm?smtID=3&psid=4085' \
'http://www.digitalhistory.uh.edu/disp_textbook.cfm?smtID=3&psid=4084' \
'http://www.digitalhistory.uh.edu/disp_textbook.cfm?smtID=3&psid=4086' \
'http://www.digitalhistory.uh.edu/disp_textbook.cfm?smtID=3&psid=4087' \
'http://www.digitalhistory.uh.edu/disp_textbook.cfm?smtID=3&psid=4063'

to save the files by the URLs…

Or like this

digitalhistory.sh 3970 4064 3948 3962

to save by the ID number.
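
For readers curious how a script might accept either form of argument, here is a minimal sketch. It is not taken from digitalhistory.sh. The names `BASE_URL`, `arg_to_url` and `arg_to_id` are hypothetical, and the sketch assumes that every document URL follows the `smtID=3&psid=<ID>` pattern seen in the examples above.

```shell
#!/bin/sh
# Assumption: document URLs follow the pattern seen in this thread.
BASE_URL='http://www.digitalhistory.uh.edu/disp_textbook.cfm?smtID=3&psid='

# Print the document URL for an argument that is a bare ID or a full URL.
arg_to_url() {
    case "$1" in
        '' | *[!0-9]*) printf '%s\n' "$1" ;;              # not all digits: treat as a URL
        *)             printf '%s%s\n' "$BASE_URL" "$1" ;; # all digits: build the URL
    esac
}

# Print the numeric ID (the psid value) for either form of argument.
arg_to_id() {
    case "$1" in
        '' | *[!0-9]*)
            printf '%s\n' "$1" |
                sed -n 's/.*[?&]psid=\([0-9][0-9]*\).*/\1/p'
            ;;
        *) printf '%s\n' "$1" ;;
    esac
}
```

A loop over `"$@"` could then call `arg_to_url` to decide what to fetch and `arg_to_id` to build a file name that includes the ID.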

Mark_Robertson · November 3, 2019, 4:17pm

I will look into the DT issue.

Regarding the copyright issue, I am fully within the fair use doctrine as outlined by US law and Digital History’s guidelines (I have sat through so many hours of copyright law workshops and meetings that it’s crazy, though thankfully it gave me some confidence). Additionally, the TEACH Act has provided further exemptions for F2F educational settings (I am a college professor). Thanks for the heads-up, though!

Tjluoma · November 3, 2019, 4:19pm

There is a problem with the script. It seems to have a problem with non-ASCII characters.

For example, see https://files.luo.ma/automators-talk/5900/An%20Indian’s%20Views%20of%20Indian%20Affairs%20(4054).txt

and look near the top and you will see this

“… marked by the charming naivet� and tender pathos…”

That’s probably tr. It’s sometimes bad with UTF-8. I’ll see if I can fix that.

Also, the “Annotations” section is very wide, but the “Documents” section is not. That should be uniform. That’s just a matter of adjusting the lynx command when the script gets the Annotation section.
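
On the non-ASCII problem: a replacement character like the one above often appears when bytes in a legacy single-byte encoding are read as if they were UTF-8. Whether that is what happened here is an assumption, not something confirmed in this thread. If it is, converting the text explicitly with iconv is one way to test the theory. The function name `to_utf8` and the default of ISO-8859-1 are both hypothetical choices for illustration.

```shell
#!/bin/sh
# Convert text on stdin from an assumed legacy encoding to UTF-8.
# Assumption: the source is ISO-8859-1 unless another name is passed as $1.
to_utf8() {
    iconv -f "${1:-ISO-8859-1}" -t UTF-8
}

# Example use, with made-up file names:
# to_utf8 < raw-page.txt > page-utf8.txt
# to_utf8 WINDOWS-1252 < raw-page.txt > page-utf8.txt
```

If the converted file shows the accented letter correctly, the fix probably belongs at the point where the page is first read, rather than in any later text clean-up.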

dfay · November 3, 2019, 4:36pm

@Tjluoma out of curiosity why didn’t you use wget?

Tjluoma · November 3, 2019, 4:48pm

That’s a good question, actually.

wget would make it possible to suck down all of the URLs into a local collection of HTML files (something @Mark_Robertson may want to do in the future).

But for converting that HTML into plain text (which is the original stated purpose), we were going to need lynx or something like html2text.py.

Once we were getting into more fine-grained manipulation of the text, lynx seemed like the easier / better choice.

So, at that point, we definitely needed lynx and we did not need wget and since wget is not included in macOS by default (presumably for some license reason, I don’t know), I decided not to add another non-standard dependency.
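
Since the script relies on tools that may not be installed, a script like this could check for them up front. The sketch below is an illustration only, and the function name `missing_commands` is hypothetical. It uses `command -v`, which is part of POSIX sh.

```shell
#!/bin/sh
# Print the name of each required command that is not found on PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Example use at the top of a script:
# missing=$(missing_commands lynx)
# if [ -n "$missing" ]; then
#     echo "Please install: $missing" >&2
#     exit 1
# fi
```

Checking first gives a clear message instead of a "command not found" error partway through a long run.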

dfay · November 3, 2019, 5:10pm

Wow, I have presumably forgotten when I installed wget… maybe as part of MacPorts? It has been so long that I would have definitely said wget is installed by default, and lynx is not. I have only ever used lynx interactively, and probably not since 2007 or so, more on Cygwin than on a Mac.