It’s just a simple list of links, and on the other end a simple text document. I have been just grabbing the text via Shortcuts using regex and markdown. But this requires me to visit each page and execute the Shortcut - rinse, repeat.
Well, I just took a look at that application. Seems easy enough, perhaps. I just don’t know all the syntax. Would it be able to distinguish between navigation links and content? What I mean is that the website has a whole sidebar with all kinds of navigation links, but the main page has all the links to the content I need. Is it still possible?
Wow. That was effective. Thank you. The first link to the files gave me a forbidden error. But the second (lynx) worked perfectly. The script is also well commented. Thanks. I have not had time yet to dig that deep into these things. Too busy teaching! But Shortcuts and Keyboard Maestro were easy entries.
Now I just need to clean up the markdown and remove a few elements - which I should be able to do using regex and replace - and place the elements into Markdown metadata fields.
I will be removing all links (students tend to click on links regardless) and moving the “Digital History ID#” into the metadata. I will also be putting the title into an H1 and the author and date into separate H2s, italicizing the “Annotation:” paragraphs, and putting a divider above the “Document:” paragraph.
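For anyone wanting to script that cleanup rather than do it by hand, here is a minimal Python sketch of three of those steps: stripping links, pulling out the ID, and adding the divider. It assumes the ID sits on its own line as “Digital History ID” followed by a number and that “Document:” starts a line; the function names are made up for illustration, and the real files may need different patterns.

```python
import re


def strip_links(text):
    """Replace Markdown links of the form [label](url) with just the label."""
    return re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)


def extract_id(text):
    """Remove an assumed 'Digital History ID 1234' line and return (id, text)."""
    match = re.search(r"^Digital History ID#?[ \t]*(\d+)[ \t]*$", text, flags=re.MULTILINE)
    if not match:
        return None, text
    return match.group(1), text[:match.start()] + text[match.end():]


def clean(text):
    """Strip links, pull out the ID, and put a divider above 'Document:'."""
    text = strip_links(text)
    doc_id, text = extract_id(text)
    # Insert a Markdown horizontal rule above the first "Document:" line.
    text = re.sub(r"^Document:", "---\n\nDocument:", text, count=1, flags=re.MULTILINE)
    # Collapse the extra blank lines left behind by the removed ID line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return doc_id, text
```

The returned ID can then be written into whatever metadata block the target app expects; that part is left out because the metadata format was not specified.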
You have just saved me hours of work. Thank you again.
No. This would probably not work, but some sources have “Title”, “Author:” and “Date”. But others don’t have an author listed. Thanks again. More than I was looking for but I am sure glad it is going to save me that much time.
If no, let me know what you’d like to see changed. Pretty much any of those elements can be moved, removed, or reformatted.
(Some of the formatting in the original documents is lacking, and we can’t really fix that but most other stuff is possible.)
I’m headed to bed, but I’ll check back tomorrow.
ps - the biggest issue that I see is that if the “Annotation” is more than one paragraph, I don’t think the italics will work. Now, if we were making HTML, we could wrap the whole thing in a <div id=annotation> and then use CSS to tell it to italicize the whole thing. There may be a way around that, though. I’ll have a think on it in the morning.
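One possible way around the multi-paragraph problem, sketched in Python rather than in the script itself: wrap each paragraph of the annotation in its own pair of asterisks, since Markdown emphasis does not carry across a blank line. The function name is hypothetical, and the sketch assumes paragraphs are separated by blank lines and contain no asterisks of their own.

```python
import re


def italicize_paragraphs(block):
    """Wrap every blank-line-separated paragraph in *...* individually."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", block) if p.strip()]
    return "\n\n".join("*" + p + "*" for p in paragraphs)
```

For example, `italicize_paragraphs("First para.\n\nSecond para.")` gives two separately italicized paragraphs.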
pps - I also put the ID number in the filename because I thought that might be useful for you later on if you need to go back and find the original, but I can easily remove that too if it’s not helpful.
What changes did you make to the script, so that I can run this myself in the future? There are actually dozens of pages that have court cases, documents, etc., so I am going to capture all of that too at some point.
Regarding the annotation section, it’s fine. I just think that a couple of paragraphs of text, indicated only by “Annotation:”, is not quite clear enough for students. So I suppose bold is enough. That is also why I wanted to have a divider above the “Document” declaration. It was not quite clear enough where the annotation ends and the document begins.
The id in the filename is fine.
Just so you can get an idea of what I am doing overall here: I have a database (in Airtable at the moment) that has hundreds of sources. In the database I have the title, historical era, keywords, date, etc. in different fields. I want students to be able to filter and search for sources.
On my end I will be putting all these files into my DEVONthink database, and converting them to PDF through a custom CSS. As a historian, I am also hoping to use DEVONthink’s concordance and textual analysis features for some quantitative historical analysis.
Not to take the wind from your sails but you’re almost certainly going to be disappointed. See recent threads in the DT forums re its shortcomings for QDA.
Also, you should be aware that what you’re doing would appear to violate the website’s terms of service: Digital History Copyright. I was going to suggest you contact the site owners and see if you can get a copy of the data directly, and came across this notice.
Regarding the copyright issue, I am fully within the fair use doctrine as outlined by US law and Digital History’s guidelines (I have sat through so many hours of copyright law workshops and meetings it’s crazy - though thankfully it gave me some confidence). Additionally, the TEACH Act has provided further exemptions for F2F educational settings (I am a college professor). Thanks for the heads up though!
“… marked by the charming naivet� and tender pathos…”
That’s probably tr. It’s sometimes bad with UTF-8. I’ll see if I can fix that.
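A replacement character like that one typically appears when bytes in a single-byte encoding such as Latin-1 get read as UTF-8. Whether that is what happened here is a guess; the following Python sketch only illustrates the symptom and one common workaround, which is to try UTF-8 first and fall back to Latin-1. The function name is made up for illustration.

```python
def decode_page(raw):
    """Decode bytes as UTF-8, falling back to Latin-1 if that fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never raises.
        return raw.decode("latin-1")


# The Latin-1 byte for "é" is not valid UTF-8 on its own, so a lenient
# UTF-8 decode substitutes U+FFFD, the character seen in the quote above.
garbled = b"naivet\xe9".decode("utf-8", errors="replace")
repaired = decode_page(b"naivet\xe9")
```

Here `garbled` ends in the replacement character, while `repaired` is “naiveté”.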
Also, the “Annotations” section is very wide, but the “Documents” section is not. That should be uniform. That’s just a matter of adjusting the lynx command when the script gets the Annotation section.
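For reference, the width of a lynx dump is controlled by its `-width` option, so giving both sections the same value should make them uniform. The sketch below is not the script from this thread; it is a small Python wrapper, with hypothetical function names and an assumed default of 80 columns, showing the options involved (`-dump` to print the rendered page, `-nolist` to omit the list of links at the end).

```python
import subprocess


def lynx_dump_command(url, width=80):
    """Build the lynx argument list for a plain-text dump at a fixed width."""
    return ["lynx", "-dump", "-nolist", "-width=" + str(width), url]


def dump_text(url, width=80):
    """Run lynx (which must be installed) and return the dumped text."""
    result = subprocess.run(lynx_dump_command(url, width), capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `dump_text` with the same `width` for the Annotation and Document parts would keep the line lengths consistent.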
wget would make it possible to suck down all of the URLs into a local collection of HTML files (something @Mark_Robertson may want to do in the future).
But for converting that HTML into plain text (which is the original stated purpose), we were going to need lynx or something like html2text.py.
Once we were getting into more fine-grained manipulation of the text, lynx seemed like the easier / better choice.
So, at that point, we definitely needed lynx and we did not need wget, and since wget is not included in macOS by default (presumably for some license reason, I don’t know), I decided not to add another non-standard dependency.
Wow I presumably have forgotten when I installed wget…maybe as part of macports? Been so long I would have definitely said wget is installed by default, and lynx is not. I have only ever used lynx interactively and probably not since 2007 or so – more on cygwin than on a Mac.