Extracting text from a PDF and exporting to a delimited file

kerjsmit · September 13, 2018, 6:32pm

Hi everyone. I’ve searched around for avenues toward this, but I can’t find anything. I’m putting this request in macOS, but if it’s possible in iOS, I’m cool with that, too.

So here’s what I’m trying to do. I regularly have to read PDFs (always OCRed) and find proper names and the page numbers they appear on, then export, copy and paste, or (the horror!) type the names and the page numbers into a delimited file (text, Excel, etc.)—for example, as follows:

Tolkien,12
Tolkien,12
Tolkien,13
Tolkien,14
Lee,14
Smith,14
Roberts,14
Smith,14
Roberts,14
[etc.]

What I would like is a way to automate this. My limited brain bandwidth has conceived of the following options, but I’m sure there are more:

(1) search the PDF (presumably using regular expressions), locate capitalized words (which of course can’t be differentiated from proper names), and export them to a delimited file with the page number they appear on;

(2) search the PDF (presumably using regular expressions), locate capitalized words, and underline or highlight or in some other way differentiate them. PDF Expert can produce a delimited report from highlights and/or underlines, which is the extremely helpful option I already use for other aspects of my work.

My current (sad) workflow is as follows: Using PDF Expert on my iPad Pro, as I read the file for other purposes, I underline the proper names I come across. PDF Expert then allows me to export a report of the words I’ve underlined, along with the page number. This is wonderful, but I would very much like to automate this process as much as possible so that I can focus on my other responsibilities vis-à-vis the document in question.

Is this even possible? I would be more than happy to pay for software that would make this doable, because it would save me untold hours. For the past couple of days I’ve thought hard about somehow using Keyboard Maestro, PDF Expert, Skim, Hazel, PDFpen, Adobe, Automator, AppleScript, and/or any other combination of tools to make this happen, but I’m at a loss.

I would deeply appreciate any insights you might have.

Many thanks,
Kerry

kerjsmit · September 13, 2018, 6:38pm

I should add this: Even if the desired text can’t be exported, it would be greatly helpful if it could be underlined or highlighted or in some other way clearly marked.

dfay · September 14, 2018, 3:52am

Is there a finite / fixed list of names you’re looking for? Or are there likely to be new names in each document?

Ben_Lincoln · September 14, 2018, 7:02am

How structured is this text page? You can maybe just brute force some regular expressions.

Otherwise, you could write a script in the language of your choice that takes all the text and iterates over it, checking each word against a database of known names, probably with some regex in there for good measure. Alternatively, you could strip out every word that is not a name.

You could also use an off-the-shelf machine learning solution to analyse the document and look for names, which is something I think the IBM Watson Discovery service can do.

kerjsmit · September 14, 2018, 1:52pm

Thanks for the suggestions, Ben!

kerjsmit · September 14, 2018, 1:54pm

The names will be different in each document.

rlivingston · September 15, 2018, 10:59am

You will be looking for all the capitalized words in the document. The regular expression for this is something like

(\b[A-Z][A-Za-z]{2,20}\b)

This will find names like Robert, Paul, McKeskey, and VanRecklinghouse. It assumes that names are between 3 and 21 letters long: one capital followed by 2 to 20 more letters.

But, as you allude to, the pattern is not specific to actual people’s names: NATO, January, April, and words at the start of sentences will also match. [This; We; But from the previous couple sentences.]

You have not really told us how the documents that you work with are actually structured. But assuming the basic worst case scenario that these documents have the general character of a book or newspaper, false positives will actually dominate. Most of these regex matches will not actually be names. Assume also (as it would be in a book or newspaper) that context is important in determining if a word is actually a name. For example, “The Trojan War started on April 2 in Washington.” Without context, how are you to know whether Trojan, War, April, Washington are people’s names or not?
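To make the false-positive problem concrete, here is a minimal sketch that runs the pattern from the top of this reply over the example sentence. The function name capitalized_words is only an illustration, and the capturing parentheses are dropped because they do not change what is matched.

```python
import re

# Pattern from earlier in this reply: a capital followed by 2 to 20 more letters.
CAPITALIZED = re.compile(r"\b[A-Z][A-Za-z]{2,20}\b")

def capitalized_words(text):
    """Return every capitalized word in text, in order of appearance."""
    return CAPITALIZED.findall(text)

sample = "The Trojan War started on April 2 in Washington."
print(capitalized_words(sample))
# prints ['The', 'Trojan', 'War', 'April', 'Washington']
```

None of the five matches is a person's name, which is why some human review is hard to avoid.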

So I would approach this as a problem of extracting the text with the page numbers from the PDF. I happen to use PDFpenPro, but I imagine that PDF Expert would have a similar feature. You have to make sure that you have the page number.

In PDFpenPro, you can create headers for your PDF that contain the page number. In PDFpenPro, those headers can take the format of left, center, right. I would put page__ on left, the page number in center.

Then I would select the entire document and paste it into a text editor such as BBEdit or TextEdit. This process will create a text file in which every line of the PDF becomes a line in the text editor, terminated by a \n (new line). Crucially, the page numbers will be included. This will be an easy thing to process with a program.

It will look something like

page__1
CHAPTER 1. Loomings.
Call me Ishmael. Some years ago—never mind how long precisely—having little or
no money in my purse, and nothing particular to interest me on shore, I thought I would
sail about a little and see the watery part of the world. It is a way I have of driving off the
spleen and regulating the circulation. Whenever I find myself growing grim about the
mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself
involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral
page__2
I meet; and especially whenever my hypos get such an upper hand of me, that it requires
a strong moral principle to prevent me from deliberately stepping into the street, and
methodically knocking people’s hats off—then, I account it high time to get to sea as
soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato
throws himself upon his sword; I quietly take to the ship. There is nothing surprising in
this. If they but knew it, almost all men in their degree, some time or other, cherish very
nearly the same feelings towards the ocean with me.
page__3
etc.

The regex pattern provided at the top of this reply will find:
CHAPTER; Loomings; Call; Ishmael; Some; Whenever; November; This; With; Cato; There and so on. (Two-letter words such as It and If are skipped, because the pattern requires at least three letters.)
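As a rough sketch of how a program could keep track of the page while scanning, the code below assumes the pasted text marks each page with a line of the form page__N, as in the excerpt above. The names PAGE_MARKER and candidates_by_page are made up for illustration.

```python
import re

# Assumed marker format: a line containing only "page__" plus the page number.
PAGE_MARKER = re.compile(r"^page__\s*(\d+)\s*$")
CAPITALIZED = re.compile(r"\b[A-Z][A-Za-z]{2,20}\b")

def candidates_by_page(text):
    """Yield (word, page) for every capitalized word in the pasted text."""
    page = None  # stays None until the first marker line is seen
    for line in text.splitlines():
        marker = PAGE_MARKER.match(line)
        if marker:
            page = int(marker.group(1))
            continue
        for word in CAPITALIZED.findall(line):
            yield word, page

sample = """page__1
CHAPTER 1. Loomings.
Call me Ishmael. Some years ago
page__2
With a philosophical flourish Cato
throws himself upon his sword"""

for word, page in candidates_by_page(sample):
    print(f"{word},{page}")
```

Each printed line is a candidate in the word,page form; deciding which candidates are really names is a separate step.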

Then frankly, I would write code in a language that has regex capabilities built in, to just march through the document highlighting each of these capitalized words in context and providing the user with two buttons to press: Is A Name and Is Not a Name. A very simple interface. The program would know what the page was and every time the user clicked on Is A Name, that capitalized word would be appended to another text file in the fashion that you want:

Ishmael, 1
Cato, 2

The task could be made considerably less onerous by giving the program some elementary intelligence so as to skip over things that are NOT names and commonly occur at the beginning of sentences: Some, It, This, There etc. You could give the program an additional button, This Is Never a Name, and it could quickly learn and remember these common false positives.
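The review step with its three buttons could be sketched roughly as follows. This is only an illustration under assumptions: the buttons are replaced by a decide function that returns "name", "not", or "never", and the names review, to_csv, and ask_in_terminal are hypothetical.

```python
import csv
import io

def review(candidates, decide, never_names=None):
    """Filter (word, page) candidates through a user decision.

    decide(word, page) stands in for the three buttons and must return
    "name", "not", or "never". Words answered "never" are remembered in
    never_names and skipped without asking from then on.
    """
    never_names = set() if never_names is None else never_names
    accepted = []
    for word, page in candidates:
        if word.lower() in never_names:
            continue
        answer = decide(word, page)
        if answer == "name":
            accepted.append((word, page))
        elif answer == "never":
            never_names.add(word.lower())
    return accepted, never_names

def to_csv(rows):
    """Render accepted rows as name,page lines."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()

def ask_in_terminal(word, page):
    """Hypothetical console stand-in for the buttons."""
    prompt = f"{word} (page {page}) [n]ame / [x] not a name / ne[v]er a name? "
    return {"n": "name", "v": "never"}.get(input(prompt).strip().lower(), "not")
```

Calling review(candidates, ask_in_terminal) would ask about each word in turn. Saving never_names to a file between runs would let the program keep what it has learned.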

This is not a complex program to write. But it does require knowledge of some programming language. I do not think that trying to do this in Keyboard Maestro/Automator etc. would get you even close to an efficient workflow.

If you are spending hours doing this almost mindless task, getting such a program written has to be worth it.

Ben_Lincoln · September 15, 2018, 12:22pm

So I implemented your suggestion, poorly, in Python, using Pythonista in bed, so this is a solution that runs on an iPad!

I am still very early in teaching myself Python, and this was thrown together very quickly. I plan to rewrite it in Java (which I write professionally) and add all the missing bits, like data persistence, but that may take me a while to get around to.

First up, you should really never run scripts from randos on the internet without understanding how they work; it can get you in a lot of trouble! That said, here is a script from a rando on the internet that does what you want. It is hosted on my GitHub.

The quick rundown is that we import re to run the regular expressions.

We set up an array of words to remove from the output; case does not matter for these words.

We set up a basic text UI loop.

We ask the user for input; if they want to exit, we exit.

Otherwise, we treat the incoming text as something to look at.

We compile the regex and run it to get a list of just the matching words. We then loop through the list of words to remove and the matched list, removing items that should not have been matched.

Then we print the results.

I re-wrote the suggested regex to not have the 20 char limit because I failed to understand why that would be useful.

Just run the script and paste in the pages one at a time to get the name count for each page. To make an ignored word persist, add it to the wordsToIgnore array.
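The script itself is on GitHub and is not reproduced in this thread. Purely as a sketch of the steps listed above, and not the actual code, it might look something like this. Apart from re and wordsToIgnore, every name here (find_names, main, the prompt text, the sample ignore words) is an assumption.

```python
import re

# Words to drop from the output; compared without regard to case.
wordsToIgnore = ["the", "some", "it", "this", "there"]

def find_names(text):
    """Return capitalized words in text, minus anything in wordsToIgnore."""
    pattern = re.compile(r"[A-Z][a-z]+")
    ignored = {word.lower() for word in wordsToIgnore}
    return [word for word in pattern.findall(text) if word.lower() not in ignored]

def main():
    """Basic text UI loop: enter text, or type exit to quit."""
    while True:
        text = input("Paste text (or type exit): ")
        if text.strip().lower() == "exit":
            break
        names = find_names(text)
        print(f"{len(names)} candidate names: {names}")

# Call main() to start the loop.
```

Note that input() reads a single line, so a multi-line page pasted into this sketch would be handled one line at a time.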

kerjsmit · September 15, 2018, 2:44pm

Thank you so much @rlivingston and @Ben_Lincoln! I really appreciate your help with this.

rlivingston · September 15, 2018, 6:01pm

Ben has substituted the regex:

[A-Z][a-z]+

I would agree that the 20 letter limit is somewhat arbitrary and not too useful, but Ben’s solution introduces the following problem. Names like McDonald get split up into ‘Mc’ and ‘Donald’. Perhaps this pattern is better.

\b[A-Z][A-Za-z]*[a-z]\b

This pattern insists that the first letter be a capital and the last letter be lower case. Any other letter can be capitalized or not. All words that meet these criteria will be flagged.

So McDonald and Al and Gore are all successful.

A and eBook and iPad and NATO will not match.

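The \b is in the pattern to insist that these are individual words and not some capital lingering in a word. The trailing \b is not needed for ordinary words, which end in a lowercase letter anyway, so the pattern below usually gives the same results. It is not strictly equivalent, though: without the trailing \b, an oddly capitalized word such as McDONALD would yield the fragment Mc.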

\b[A-Z][A-Za-z]*[a-z]

In some text containing eBook and iPad, Ben’s implementation will return Book and Pad in the names list.
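For anyone who wants to check this, here is a small sketch that runs the three patterns from this thread over one sentence. The labels and the sample sentence are made up for the comparison.

```python
import re

PATTERNS = {
    "original": re.compile(r"\b[A-Z][A-Za-z]{2,20}\b"),
    "ben": re.compile(r"[A-Z][a-z]+"),
    "revised": re.compile(r"\b[A-Z][A-Za-z]*[a-z]\b"),
}

def compare(text):
    """Return the matches of each pattern, keyed by label."""
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}

sample = "McDonald and Al Gore read an eBook about NATO on an iPad."
for label, matches in compare(sample).items():
    print(label, matches)
# original ['McDonald', 'Gore', 'NATO']
# ben ['Mc', 'Donald', 'Al', 'Gore', 'Book', 'Pad']
# revised ['McDonald', 'Al', 'Gore']
```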

P.S. In Python (where I am also a novice) it is useful to put the letter r before the string so it gets treated as a raw string. Otherwise the \b at the start of the pattern will be interpreted as a backspace character rather than a word boundary. So Ben, in your code, it might be better to use

pattern = re.compile(r'[A-Z][a-z]+')

rather than

pattern = re.compile('[A-Z][a-z]+')
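A short sketch of what goes wrong without the r prefix. For [A-Z][a-z]+ on its own the two forms behave the same, since it contains no backslash; the difference appears once \b is added. The variable names plain and raw are only for illustration.

```python
import re

plain = "\b[A-Z][A-Za-z]*[a-z]"   # "\b" in a normal string is a backspace character
raw = r"\b[A-Z][A-Za-z]*[a-z]"    # r"\b" keeps the backslash, so re sees a word boundary

print(plain[0] == "\x08")                     # True: the pattern starts with a backspace
print(re.findall(plain, "Call me Ishmael"))   # [] because the text contains no backspace
print(re.findall(raw, "Call me Ishmael"))     # ['Call', 'Ishmael']
```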

P.P.S.

Certain foreign names are complicated to handle. Suffice it to say that, with the patterns above, Charles de Gaulle would be captured as Charles and Gaulle, and the ‘de’ would be lost. With customized programming and a fair amount of effort, you could address this issue, but I would just handle it with human intervention.

https://academia.stackexchange.com/questions/15326/how-to-deal-with-particles-in-a-last-name-in-a-reference-list

One thing to keep in mind is that there is a substantial difference in several continental languages between uppercase and lowercase versions of a last name: it is wrong to write “de Martino” if the person’s last name is normally written “De Martino.” This is a historical artifact, where the use of the capital letter indicates nobility, while the lowercase letter denotes a more traditional relationship. Similar rules apply to “von” in German and “van” in Dutch, but not to “de” in French or Spanish.

Therefore, when capitalized, the particle should always be treated as part of the last name. If lowercase, you can treat it as a suffix that goes after the first name. The exception are names like “de Gaulle” where “de” is followed by a one-syllable name.

So, it’s:

Beethoven, Ludwig van

Clausewitz, Carl von

de Gaulle, Charles

Di Martino, Emilia

Martino, Emilia di

Maupassant, Guy de

Van Allen, James
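If someone did want to try the programming route, one possible starting point is to let the pattern keep a lowercase particle attached to the capitalized word that follows it. This is only a sketch: the PARTICLES list is an assumed, incomplete sample, and the function name is made up.

```python
import re

# Assumed, incomplete list of lowercase particles; extend as needed.
PARTICLES = ("de", "di", "van", "von")

particle_group = "|".join(PARTICLES)
NAME_WITH_PARTICLE = re.compile(
    r"\b(?:(?:" + particle_group + r")\s+)?[A-Z][A-Za-z]*[a-z]\b"
)

def find_candidates(text):
    """Return capitalized words, keeping a directly preceding lowercase particle."""
    return NAME_WITH_PARTICLE.findall(text)

print(find_candidates("Charles de Gaulle met Ludwig van Beethoven."))
# prints ['Charles', 'de Gaulle', 'Ludwig', 'van Beethoven']
```

This does not decide where the particle belongs in the final list, and it will also pick up phrases such as "de Paris" in running French text, so a human still has to review the output.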

Ben_Lincoln · September 15, 2018, 8:55pm

@rlivingston, that is some great feedback. I have opened some issues on the repo to deal with these points; they can be found here