Convert PDF to Images and OCR

AgingKeeper · June 24, 2022, 3:24am

I have found some PDF documents that are scanned copies of WW2-era US Navy deck logs. My grandfather was a member of the crew. The documents are mostly typed on typewriters. I would like to OCR the documents. I could purchase third-party PDF apps, but first I wanted to see if I could use Live Text to do the same job.

It seems to me that I must make each page an image file for Live Text to work. Once I have a picture, I can then use the Extract Text from Image action. At that point I am not certain how to create a PDF from the photo and text.

Just wondering if the Automators team had run into this?

Thanks!

Ferrers · June 24, 2022, 7:44pm

Not an automated solution, but what I’d do is upload the PDF to Google Drive (you need a Gmail account) and then open it with Google Docs. This’ll do the OCR, and it’s very good. Then download the Google Doc as a PDF (the text will be selectable) or as a .docx or .rtf file that can be opened by Apple’s Pages.

AgingKeeper · June 26, 2022, 1:51am

This did not even occur to me. Thanks so much! I am testing it now.

sebastienkb · June 28, 2022, 1:07am

In case someone else needs an offline method that fits into an automation pipeline, I’ve had good results with ocrmypdf. All it does is add an invisible text layer to the PDF pages so the text is selectable.

I used this to add OCR to documents my mom regularly has to deal with. It’s a command-line tool, so I wrote a small Automator flow that asks her for the file on macOS, grabs that path, builds a second path that is the same but with “-withText.pdf” at the end, runs the Terminal command with those paths as origin and destination, and finally plays a sound.

Now every time she needs that she just opens that Automator icon.

Note: I know there’s Shortcuts now, but her 2012 Mac mini is capped at Catalina, so no Shortcuts allowed.
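
For anyone who wants to script the same steps outside Automator, here is a rough Python sketch of the flow described above. It is not the original Automator workflow, just an approximation under a few assumptions: ocrmypdf is installed and on the PATH, the output name replaces the ".pdf" extension with "-withText.pdf" (the original may simply have appended it), and the sound is played with macOS's afplay using one of the stock system sounds. The function names are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path

# Assumed sound file; any file afplay can read would work here.
DONE_SOUND = "/System/Library/Sounds/Glass.aiff"


def output_path_for(src):
    """Return a sibling path ending in '-withText.pdf' (naming is an assumption)."""
    return src.with_name(src.stem + "-withText.pdf")


def build_command(src, dst):
    """Basic ocrmypdf call: input PDF first, output PDF second."""
    return ["ocrmypdf", str(src), str(dst)]


def add_text_layer(src):
    """Run ocrmypdf on src and play a sound on success; returns the exit code."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    dst = output_path_for(src)
    result = subprocess.run(build_command(src, dst))
    # afplay only exists on macOS, so skip the sound elsewhere.
    if result.returncode == 0 and shutil.which("afplay"):
        subprocess.run(["afplay", DONE_SOUND])
    return result.returncode
```

Calling add_text_layer(Path("deck-log.pdf")) would try to write deck-log-withText.pdf next to the original. In Automator, the same function could be driven from a Run Shell Script action that receives the chosen file as an argument.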