Automatically OCRing Documents with Hazel and PDFPen Pro

hazel

#1

Katie Floyd previously had a fabulous post on her blog detailing how to automatically OCR a PDF, using Hazel and PDFPen Pro. As her blog is no longer available I am sharing her wisdom here:

The first thing you need is a Hazel rule, how you set it up depends on your needs, this is mine:

The rule is very simple, it looks for anything that is a PDF, and runs an embedded AppleScript, documented below.

tell application "PDFpenPro"
    open theFile as alias
    tell document 1
        ocr
        repeat while performing ocr
            delay 1
        end repeat
        delay 1
        close with saving
    end tell
    quit
end tell

Full credit to David Sparks and Katie Floyd for originally providing the scripts and ideas!


Hazel OCR script help
#2

Thank you very much. The workflow is great. However, when using PDF Pen Pro 10 to perform OCR in long Spanish documents, it tends to get stuck (at least in my experience). In such cases, I have to use Adobe Acrobat. Could the workflow be modified to work with Acrobat for Mac?


#3

I have this same problem, with larger documents hanging at around 60% completed. I don’t own Adobe Acrobat, so I’d be especially interested in a way to automate OCRing a batch of pages – perhaps 30 at a time or so.


#4

Thanks for this. I use PDFPenPro and also DEVONthink Pro Office for OCR. Does anyone know which has the better OCR engine? Thanks.


#5

More expert scripters can correct me if I’m wrong, but in my understanding Acrobat does not let you manipulate OCR via AppleScript. You may be able to accomplish the goal using Keyboard Maestro that lets you script the UI.


#6

This is what I remember as well I’m afraid.


#7

In terms of the length issue…

  1. Has anyone logged this issue (including an example) with Smile?
    • Does this happen only when automating or when using the UI manually too?
  2. As a workaround, how about automating the splitting of large documents into smaller ones, OCRing them, and then recombining them back into one PDF?

#8

I haven’t compared the two that closely, but I think DEVONthink Pro Office does a better job, especially on documents that are fuzzy. DEVONthink Pro Office uses the Abbyy FineReader OCR engine/software, which is the OCR engine used by the Fujitsu ScanSnap. I use the FineReader Pro scanner app on iOS and it puts other scanner apps OCR to shame.


#9

Thanks for that info.


#10

I haven’t logged the issue with Smile. My bad. I would do it the next time it happens. Thanks for your suggestion.


#11

I’ve used a slightly-modified version of the script that only does the OCR if the PDF requires it (skips if the PDF already has a text layer) and doesn’t close PDFPenPro if another document is open:

tell application "PDFpenPro"
    open theFile as alias
-- does the document need to be OCR'd?
get the needs ocr of document 1
if result is true then
	tell document 1
		ocr
		repeat while performing ocr
			delay 1
		end repeat
		delay 1
		close with saving
	end tell
	--In PDFpen, when no documents are open, window 1 is "Preferences"
	--If other documents are open, do not close the App.
	if name of window 1 is "Preferences" then
		tell application "PDFpenPro"
			quit
			end tell
   		end if
	else
		-- Scan Doc was previously OCR'd or is already a text type PDF.
		tell document 1
			close without saving
		end tell
		--In PDFpen, when no documents are open, window 1 is "Preferences"
 		--If other documents are open, do not close the App.
		if name of window 1 is "Preferences" then
			tell application "PDFpenPro"
				quit
			end tell
		end if
	end if
end tell

#12

I offload this sort of processing to PDFPen Pro on my mac mini server and so always close the app when it is finished. When I open it on my Macbook Pro, it remembers what PDFs I was currently working with, so I don’t think the checking of how many files are open would work correctly for me if I was using it there. I figure it would almost always stay open.

Maybe checking for if PDFPen Pro is active prior to opening the file to OCR might be more reliable for those who have files automatically reopen in the app?

The checking if OCRing is necessary is a really nice enhancement. Whilst I think I only send non-OCR’d PDFs to be OCR’d this would negate any need to check on my part and I’m all for that.


#13

@jmreekes and @RosemaryOrchard, is there an automation like this one that will use DevonThink Pro Office to do the OCRing instead of PDFpen Pro?


#14

I used AppleScript to script the UI in Acrobat X in order to OCR a file.

tell application "Adobe Acrobat Pro"
activate
set theFile to "Macintosh HD:Users:Jim:Documents:img0005.pdf"
open theFile as alias
delay 2
activate
tell application "System Events"
	tell application process "Acrobat"
		-- run action to OCR
		click menu item "OCR_file" of menu 1 of menu item "Action Wizard" of menu 1 of menu bar item "File" of menu bar 1
		delay
		-- perform the OCR
		click button "Next" of window 1
		
		set OCRdone to false
		repeat while OCRdone is false
			try
				set OCRdone to true
				click button "Close" of window 1 -- will create error unless OCR is done
			on error
				set OCRdone to false
			end try
		end repeat
	end tell
	keystroke "w" using command down
end tell
quit
end tell

Hazel OCR script help
#15

Hey Folks,

I wanted to mention that Prizmo with the Pro Pack (also available with a SetApp sub) has hooks that Automator can access. I have a similar setup to the ones described above: Hazel watches my Downloads folder, looking for image-only PDFs, and auto OCRs the files. This workflow requires no scripting–perhaps of interest to folks out there, like me, who can’t code, yet!


#16

Thanks for sharing this. I just set this up and tried it out on a couple documents, and the documents I got back looked to be the OCR layer only (i.e., the visible appearance of the document was changed). Do you know if there is a way to get a “normal” OCR’d document via Prizmo and Automator, with the visible part of the PDF appearing unchanged, but containing the invisible OCR layer?

EDIT: serves me right for not digging into this more before posting. There is a setting in the Automator workflow to use "PDF (Image + Searchable Text)


#18

Does anyone have an idea how something similar could be achieved (with Hazel) for Business Cards, where ideally the scan is ocr’ed and then the contact information is added to my contacts and the scan is archived?

Many Thanks.


#19

Sometimes I face the same issue. Next time it happens to me, I will also report it to Smile.