Using Hazel and OCR to rename pdf scripts?

Hello all, am new to both Hazel and OCR, and wondering if what I’m about to ask is even possible (so this may need some ELI5ing)…

I download a lot of film and TV scripts as PDFs, which I keep stored for reference. There is not a consistent file naming structure going anywhere, and I’d like to get one going for myself as one of these:

‘Film Title - Writer( - draft date, where available).pdf’
‘TV Show - Episode Number - Writer( - draft date).pdf’

This info will always be available on the front page, which will always look like this:

Is it possible to use Hazel to grab this information, then use that info to rename the file? If so how would I do that?

Thank you in advance :slight_smile:

Yes, it should be possible, but it is not going to be a turn key solution. You will need to piece it together.

As long as you can get the raw text out from an OCR’d version of the PDF it seems you should have the data you need. Base on the partial example image, you seem to only need to pick out some of the starting lines given you say the format is pretty standard. Please note however that your format is varying and your partial example image does not indicate where episode numbers of dates would be obtained. So you will have some logic to build in around that - hence it isn’t going to be a turnkey solution.

There is an example of using PDF content to rename a file being processed by Hazel covered in this thread:

This sets out the crux of how to get some data out of your file to be able to process it to pick out the data you need to be able to file/rename it. Note given it looks like you might be using scanned copies of scripts as your source PDFs you might need to OCR first with one rule, and then use that to trigger a subsequent rule (e.g. by renaming the file with a prefix, adding a tag or setting a colour label that is then picked up by the next rule).

Based on the information in the thread you are going to have to take a bit of a step to pull out the data you need - if it even happens to be present. So hopefully you will be able to utilise more than just line number and the lines indicating episodes and dates be able to be matched using other means (you may need to learn about “regular expressions” if you don’t already know them).

While it is a yes, it is possible, it does have some technical complexity to it. But if you have a a lot of these to process, then automating it will of course be worth it.

So start with the referenced thread and see how far you get. You can always ask for further help on specific aspects and provide more comprehensive examples to make it possible for people on the forum to try it out with real example cover pages to test potential advice.

Hope that helps.

Hi @sylumer,

Thanks for this, I’ve had a look at the link you kindly shared and have been experimenting. I’ve given up on dates for now as I’d like to get the basics going first…

Taking my cue from this post here, I worked out that using pdftotext to pull the text from page 1 (post-OCR), and then using sed to pull individual lines, was probably my safest bet - so I’ve attempted to retrofit the code in this post. The pdftotext part works fine; however I cannot for love nor money get an AppleScript to pull the individual lines from the .txt file to tokenise them, even though the sed commands work fine in Terminal. Current end result: a load of files just being renamed ‘pdf’ :tired_face:

The logs don’t shed any light, as per Hazel the rule is running successfully. Ideally I’d eventually like to make each line its own token to allow for formatting differences, but I need to be able to tokenise one first! The answer is certainly in my (total inability to) code - where am I going wrong?

set itemPath to quoted form of POSIX path of theFile
tell application "Finder" to set fileName to name of theFile
set scriptFrontPage to do shell script "/opt/homebrew/Cellar/xpdf/4.05/bin/pdftotext -raw -simple2 -f 1 -l 1 " & itemPath
set scriptTitle to do shell script "echo " & quoted form of scriptFrontPage & " | sed -n '1'p"
set scriptEpisode to do shell script "echo " & quoted form of scriptFrontPage & " | sed -n '2'p"
set scriptWriter to do shell script "echo " & quoted form of scriptFrontPage & " | sed -n '3,4'p"
set fileName to do shell script "echo " & quoted form of scriptTitle
return {hazelExportTokens:{fileName}}