Using Hazel and OCR to rename pdf scripts?

Hello all, am new to both Hazel and OCR, and wondering if what I’m about to ask is even possible (so this may need some ELI5ing)…

I download a lot of film and TV scripts as PDFs, which I keep stored for reference. There is not a consistent file naming structure going anywhere, and I’d like to get one going for myself as one of these:

‘Film Title - Writer( - draft date, where available).pdf’
‘TV Show - Episode Number - Writer( - draft date).pdf’

This info will always be available on the front page, which will always look like this:

Is it possible to use Hazel to grab this information, then use that info to rename the file? If so how would I do that?

Thank you in advance :slight_smile:

Yes, it should be possible, but it is not going to be a turn key solution. You will need to piece it together.

As long as you can get the raw text out from an OCR’d version of the PDF it seems you should have the data you need. Base on the partial example image, you seem to only need to pick out some of the starting lines given you say the format is pretty standard. Please note however that your format is varying and your partial example image does not indicate where episode numbers of dates would be obtained. So you will have some logic to build in around that - hence it isn’t going to be a turnkey solution.

There is an example of using PDF content to rename a file being processed by Hazel covered in this thread:

This sets out the crux of how to get some data out of your file to be able to process it to pick out the data you need to be able to file/rename it. Note given it looks like you might be using scanned copies of scripts as your source PDFs you might need to OCR first with one rule, and then use that to trigger a subsequent rule (e.g. by renaming the file with a prefix, adding a tag or setting a colour label that is then picked up by the next rule).

Based on the information in the thread you are going to have to take a bit of a step to pull out the data you need - if it even happens to be present. So hopefully you will be able to utilise more than just line number and the lines indicating episodes and dates be able to be matched using other means (you may need to learn about “regular expressions” if you don’t already know them).

While it is a yes, it is possible, it does have some technical complexity to it. But if you have a a lot of these to process, then automating it will of course be worth it.

So start with the referenced thread and see how far you get. You can always ask for further help on specific aspects and provide more comprehensive examples to make it possible for people on the forum to try it out with real example cover pages to test potential advice.

Hope that helps.

Hi @sylumer,

Thanks for this, I’ve had a look at the link you kindly shared and have been experimenting. I’ve given up on dates for now as I’d like to get the basics going first…

Taking my cue from this post here, I worked out that using pdftotext to pull the text from page 1 (post-OCR), and then using sed to pull individual lines, was probably my safest bet - so I’ve attempted to retrofit the code in this post. The pdftotext part works fine; however I cannot for love nor money get an AppleScript to pull the individual lines from the .txt file to tokenise them, even though the sed commands work fine in Terminal. Current end result: a load of files just being renamed ‘pdf’ :tired_face:

The logs don’t shed any light, as per Hazel the rule is running successfully. Ideally I’d eventually like to make each line its own token to allow for formatting differences, but I need to be able to tokenise one first! The answer is certainly in my (total inability to) code - where am I going wrong?

set itemPath to quoted form of POSIX path of theFile
tell application "Finder" to set fileName to name of theFile
set scriptFrontPage to do shell script "/opt/homebrew/Cellar/xpdf/4.05/bin/pdftotext -raw -simple2 -f 1 -l 1 " & itemPath
set scriptTitle to do shell script "echo " & quoted form of scriptFrontPage & " | sed -n '1'p"
set scriptEpisode to do shell script "echo " & quoted form of scriptFrontPage & " | sed -n '2'p"
set scriptWriter to do shell script "echo " & quoted form of scriptFrontPage & " | sed -n '3,4'p"
set fileName to do shell script "echo " & quoted form of scriptTitle
return {hazelExportTokens:{fileName}}

Hi

Just a thought on this.

Given the inconsistency of the formatting for the script naming etc. An alternative suggestion, if you have an AI LLM service, would be to insert a step into the process that asks the AI model (using shortcuts or python depending on the model) to mine the script for the data you are looking for and then format it into:

‘Film Title - Writer( - draft date, where available).pdf’
‘TV Show - Episode Number - Writer( - draft date).pdf’

Using a prompt that looks like –

“ examine the first three pages of this PDF. It is a film or TV show script. Extract from these pages the title of the film or TV show, the author or writer of the script, the episode number and where available the date it was drafted. Once you have this information format it as follows and return this as text in a single string.

Film Title - Writer( - draft date, where available).pdf“

In my experience with the AI models this won’t be 100% consistent but may be better than REGEx or other methods depending on the variance between the script naming conventions.

Hey @JSW, thanks for this suggestion - and apologies, as you’ve stepped on a landmine while trying to help!

Anything AI is an absolute no-go for this. There were some strikes about it a couple years ago, you might recall :slight_smile: These scripts are generally uploaded in good faith as library references for writers such as myself to read, and invariably with the explicit stipulation that they not be fed into AI/LLMs. I’m being very careful about how I OCR them to the same ends (though generally I’m finding they already have been OCRed).

So while I’m sure it would make this particular task much easier - it will have to be the long, tedious trial-and-error ‘old school’ route. However much it makes me bang my head against a wall in the meantime!

For anyone interested and as I was asked before, the BBC Script Library has plenty of script examples to see formatting. (And also an example of such stipulations at the bottom of that page :wink: )