Hazel Extract from / Parse HTML

Jonatan · April 21, 2021, 5:49pm

Hello all,

I have an HTML file from which I want to extract some information to rename the document.
Sadly I can’t just match the content, as I can only reliably find it in relation to its HTML tag.
It seems to me like there is no field in Hazel for me to access the raw file content.

I tried using JavaScript to access the XPath of the element I want, but couldn’t get that to work.
I then found a shell application that did what I wanted, but sadly “Run shell script” doesn’t allow a return value to be passed back to Hazel.
I wanted to work around that by using an AppleScript as an intermediary, but my script isn’t working because I can’t figure out what data type the “theFile” value that Hazel passes into the script has.

This is the script I was testing with:

on hazelMatchFile(theFile, inputAttributes)
	set title to do shell script "cat " & name of theFile & " | /Users/Jonatan/go/bin/pup 'html body table tbody tr td div:nth-child(1) table tbody tr:nth-child(4) td table tbody tr:nth-child(3) td table tbody tr:nth-child(3) td:nth-child(2) span:nth-child(1) text{}'"
	
	return {hazelPassesScript:true, hazelOutputAttributes:{title}}
end hazelMatchFile

I have no idea if I am on the right path here and am now stuck after having tried everything that came to my mind, so I would greatly appreciate any suggestions on how I could parse information from an HTML file.

Tjluoma · April 21, 2021, 11:27pm

Hi @Jonatan

I think the easiest way to do this is to have Hazel run a shell script on the file, and have the shell script rename the file based on the information that it finds.

Does the /Users/Jonatan/go/bin/pup command return the information that you want to use as the name?

If so, you ought to be able to do something like this (be sure to get everything from the first line to exit 0 when you copy/paste).

#!/usr/bin/env zsh -f

	# we want to define a `$PATH` that will include more than the default folders
	# that Apple includes. It does not matter if some of these folders do not exist.
PATH='/usr/local/sbin:/usr/local/bin:/opt/homebrew/bin:/opt/homebrew/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/local/bin'

	# `$1` is the variable name for the first argument passed to the script
	# 	from Hazel, which should be the file name. I will assume you are using Hazel
	# 	rules to make sure this script is only going to run on files you want to check.
	# We are going to run the `pup` command on the file `$1` and save the result to
	# 	a variable named `$TITLE`
TITLE=$(cat "$1" | "$HOME/go/bin/pup" 'html body table tbody tr td div:nth-child(1) table tbody tr:nth-child(4) td table tbody tr:nth-child(3) td table tbody tr:nth-child(3) td:nth-child(2) span:nth-child(1) text{}')

	# if $TITLE is NOT (!=) empty ("") then we will rename the file
	# using `mv -n` (`mv` is `move` in unix) and the `-n` will make sure
	# that `mv` does not overwrite an existing file with the same name.

if [[ "$TITLE" != "" ]]
then
	mv -n "$1" "$TITLE"
fi

exit 0
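
One detail worth noting about `mv -n "$1" "$TITLE"`: the title is used as the entire destination, so the new name has no extension and, because it is a relative path, where the file lands depends on the script's working directory. Below is a minimal sketch of building a destination next to the original file instead. `new_path_for` is a hypothetical helper (not part of Hazel or pup), the sample path and title are made up, and it assumes the original file name has an extension.

```shell
#!/bin/sh
# Sketch only: build a destination path in the same folder as the source,
# reusing the source file's extension.
# `new_path_for` is a hypothetical helper; it assumes the file name contains a dot.
new_path_for() {
	src=$1
	title=$2
	dir=$(dirname "$src")      # folder containing the original file
	base=$(basename "$src")    # file name without the folder
	ext=${base##*.}            # text after the last dot, e.g. "html"
	printf '%s/%s.%s\n' "$dir" "$title" "$ext"
}

# Example with made-up values:
new_path_for "/tmp/invoices/mail 01.html" "iCloud 50 GB Storage Plan"
# prints: /tmp/invoices/iCloud 50 GB Storage Plan.html
```

In the script above this could be used as `mv -n "$1" "$(new_path_for "$1" "$TITLE")"`, which keeps the file in its folder and keeps the `.html` extension.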

Jonatan · April 24, 2021, 1:28pm

Hi @Tjluoma,

thanks for the help, that script worked wonders!
pup by itself turned out not to give me my desired output, so I had to add some cleaning steps to make it look good and not contain any problematic characters.
So this is what I ended up with:

TITLE=$(cat $1 | tidy -utf8 -q -f /dev/null -c --wrap 0 | "$HOME/go/bin/pup" 'html body table tbody tr td div:nth-child(1) table tbody tr:nth-child(4) td table tbody tr:nth-child(3) td table tbody tr:nth-child(3) td:nth-child(2) span:nth-child(1) text{}' | sed 's/://g' | tr -d '\n' | tr -s ' ')
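
To see what the three cleanup stages at the end of that pipeline do on their own, here is a small sketch that runs them on a made-up string instead of real `tidy` / `pup` output. The sample text is an assumption chosen for illustration only.

```shell
#!/bin/sh
# Sample text is made up; in the real pipeline the input comes from tidy and pup.
raw=$(printf '  iCloud: 50 GB\n   Storage Plan \n')

# sed 's/://g'  removes every colon
# tr -d '\n'    deletes newlines, joining everything onto one line
# tr -s ' '     squeezes runs of spaces down to a single space
clean=$(printf '%s' "$raw" | sed 's/://g' | tr -d '\n' | tr -s ' ')

printf '[%s]\n' "$clean"
# prints: [ iCloud 50 GB Storage Plan ]
```

Note that `tr -s ' '` only squeezes repeated spaces; a single leading or trailing space survives, as the brackets in the output show.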

I noticed that I never mentioned what I am actually doing with this, so here is a little addition:
I am using the HTML of the invoice emails Apple sends for every App Store purchase / subscription renewal, and I extract the title.
Thus, combined with a second rule, my file gets properly renamed to, for example:
20210310-Apple-iCloud 50 GB Storage Plan.html

Now on to solving the second step, the PDF conversion, from my other post.

Tjluoma · April 24, 2021, 4:48pm

Looks good! However, just a friendly note / word of warning.

Whenever you are dealing with a variable in a shell script, always put it in "straight double quotes" (straight ones, not ‘smart’ or ‘curly’ ones).

So instead of

	cat $1

you should always use

	cat "$1"

Otherwise you risk running into errors when there are spaces in filenames or paths.

Using "straight double quotes" will prevent those errors.

That will save you time and frustration later!
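
A quick way to see the difference is to count how many arguments a command actually receives. This is a sketch for `sh` / `bash`; `count_args` is a hypothetical helper and the filename is made up. (zsh does not split unquoted variables on spaces by default, but quoting remains the habit that works in every shell.)

```shell
#!/bin/sh
# count_args is a hypothetical helper that reports how many arguments it got.
count_args() { echo "$#"; }

file="Apple Invoice 2021.html"   # made-up filename containing spaces

unquoted=$(count_args $file)     # sh/bash split this into three words
quoted=$(count_args "$file")     # quoting keeps it as one argument

echo "unquoted: $unquoted, quoted: $quoted"
# prints: unquoted: 3, quoted: 1
```

With `cat $file`, `sh` or `bash` would look for three separate files named `Apple`, `Invoice`, and `2021.html`; with `cat "$file"` the shell passes the one real filename.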

Jonatan · April 25, 2021, 7:00am

Oh right, thanks, fixed!