Parsing PDF - comma sometimes there in PDF file and other times not

iSilentP · February 12, 2021, 2:39am

I just upgraded to Hazel v5, thinking that I needed to start using lists and tables to solve this, but I cannot figure out how to do it.

I need to parse a PDF statement that has a value which is sometimes in the hundreds and other times in the thousands. The difference is that the thousands value has a comma “,” as a separator (1,903). When the value is in the hundreds, there is no comma.

Any idea of how to get this to work would be appreciated.

Comma (thousands) sample:

Non-comma sample:

sylumer · February 13, 2021, 10:46am

You say that you are looking to use Hazel to parse the PDF, but could you describe what you are attempting to do with the parsed information? Are you renaming the file? Are you using it to determine where the file should be moved to? Are you intending to use it with something outside of Hazel; logging to a text file, or database perhaps?

Fundamentally, if you have already extracted the numeric portion (e.g. using a regular expression group like ([\d,]*)), you may simply be able to use a script to carry out a find and replace (substitute the comma with an empty string) to turn it into a numeric value, or use some string-to-number parsing. But I think what you can do, and which approach fits best, is guided to a large extent by what you are doing so far and what you are looking to achieve.
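As a rough illustration of the find-and-replace idea, here is a minimal Python sketch. It assumes the amount has already been pulled out as a string; the function names `to_number` and `first_amount` and the sample strings are made up for this example and are not part of Hazel.

```python
import re

# Digits with optional commas, starting with a digit so a lone "," never matches.
AMOUNT_RE = re.compile(r"\d[\d,]*")


def to_number(raw: str) -> int:
    """Remove thousands separators and convert the result to an integer."""
    return int(raw.replace(",", ""))


def first_amount(text: str):
    """Return the first amount found in text as an int, or None if there is none."""
    match = AMOUNT_RE.search(text)
    return to_number(match.group(0)) if match else None


if __name__ == "__main__":
    # Hypothetical sample values; the real statement text is not shown in this thread.
    print(to_number("1,903"))
    print(to_number("903"))
```

Whether something like this is useful depends on whether the value is needed as a number at all; for a plain rename, the matched text may be enough.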

The more information you can provide, the more likely it is you’ll get an optimised answer.

Hope that helps.

iSilentP · February 13, 2021, 3:08pm

Thank you for the reply. After MANY attempts, I figured it out through trial and error with the “Rule preview” functionality.

Since you asked, I was trying to extract a date (mm/dd/yyyy) and a numeric field (123 or 1,234) from an invoice and then use those two values to rename the PDF file, as in "yyyymmdd 123" or "yyyymmdd 1,234". The problem I was having was matching the optional comma once the number went past 999 (as that’s when the comma started throwing me off). I ended up creating two different rules, one for each case, and that is working for me. My thinking is that the two rules are XOR (either one or the other will run, but not both) on a given PDF file.
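For comparison, outside of Hazel's own pattern matching, a single regular expression can cover both the comma and non-comma cases. The following is only a sketch in Python: the "Total" label, the function name `build_name`, and the sample text are assumptions for illustration, since the actual invoice layout is not shown here.

```python
import re
from datetime import datetime

# Date in mm/dd/yyyy form.
DATE_RE = re.compile(r"\d{2}/\d{2}/\d{4}")


def build_name(text: str, label: str = "Total") -> str:
    """Build a 'yyyymmdd amount' name from invoice text.

    The label is a hypothetical anchor that keeps the amount pattern
    from matching digits inside the date.
    """
    date_match = DATE_RE.search(text)
    # One or more digits, optionally followed by groups of ",ddd".
    amount_match = re.search(
        re.escape(label) + r":?\s*(\d+(?:,\d{3})*)", text
    )
    if date_match is None or amount_match is None:
        raise ValueError("date or amount not found")
    date = datetime.strptime(date_match.group(0), "%m/%d/%Y")
    return f"{date:%Y%m%d} {amount_match.group(1)}"


if __name__ == "__main__":
    # Hypothetical invoice text.
    print(build_name("Invoice date 02/12/2021\nTotal: 1,903"))
    print(build_name("Invoice date 02/12/2021\nTotal: 903"))
```

The key part is `\d+(?:,\d{3})*`, which accepts 903 and 1,903 alike, so one pattern replaces the two separate cases.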

I can only assume that my approach is very “brute force” and that there’s a more elegant way to accomplish what I need, but I am satisfied with the results. When I have more time, perhaps I’ll try to find a more efficient solution.

FWIW: I asked the same question on the Noodlesoft board and the reply was “For now, you’ll have to use separate patterns to match each case. I’ll consider having options in the future where it can handle thousands separators.” I didn’t completely understand it, but I took the words “separate patterns” to mean separate “rules”, which got me to the solution mentioned above.

Thanks again for your response.

~ Tony

RosemaryOrchard · February 13, 2021, 6:08pm

If you hold Option before you click the plus, it should turn into an ellipsis. Then you can have a nested subset of rules and use "or" on those.