Help parsing text and more from pdf

I have been trying to find a good solution for this for some time and have had some success with a long-winded approach but it fails a lot as it is very resource intensive.

What I’m trying to achieve…
Here in the UK, our local council shares the collection weeks for the different types of waste and recycling through a PDF file hosted online. This rota varies through the year, and I wanted to find a way of parsing the information in the PDF and converting the dates into calendar entries using Shortcuts.

The url for the pdf is as follows…

Where I’m stuck and thoughts on solution
As the table contains the dates in text, I think it would be feasible to extract the text from the pdf to create a list of dates for the calendar entries although I’ve not yet perfected the RegEx to extract them cleanly.
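A minimal sketch of the date-matching idea, in Python rather than Shortcuts. It assumes the extracted text contains a day number followed by a month name (for example "6th January" or "13 Jan"); the actual layout of the council's PDF text is not known here, and the sample string and year are made up for illustration.

```python
import re
from datetime import date

# Map three-letter month prefixes to month numbers.
MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

# A day number (with optional ordinal suffix) followed by a month name
# or abbreviation. The assumed "day month" order is a guess at the format.
DATE_RE = re.compile(
    r"\b(\d{1,2})(?:st|nd|rd|th)?\s+"
    r"(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*\b",
    re.IGNORECASE,
)

def extract_dates(text, year):
    """Return every 'day month' match in text as a date in the given year."""
    return [date(year, MONTHS[month.lower()], int(day))
            for day, month in DATE_RE.findall(text)]

# Hypothetical sample text, not taken from the real PDF.
sample = "Mon 6th January  Mon 13 Jan  Monday 3 February"
dates = extract_dates(sample, 2025)
```

The same pattern should carry over to a Shortcuts "Match Text" action, though the pattern would need adjusting to whatever the extracted text really looks like.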

The big challenge comes when I try to extract the information relating to the colour of the bin being collected on that date as this can only be identified by the coloured shape next to the date.

The only approach I could think of was to crop to a pixel where the shape is located and use the Toolbox Pro app to identify the hex code, then move the crop over to the next cell, sample again, and repeat until I’ve collected all the cells. This would give me a list of hex codes, each of which I would then map to a colour name describing the bin: blue, grey, black.
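The last step, turning a sampled hex code into a colour name, could be sketched like this in Python. The reference hex values are assumptions (the real PDF's shades are not known), and picking the nearest reference by squared RGB distance is just one simple way to tolerate slightly different shades.

```python
def hex_to_rgb(code):
    """Convert '#RRGGBB' (or 'RRGGBB') to an (r, g, b) tuple of ints."""
    code = code.lstrip("#")
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))

# Assumed reference colours; replace with values sampled from the real PDF.
BIN_COLOURS = {
    "blue": "#0000FF",
    "grey": "#808080",
    "black": "#000000",
}

def nearest_bin_colour(sample_hex):
    """Name the reference colour closest to the sampled pixel."""
    sample = hex_to_rgb(sample_hex)

    def distance(name):
        reference = hex_to_rgb(BIN_COLOURS[name])
        return sum((a - b) ** 2 for a, b in zip(sample, reference))

    return min(BIN_COLOURS, key=distance)

# Hypothetical sampled pixels.
names = [nearest_bin_colour(h) for h in ("#1133DD", "#7A7A7A", "#101010")]
```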

I’d be really interested to hear if anyone has any more straightforward approaches to this project.

I would propose considering the following.

1. Manual

Just manually translate those details once every 12 months. You don’t know when they will change the layout/shapes/colours, so any specifics you cater for this year may simply not work next year as there is no guarantee of consistency.

2. Accessible

All UK government bodies are required to provide information in an accessible format. Such formats are generally more parseable by computers, as that’s the principle behind the accessibility software they are often fed to.

For Trafford, their accessibility page describes that you must contact the department responsible to request it for anything inaccessible on the web site and it will not be immediately available as they create it for each request. Which hopefully is not true as that could be a lot of duplicated effort.

Now you may be asked to provide details of what you would require. Not everything is plain text output after all (audio, braille).

At this point there’s the question of whether this is a fair request if it is made for a computer’s benefit rather than a human accessibility need. Is that actually subverting the intent, and effectively asking someone else to do what you could probably do faster and better yourself manually?

3. Suggest

If nothing else, you could suggest that their PDF is not hugely accessible with the colours and symbols and that they should consider also publishing a simple weekly listing in pure text (date and collection type) on their website. That would give you a simple and consistent set of data to process going forward as well as the council then providing more accessible data, which will help fellow residents and save repeated effort from Council staff in (re-)producing more accessible content and sending it out.


I don’t think that it is unreasonable to ask a Council to ensure that all PDFs meet basic Accessibility Standards.

If it meets basic standards then it’ll be easy to grab any information we need.

As you’ve noted, that PDF is terrible for any reader and it fails almost every test in Acrobat’s Accessibility Check – which is an easy test for Council buyers to apply before approving the design and publishing stuff like this.

If the council publishes its calendar in iCalendar format, everyone can import it into the calendar app on their device and use the device’s built-in features to make it accessible to them.

Unfortunately not everyone necessarily has a personal device or even an Internet connected device. Local libraries are still a port of call for many in the UK to access council services and other online information. Many libraries still use time slot bookings as they remain so popular. It always surprises me to see how busy the public access suites still are in the libraries in my local area, but they remain a vital service for the community.

Yes, “everyone” went too far.

A public-facing Google calendar—or something similar—would serve those who have their own devices and those who don’t. My point was to try to steer the council into a format that’s already widely known and available rather than one of their own invention.

A clue-by-four applied to the council flunkie who said “I know, let’s create a web page so people can enter their postcode and download a custom-generated PDF.”

Good grief. Use an HTML <table>, people! Or, as Drang says, an .ics file which anyone can download to their smartphone. You’d think ensuring accessibility for blind users alone would be enough motivation to KISS… but, hey, your council tax at work.

(One only hopes they at least used automation to generate the PDFs, and didn’t build them all by hand. Still, inefficient.)

I suppose you could crack the PDF open in Acrobat and see if it has any additional accessibility content, but I wouldn’t hold much hope. While the PDF format supports such machine-readable annotations, almost no-one/no app ever adds them.

And while there are Python/Node/etc libraries for scraping raw information out of PDF files, it is a brittle PITA and could take a couple hours to get it working right. Simply not worth it for a once-a-year task, and will need rewritten if/when they redesign. Copying and pasting the table into an RTF file in TextEdit would preserve the colors, so you could probably extract all the info you need using AppleScript, but even that’s an hour’s work to write.

Ditto sylumer: doing the job manually sounds quickest and least painful. Copy and paste the table text into a plain text file, massage it into a list of dates that you can quickly add to your calendar, and note the garbage types manually. You only have to do it once a year; 10 mins work, plus another 50 minutes writing a polite letter to the council thanking them for publishing the information, but reminding them that blind users can’t read pictorial PDFs either and suggesting perhaps a simple accessible online HTML table and/or convenient downloadable .ics file is the way to go in future. Their taxpayers (who couldn’t care less for reading the rest of their waffle) will thank them for it.


I ran the PDF through a text extraction API for PDFs, which then gave me all the dates.

If I simply select the option manually for the first date as to what bin type it is, the rest are cycled through in the same order consistently until the last date.

So, now I have the dates and bin types … to do whatever you want, including generating calendar entries, reminders, etc
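The cycling step could be sketched as follows in Python. The rotation order and the dates are placeholders, not the real schedule; the only manual input is the bin type for the first date.

```python
from datetime import date
from itertools import cycle

def assign_bins(dates, rotation, first_bin):
    """Pair each date with a bin type, starting the rotation at first_bin."""
    start = rotation.index(first_bin)
    # Rotate the list so the manually chosen bin comes first.
    ordered = rotation[start:] + rotation[:start]
    # zip stops when the dates run out, so cycle can repeat indefinitely.
    return list(zip(dates, cycle(ordered)))

# Assumed rotation order and made-up dates.
rotation = ["blue", "grey", "black"]
dates = [date(2025, 1, 6), date(2025, 1, 13),
         date(2025, 1, 20), date(2025, 1, 27)]
schedule = assign_bins(dates, rotation, first_bin="grey")
```

This only holds while the council keeps a strict repeating order; any one-off change (a holiday swap, say) would need a manual correction.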

Hi @sylumer,

Thanks for sharing your thoughts. I should probably add some context as to how I started here… I appreciate that the quest to completely automate the process is a little extreme and that I could effectively do it more quickly, manually. I’d taken this challenge on to see if I could push the boundaries of my shortcut skills.

My current working approach is a hybrid as per your suggestion and I have broken it into 3 phases…

  1. Use matching to generate a list of dates (a little more complicated as I needed to stick the dates and months together).
  2. Iterate through those dates asking myself what colour bin was due (essentially manually setting it but giving myself a menu to choose the colour each time).
  3. Using this information to add each item to my ‘Bins’ calendar.
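For anyone curious about the "stick the dates and months together" part of step 1, here is a rough Python sketch. It assumes the extracted text comes out as a month name followed by that month's day numbers, which is a guess at the layout, and it does not handle a schedule that rolls over into the next year.

```python
import re
from datetime import date

MONTH_NUMBERS = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

def stitch_dates(text, year):
    """Attach each day number to the most recent month name seen before it."""
    current_month = None
    result = []
    for token in re.findall(r"[A-Za-z]+|\d+", text):
        key = token.lower()
        if key in MONTH_NUMBERS:
            current_month = MONTH_NUMBERS[key]
        elif token.isdigit() and current_month and 1 <= int(token) <= 31:
            result.append(date(year, current_month, int(token)))
    return result

# Hypothetical extracted text.
stitched = stitch_dates("January 6 13 20 27 February 3 10", 2025)
```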

I’ll try to tidy it up a bit and share it once I’ve done a bit more work but this approach works and delivers on a fair amount of the original challenge (I’ve still not given up on the automated colour detection :slight_smile: )

In terms of submitting a request to the council, I think the most sensible request would be that they consider providing a .ics file or a subscribed calendar (possibly the same thing, not sure how they work tbh).
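On how these work: an .ics file is plain text in the iCalendar format, and a subscribed calendar is essentially the same kind of file served at a URL that the calendar app re-fetches periodically. A minimal Python sketch of generating one from a list of dates and bin colours is below; the `PRODID`, the `UID` domain, and the sample data are placeholders I have made up.

```python
from datetime import date, datetime, timedelta, timezone

def build_ics(collections, now=None):
    """Build iCalendar text from (date, bin_colour) pairs as all-day events."""
    # DTSTAMP is written in UTC; `now` is assumed to be a UTC datetime.
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y%m%dT%H%M%SZ")
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example//bin-collections//EN",  # placeholder identifier
    ]
    for day, colour in collections:
        lines += [
            "BEGIN:VEVENT",
            f"UID:{day:%Y%m%d}-{colour}@bins.example",  # placeholder domain
            f"DTSTAMP:{stamp}",
            f"DTSTART;VALUE=DATE:{day:%Y%m%d}",
            # For all-day events the end date is exclusive.
            f"DTEND;VALUE=DATE:{day + timedelta(days=1):%Y%m%d}",
            f"SUMMARY:{colour.capitalize()} bin collection",
            "END:VEVENT",
        ]
    lines.append("END:VCALENDAR")
    # iCalendar lines are terminated with CRLF.
    return "\r\n".join(lines) + "\r\n"

# Made-up sample data.
ics_text = build_ics(
    [(date(2025, 1, 6), "blue"), (date(2025, 1, 13), "grey")],
    now=datetime(2025, 1, 1, tzinfo=timezone.utc),
)
```

Saving `ics_text` to a file ending in .ics should let most calendar apps import it; hosting that file at a stable URL is what would make it subscribable.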

I think you raise some interesting points on accessibility in general too, which leads me to feel that, despite their goal of making the information accessible, it potentially isn’t, depending on how you consume it.

Interesting points on Acrobat Accessibility checks. I wasn’t aware of this and will take a look. I would like to think our local councils here are considering these guidelines but I would not be surprised if they simply don’t have the resource.

It feels to me like this is the ideal solution. Absolutely agree that ics cannot be the sole solution as many out there are probably even relying on the hard copy which comes through the letterbox every so often. It does feel to me like a calendar subscription shouldn’t be that hard to set up and maintain.

I suppose some of this depends on how many different calendars the council is trying to manage. That said, they have to manage it SOMEWHERE, so why not in a calendar then generate the PDF from there.

In the UK there is the BinZone app - on iOS at least. I’m not sure which councils support it. Mine certainly does. Might be an alternative approach.

Looks like a regional app to serve Oxfordshire but interesting to see all the same.

I’ve progressed this a little, although it currently exists as 4 shortcuts which closely follow some of the suggestions from @sylumer - thanks for your input.

  1. Uses a URL get action and some text matching to create a list of dates, which I’m storing in Data Jar.
  2. A repeat action which loops through those dates and for each presents a menu in which I choose the colour of bin to be collected that week. These results are then also stored in Data Jar, leaving me with a series of dictionary items with date and colour attributes.
  3. The third step repeats through these dictionary items using the data to create a calendar entry with a colour in the event title.
  4. Finally or first to be totally accurate, there is a shortcut which gets all the entries in my ‘Bin Collections’ calendar and deletes them. This happens at the beginning of the process to avoid duplicates.
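To illustrate why the delete-first step keeps the whole thing safe to re-run, here is a small Python model of steps 3 and 4. The event and item dictionaries are simplified stand-ins for calendar events and the stored date/colour items, not the real Shortcuts or Data Jar structures, and the title format is made up.

```python
def rebuild_bin_events(existing_events, collections, calendar="Bin Collections"):
    """Drop all events in the bins calendar, then recreate them from the data."""
    # Step 4 (run first): remove everything already in the bins calendar.
    kept = [e for e in existing_events if e["calendar"] != calendar]
    # Step 3: one event per stored item, with the colour in the title.
    fresh = [
        {"calendar": calendar, "date": item["date"],
         "title": f"{item['colour'].capitalize()} bin"}
        for item in collections
    ]
    return kept + fresh

# Made-up stored items and an unrelated existing event.
collections = [
    {"date": "2025-01-06", "colour": "blue"},
    {"date": "2025-01-13", "colour": "grey"},
]
events = [{"calendar": "Home", "date": "2025-01-06", "title": "Dentist"}]
once = rebuild_bin_events(events, collections)
twice = rebuild_bin_events(once, collections)
```

Running it a second time on its own output gives the same set of events, which is the property the clear-then-create order is there to provide.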

Not quite the full automation I set out to achieve but actually a very quick process to run through. The main effort remaining is to cycle through the 52 weeks with the PDF open, selecting the relevant colour of bin.

I’ll see if I can get what I’ve done into a more shareable format.