Cutting Up PDFs

ufgraymatter · May 13, 2024, 11:00pm

Hey there,

I’m not looking for “split a 2 page PDF into 2 one-page PDFs.” I’m looking for the ability to define how a PDF is split into custom images or PDFs. For example, if I have a 10 page PDF, the 1st page needs to be cut at 3" down all the way across, and then again 4 more inches down, etc.

This is for custom reports that are created that I need to manipulate and import into other documents.

Thoughts on this? I know I could figure out a more “manual” way to do this with Keyboard Maestro simulating doing the work, but wondering if there was a more advanced tool out there!

sylumer · May 14, 2024, 6:06am

You could…

Split the PDF into separate pages using QPDF. Doing so allows you to process each page as a separate file, and so individually with other command line tools.
Process each page PDF using ImageMagick to convert it into an image. Images are much more convenient to deal with for cropping and inserting elsewhere. You can skip any page you don’t wish to use content from.
Then for each image, use ImageMagick to crop the image based on the appropriate coordinates and size, relative to the size of the image produced in the previous step. You can specify different details for each page/image, so think of it more as a sequence of instructions than a loop, because you are varying what to do with each page.

This could all be scripted via a shell script so you pass in the location of the PDF (or hard code it in the script) and then you can perform the required operations in sequence based on the generated file names and your knowledge of what crops you want to make.
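As a rough illustration of the three steps above, here is a minimal Python sketch that only builds the command lines. It assumes ImageMagick 7 (the `magick` command; version 6 uses `convert`), a 300 DPI render, and US Letter pages 8.5" wide. The file names and the helper names are made up for the example.

```python
import subprocess

DPI = 300        # assumed render density: pixels = inches * DPI
PAGE_WIDTH_IN = 8.5  # assumed page width (US Letter); adjust for your reports


def split_command(pdf: str, out_dir: str) -> list[str]:
    # qpdf substitutes %d with the page number of each single-page output.
    return ["qpdf", "--split-pages", pdf, f"{out_dir}/page-%d.pdf"]


def rasterise_command(page_pdf: str, png: str, dpi: int = DPI) -> list[str]:
    # -density is given before the input so the PDF is rendered at that DPI.
    return ["magick", "-density", str(dpi), page_pdf, png]


def band_crop_command(png: str, out_png: str, top_in: float, height_in: float,
                      dpi: int = DPI, width_in: float = PAGE_WIDTH_IN) -> list[str]:
    # Geometry is WIDTHxHEIGHT+X+Y in pixels; a full-width band starts at X=0.
    width = round(width_in * dpi)
    height = round(height_in * dpi)
    top = round(top_in * dpi)
    geometry = f"{width}x{height}+0+{top}"
    # +repage drops the leftover virtual canvas offset after cropping.
    return ["magick", png, "-crop", geometry, "+repage", out_png]


def run(commands: list[list[str]]) -> None:
    # Run each command in order and stop on the first failure.
    for command in commands:
        subprocess.run(command, check=True)
```

For the example in the first post (a cut 3" down, then another 4" further down), page 1 would get two bands: `band_crop_command("page-1.png", "p1-a.png", 0, 3)` and `band_crop_command("page-1.png", "p1-b.png", 3, 4)`. Passing the resulting lists to `run` executes them, provided qpdf and ImageMagick are installed.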

Hope that helps.

ufgraymatter · May 14, 2024, 12:32pm

awesome! I will explore that. Sounds like a great way to do it

ufgraymatter · May 14, 2024, 7:08pm

Alright! I have it working. Just need to go page by page and pull coordinates to give to ImageMagick.

Next question. These images need to be added to a word document in specific places. Any thoughts on that?

sylumer · May 14, 2024, 8:18pm

Depends on your other existing content, but you could generate a Markdown/HTML/LaTeX format file with the images inserted and then pass it into Pandoc to generate the Word document from plain text source files and images.
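A minimal Python sketch of the "generate Markdown, then hand it to Pandoc" idea. The section headings, text and image paths are hypothetical placeholders, and `build_markdown` is just a name chosen for the example.

```python
def build_markdown(title: str, sections: list[tuple[str, str, str]]) -> str:
    """Build a Markdown document from (heading, text, image_path) tuples."""
    parts = [f"# {title}"]
    for heading, text, image_path in sections:
        parts.append(f"## {heading}")
        parts.append(text)
        # Standard Markdown image syntax: ![alt text](path)
        parts.append(f"![{heading}]({image_path})")
    # Blocks are separated by blank lines so each becomes its own paragraph.
    return "\n\n".join(parts) + "\n"


def pandoc_command(markdown_path: str, docx_path: str) -> list[str]:
    # Pandoc infers the Word output format from the .docx extension.
    return ["pandoc", markdown_path, "-o", docx_path]
```

Writing the result of `build_markdown(...)` to `report.md` as UTF-8 and then running `pandoc report.md -o report.docx` should give a Word document with the cropped images in place. The image paths are resolved relative to where pandoc is run.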

If you need to work with an existing document (e.g. because it has a live Excel chart embedded in it), I’d hope to leverage bookmarks or unique placeholder text and some Word scripting (probably AppleScript) to carry out a set of content insertions or replacements.

While I love Keyboard Maestro, it would be my last option to build a macro to do the insertions by carrying out human interactions on the Mac. These interactions are simply never as reliable as fully script-driven automations because they can succumb to timing issues and similar external factors that nudge them off course just enough to fail.

ufgraymatter · May 14, 2024, 10:32pm

Love the suggestions. I’ll explore those. Thank you!

gluebyte · May 15, 2024, 7:52am

This shortcut also splits, converts, crops (at different height each), and saves https://www.icloud.com/shortcuts/96fff72c620946e7adc800a7edaa1b99

sylumer · May 15, 2024, 11:31am

I think you would need to keep in mind the following for using Shortcuts for this.

You would probably want to use the custom crop. It allows you to remove margins, unwanted captions, etc. in a single crop rather than multiple crops to achieve the same result.
The PDF and subsequent image compression routines used by Shortcuts have a strong tendency to produce larger file sizes for output of equal quality than the command line options. Older command line tools have simply been optimised over time whereas Shortcuts has mainly expanded its range of functionality over time - such optimisations typically coming in through API inheritance.
Using a loop assumes every page has something to crop, so you may end up with redundant files/processing, or need to adjust the logic to only process particular pages, e.g. by driving the conversion from a dictionary of page numbers with crop settings.
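To make the last point concrete, here is a small Python sketch of a conversion driven by a dictionary of page numbers with crop settings. The crop values, the 300 DPI figure, the 8.5" page width and the output file naming are all assumptions for illustration. Pages missing from the dictionary are simply skipped.

```python
# Page number -> list of (top, height) bands in inches. Hypothetical values.
CROP_PLAN = {
    1: [(0.0, 3.0), (3.0, 4.0)],
    4: [(1.5, 2.0)],
}


def planned_crops(page_count: int, plan: dict[int, list[tuple[float, float]]],
                  dpi: int = 300, width_in: float = 8.5) -> list[tuple[int, str, str]]:
    """Return (page, output_name, crop_geometry) for every planned crop."""
    jobs = []
    for page in range(1, page_count + 1):
        # Pages without an entry produce no jobs, so nothing redundant is made.
        for index, (top_in, height_in) in enumerate(plan.get(page, []), start=1):
            geometry = (f"{round(width_in * dpi)}x{round(height_in * dpi)}"
                        f"+0+{round(top_in * dpi)}")
            jobs.append((page, f"page-{page}-crop-{index}.png", geometry))
    return jobs
```

Each geometry string is in the `WIDTHxHEIGHT+X+Y` form that ImageMagick's `-crop` option takes, so the same plan could drive a shell script or a Shortcuts dictionary.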

ufgraymatter · May 15, 2024, 4:41pm

Definitely agree with these points. I actually have all of my crops created and working with ImageMagick. Now it’s just a matter of going through and figuring out the markdown setup to create the word doc I need. I have that testing now and can pull in a manually inserted image using pandoc

ufgraymatter · May 21, 2024, 6:48am

okay - i’m stuck. I have almost all of it done but am stuck with something. I have a template markdown file that I can convert using pandoc to Word and it looks fine. Not perfect yet, but is decent.

I have a script taking that template markdown file and replacing some text in that file, and saving it as a new file.

The new markdown file looks identical to the template other than some file paths for images. But using pandoc on the new markdown file leaves me with a word document that doesn’t have any linebreaks.

I can’t figure it out!

sylumer · May 21, 2024, 7:15am

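In Markdown, a single newline in the source does not show up in the output. You get a new paragraph by leaving a blank line between lines, and a line break within a paragraph by putting two spaces at the end of a line. Have you got these in place? It is something that is commonly missed.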

https://pandoc.org/chunkedhtml-demo/8.2-paragraphs.html

You could also try setting --wrap=preserve for pandoc to tell it to explicitly follow your layout in the MD file.

ufgraymatter · May 21, 2024, 1:24pm

Yeah. Both of those are okay. Like I said, the template file works. When AppleScript opens the template, modifies it, and saves it, even with all the same spacing, it doesn’t work. It’s like it’s not encoding it properly when it saves

sylumer · May 21, 2024, 1:47pm

Without details of the source content, the commands/parameters being used, annotated output, and working examples, it is hard to offer targeted advice.

Have you checked the file encoding of your files as you noted might be an issue?
Have you tried the markdown+hard_line_breaks option for pandoc?
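For reference, extensions such as `hard_line_breaks` are enabled by appending them to the input format name with `+`. A small Python sketch that builds such a command; the file names and the function name are placeholders.

```python
def pandoc_docx_command(source_md: str, output_docx: str,
                        extensions: tuple[str, ...] = ()) -> list[str]:
    """Build a pandoc command that reads Markdown with optional extensions."""
    # Extensions are appended to the reader name, e.g. markdown+hard_line_breaks.
    reader = "markdown" + "".join(f"+{name}" for name in extensions)
    return ["pandoc", "-f", reader, source_md, "-o", output_docx]
```

With `hard_line_breaks` enabled, pandoc treats every newline inside a paragraph as a line break, so `pandoc_docx_command("new.md", "new.docx", ("hard_line_breaks",))` is a quick way to test whether the newlines are being read at all.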

ufgraymatter · May 21, 2024, 3:21pm

okay - so I did a simple test and copied the text out of the created md file to a new file in BBEdit and saved it and then converted that directly and it worked fine. So working through the process, I had to do a couple of things to force the encoding of the file properly when it was being saved in AppleScript. It finally worked
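The thread does not say exactly what was wrong with the saved file, so the following is only a sketch of how one might check. It assumes two plausible culprits: a non-UTF-8 encoding (pandoc expects UTF-8 input) and carriage-return-only line endings. The function names are made up for the example.

```python
import codecs


def describe_text_bytes(data: bytes) -> dict:
    """Report BOM, UTF-8 validity and line-ending counts for raw file bytes."""
    if data.startswith(codecs.BOM_UTF8):
        bom = "utf-8"
    elif data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        bom = "utf-16"
    else:
        bom = None
    try:
        data.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    # Counts are meaningful for UTF-8 and single-byte encodings, not UTF-16.
    crlf = data.count(b"\r\n")
    return {
        "bom": bom,
        "valid_utf8": valid_utf8,
        "crlf": crlf,
        "cr_only": data.count(b"\r") - crlf,
        "lf_only": data.count(b"\n") - crlf,
    }


def normalise(data: bytes, source_encoding: str) -> bytes:
    """Re-encode as UTF-8 with LF line endings."""
    text = data.decode(source_encoding)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.encode("utf-8")
```

Running `describe_text_bytes(open("new.md", "rb").read())` on both the template and the generated file and comparing the results would show whether they really are identical at the byte level.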

ufgraymatter · May 22, 2024, 3:50am

okay. So huge blow to the whole operation. I got more documents to test the system on. They are all slightly different so all of the crop points need to change. Thoughts on best way to dynamically crop based on each document?

sylumer · May 22, 2024, 6:18am

What do you mean?

Do you have a number of variations that change the crop locations in a standard way? For example, Monday and Friday versions of a daily file include extra sections which shift everything by half a page. If so you would need a way to identify each variation, such as from the filename or metadata (creation timestamp could help with a day based variation).
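If the variations are predictable like that, picking the right crop settings can be a simple lookup. Here is a Python sketch that assumes a date in the filename in `YYYY-MM-DD` form and treats Monday and Friday as the "extended" layout. The naming scheme, the variant names and the crop values are all hypothetical.

```python
import re
from datetime import date

# Variant name -> {page number: [(top, height) in inches]}. Hypothetical values.
PROFILES = {
    "standard": {1: [(0.0, 3.0), (3.0, 4.0)]},
    "extended": {1: [(0.0, 3.0)], 2: [(0.0, 4.0)]},
}


def variant_for(filename: str) -> str:
    """Choose a layout variant from a YYYY-MM-DD date in the filename."""
    match = re.search(r"(\d{4})-(\d{2})-(\d{2})", filename)
    if match is None:
        raise ValueError(f"no date found in {filename!r}")
    year, month, day = (int(part) for part in match.groups())
    # weekday(): Monday is 0 and Friday is 4.
    if date(year, month, day).weekday() in (0, 4):
        return "extended"
    return "standard"


def crop_plan_for(filename: str) -> dict[int, list[tuple[float, float]]]:
    return PROFILES[variant_for(filename)]
```

The same lookup could key off a creation timestamp or any other metadata instead of the filename. The point is only that each known variation gets its own named set of crops.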

If the positioning is potentially different every time then I guess you need a fuzzier match plus some recognition of what is being extracted. As a human you understand which chart you might want because of its caption and context. You would presumably need some sort of AI tool to be able to do that sort of matching automatically and reliably. I have not come across one that would do this, but there could be something out there.

If the latter and you can’t find anything then you have a couple of options.

The first would be to accept that you need human intervention to identify the areas to crop.

The second would be to contact the producer of the documents you are cropping and see if there are any other formats they can provide for the content you want to include. For example, if they are generating the doc, are they doing so by including the content you want as images, and could they share those images with you? Maybe they could be convinced to standardise layout locations for the content you want?

So in effect, the options above are based around these principles.

Stop the variations and standardise the source.
Account for the variations in the source.
Bypass the reliance on the current source.

Hope that helps.