Reference file here
The referenced PDF is a composite of a few screenshots illustrating the problem I’m observing.
My objective is to use the translated text from GoodNotes as an input to Workflow for additional processing.
Referencing the PDF by number, beginning on page 2…
- Handwritten note. I lasso’ed the text, translated it, and pasted the results into the note. As you’ll see, that translation was accurate.
- Simple example Workflow, called from the GoodNotes export share extension. You’ll note the immediate drop into the Workflow content graph.
- and 4. The Workflow content graph shows the PDF rendered correctly.
- Above the red line you’ll note the Workflow version of the translated handwriting (i.e., the translated text passed into the Workflow from GN) is mangled. The individual words and symbols were translated correctly, but the text has been re-assembled in a nonsensical order.
Any ideas why?
Thanks in advance for any troubleshooting tips — jay
Rather than passing the workflow the whole PDF, maybe try copying the good text from GoodNotes and grabbing that from the clipboard in Workflow?
@ChrisUpchurch thanks for the input. I didn’t do a good job describing my overall workflow. The problem example I shared is a portion of a larger workflow I use for processing handwritten meeting notes.
I’m adding a portion to scan for lines flagged as to-do’s by using the ‘@@‘ prefix. As those lines are found in Workflow, new Reminders are created. There could be anywhere from 0 to n such lines spread through the full set of notes. In other words, manual copy in GN would defeat the purpose of the automation!
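To make the goal concrete, here is a rough sketch of the scan step in Python rather than Workflow actions. The function name `find_todos` is hypothetical, and the line breaks in the sample notes are my assumption about how the handwritten page is laid out. It only works if the text arrives in the right line order.

```python
def find_todos(notes: str, prefix: str = "@@") -> list[str]:
    """Return the text of every line that starts with the to-do prefix."""
    todos = []
    for line in notes.splitlines():
        stripped = line.strip()
        if stripped.startswith(prefix):
            # Drop the prefix and any space that follows it.
            todos.append(stripped[len(prefix):].strip())
    return todos


# Sample notes; the line layout here is assumed, not taken from the PDF.
notes = """@@test
Now is the time
@@remember the milk
This is a test
@@call home before tonight
end of meeting"""

todos = find_todos(notes)
print(todos)
```

Each entry in `todos` would then become one Reminder.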
PDF stores information in a text layer when available, but there are also positional elements to it. What appears as a single line on the page can be (but is not always) a more complicated placement behind the scenes. You’re using a non-linear capture application (you can write and drop content anywhere), so there’s probably quite a lot of that going on.
I’m assuming that page 1 of your example PDF is exactly what GoodNotes produced. If I copy the page content out into Drafts (a text-only editor), I get the following:
@
@
test
Now
is
the
time
@ @
remember
the
milk
This
is
a
test
@
@
call
home
before
tonight
end
of
meeting
@@test Now is the time @@remember the milk This is a test @@call home before tonight end of meeting
This should be what Workflow is pulling from that page, and it might go some way to explaining what you are seeing. It might be worth popping the copied text into a hex editor; I’ve not invested in one of those apps on iOS, as I’ve never needed to view hex there before, and the online versions are somewhat limited in my experience. I’m wondering if some of the spaces are not really spaces but are being converted to spaces by Drafts.
Actually, now I think about it, it should be possible to use Workflow’s URL encode action to compare the spaces in the string looking for ‘odd’ characters. Might be worth a double check.
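For anyone who would rather check this outside Workflow, the same idea can be sketched in Python with the standard library. The helper name `reveal` and the sample strings are made up for illustration. An ordinary space encodes as `%20`, whereas something like a non-breaking space shows up as a different byte sequence.

```python
from urllib.parse import quote


def reveal(text: str) -> str:
    """Percent-encode everything that is not a letter, digit, or _.-~"""
    return quote(text, safe="")


plain = "remember the milk"        # ordinary spaces (U+0020)
odd = "remember\u00a0the milk"     # first gap is a non-breaking space (U+00A0)

print(reveal(plain))  # remember%20the%20milk
print(reveal(odd))    # remember%C2%A0the%20milk
```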
Ah hah… the positional aspect of the PDF text makes perfect sense with what I’ve been seeing in various tests. Great tips regarding inspecting characters… I’ll definitely do that, if for no other reason, to fully understand what’s going on here.
Unfortunately, with that understanding, I need to re-think my approach to this workflow. Was so close to seemingly a great addition to my daily processes…
@sylumer thanks for taking the time to post the response — jay
The text, as it comes in to Workflow, is not in sequence; i.e., the URL-encoded text is in the same shuffled order as my prior examples. Assuming I’m using the URL Encode action correctly (it is the very first step of the workflow), I don’t see any offending characters, just the text in the shuffled order.
%20@@%0Atest%0ANow%20is%0A@@%20remember%0AThis%20is%0Athe%0Atime%0Amilk%20test%0Athe%20a%0Abefore%20tonight%20of%20meeting%0Acall%20home%20end%0A@@%0A
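As a cross-check, the encoded string above can be decoded with Python's standard library. This is only a sketch of one way to inspect it: it lists which whitespace characters are present and prints the lines in the order they arrived.

```python
from urllib.parse import unquote

# The URL-encoded text exactly as posted above, split in two for readability.
encoded = (
    "%20@@%0Atest%0ANow%20is%0A@@%20remember%0AThis%20is%0Athe%0Atime%0A"
    "milk%20test%0Athe%20a%0Abefore%20tonight%20of%20meeting%0Acall%20home%20end%0A@@%0A"
)

decoded = unquote(encoded)

# Collect the code points of every whitespace character that appears.
whitespace = sorted({f"U+{ord(ch):04X}" for ch in decoded if ch.isspace()})
print(whitespace)

# Show each line as it arrived, with quotes so leading spaces are visible.
for line in decoded.splitlines():
    print(repr(line))
```

Only plain spaces (U+0020) and line feeds (U+000A) turn up, which fits the observation that the characters are fine and only the ordering is shuffled.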
Per your prior comments, I think this implies that the app rendering the translated text has the logic to re-assemble it in the expected format. GoodNotes does that. I just tested in PDF Viewer and it, too, presents the translated text in the correct format.
Will keep noodling on other approaches.
Any PDF viewer should render the PDF correctly. That’s the aim of PDF: to have a consistent layout across platforms. The issue is that there’s a text layer associated with it, and whilst the text is there, it contains different information about layout (i.e. spacing and line breaks) once it is read as plain text.
That’s where I think this is probably breaking down.
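To illustrate what "positional" means here, below is a small Python sketch of how a viewer might rebuild reading order from positioned text fragments. Everything in it is hypothetical: the `Fragment` type, the coordinates, and the tolerance are invented for the example, and I am not claiming this is what GoodNotes, PDF Viewer, or Workflow actually do. Getting real fragments out of a PDF would need a PDF library, which is not shown.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    x: float  # left edge of the fragment
    y: float  # vertical position, measured from the top of the page


def reading_order(fragments, line_tolerance=5.0):
    """Group fragments into lines by y, then sort each line left to right."""
    lines = []
    for frag in sorted(fragments, key=lambda f: f.y):
        # Same line if the fragment sits close to the line's first fragment.
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return [
        " ".join(f.text for f in sorted(line, key=lambda f: f.x))
        for line in lines
    ]


# Fragments listed in a shuffled storage order (made-up coordinates).
fragments = [
    Fragment("milk", 160, 52),
    Fragment("@@", 10, 50),
    Fragment("the", 120, 51),
    Fragment("remember", 40, 50),
    Fragment("This", 10, 80),
    Fragment("test", 90, 81),
    Fragment("is", 50, 80),
    Fragment("a", 70, 82),
]

print(reading_order(fragments))
```

A plain-text export that skips this kind of step would hand the words over in storage order, which would look much like the shuffled text in this thread.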
A nasty hack might be to add in another character string at the end of a line. Think like sending a telegram and finishing each sentence with STOP.
If you were to add, say, $$$ at the end, and the OCR recognises that as three consecutive dollar characters, and you then stripped all new lines in Workflow and replaced all “$$$” entries with new lines, that might help you get around it. But, as I say, that’s a pretty nasty hack.
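Roughly, the hack would look like this in Python (a sketch only; `rebuild_lines` is a made-up name, and the sample input is invented). It assumes the words of each handwritten line still arrive together and in order with their marker, which may not hold given how shuffled the text has been.

```python
def rebuild_lines(raw: str, marker: str = "$$$") -> list[str]:
    """Drop the original line breaks, then split on the end-of-line marker."""
    # Replace every run of whitespace (including newlines) with one space.
    flattened = " ".join(raw.split())
    # Each marker now stands in for a real line ending.
    return [part.strip() for part in flattened.split(marker) if part.strip()]


# Invented input: line breaks land in the wrong places, markers are intact.
raw = "@@remember the\nmilk $$$\nThis is\na test $$$"

print(rebuild_lines(raw))
```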
(Probably not relevant in this case, but) on your “Any PDF viewer should render the PDF correctly” point, @sylumer: I’ve observed that Preview sometimes has issues with rotated pages. (The documents we generate are a mix of normal and rotated pages.)
I think PDFPen has the same problem - so might well be the same engine. It would be nice to know which PDF viewer uses which engine.
So, I have to disagree slightly with the expectation that PDF viewers always render correctly.
Actually you may note that I used “aim” rather than “expectation”. I’ve come across disparities across a variety of platforms in the past, but fundamentally all PDF viewers should aim to render the same PDF identically.
Here’s a note about the original aims:
Ref: The history of PDF | How the file format and Acrobat evolved
Hope that provides adequate clarification.