Cleaning Text (hyphens and line breaks)

Recently I’ve been taking notes on academic articles that are pdfs (fun stuff). Sometimes I want to copy the article text into another tool (Obsidian or anything else). Of course the text is full of extra line breaks (at the column edge) and spurious hyphens (for words that wrap into the next column).

I would like an elegant way of:

  • removing all single breaks
  • when a removed break makes a word into “any- thing” then I want to remove the hyphen and space.
  • blank lines should remain as they’re likely between paragraphs

I purchased TextCase.app in hopes it would support this, but I can’t seem to find a way to do it.

Has anyone found an elegant solution? A shortcut would be ideal since I will use this on both macOS and iPadOS. Failing that, I could live with a trick invoked via Keyboard Maestro.

I do this often enough that I’ve gotten pretty quick at doing it semi-manually. From memory (since it’s mostly muscle memory by now):

  1. In BBEdit (or another app with regex search), find all double line breaks and replace with 1+ characters that are unlikely to appear in the text (eg, &&&)

  2. Find all hyphen-line break-spaces and replace with nothing (ie, deleting the hyphen and the line break)

  3. Find all line break-spaces and replace with a space (ie, closing up the stray line breaks)

  4. Find all &&& and replace with two line breaks (to return to correct paragraph breaks)
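For anyone who wants to try the four steps outside BBEdit, here is a minimal Python sketch of the same placeholder trick. The function name `unwrap` and the sample text are made up for illustration, and it assumes `&&&` never appears in the article itself.

```python
import re

PLACEHOLDER = "&&&"  # assumption: this string never occurs in the source text


def unwrap(text):
    # Step 1: protect paragraph breaks (two or more line breaks in a row).
    text = re.sub(r"\n{2,}", PLACEHOLDER, text)
    # Step 2: delete a hyphen, the line break after it, and any spaces that follow.
    text = re.sub(r"-\n *", "", text)
    # Step 3: turn each remaining line break (and spaces around it) into one space.
    text = re.sub(r" *\n *", " ", text)
    # Step 4: put the paragraph breaks back.
    return text.replace(PLACEHOLDER, "\n\n")


sample = "It is not pos-\nsible to explain\nthis trend.\n\nLong-\nterm pro-\ngrams work."
print(unwrap(sample))
```

Note that the second paragraph comes out as "Longterm programs work.", so a genuinely hyphenated term that happens to break at the margin loses its hyphen.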

One drawback of step 2: Some hyphen-line breaks are actually hyphenated terms, which break at the margin on the hyphen… so eliminating the hyphen is incorrect. However, most of those will show up as a misspelling (red underline in native apps), so they’re relatively easy to catch.

Except for fixing these incorrectly un-hyphenated terms, this could probably all be automated via KM or something similar. I’ve never gotten around to it.
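One way to make the manual check quicker is to list every word that would be created by deleting a line-end hyphen, so the candidates can be reviewed in one place. This is only a sketch; `joined_words` is a hypothetical helper name and the sample is a fragment of wrapped text.

```python
import re


def joined_words(text):
    """Return the words that deleting each line-end hyphen would produce."""
    # Capture the word part before "-<line break>" and the word part after it.
    pattern = re.compile(r"(\w+)-\n *(\w+)")
    return [left + right for left, right in pattern.findall(text)]


sample = "gains realized. Long-\nterm (more than six months) pro-\ngrams produce"
print(joined_words(sample))  # ['Longterm', 'programs']
```

Anything in that list that should have kept its hyphen (here "Longterm") can then be fixed by hand.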

I used to do this in MS Word too, so technically you don’t need regex search; but then you either have to copy-paste the line breaks or experiment to get them into the find and replace fields.


Have you tried Liquid Text? It’s a pdf note taking app that functions very differently. It might be worth an explore.

I would go with a Shortcut indeed, using the Replace Text action with the Regular Expression option enabled. @mlevison Would you mind sharing a piece of text that contains all of your use cases (including things that shouldn’t change), and the expected result?
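As a rough idea of what the two regex replacements could look like, here is a sketch in Python. It assumes the Replace Text action accepts lookbehind and lookahead the way Python's `re` module does, which I have not verified in Shortcuts; the function name `unwrap_lookaround` and the sample are illustrative only.

```python
import re


def unwrap_lookaround(text):
    # Pass 1: drop a hyphen plus a single line break when it sits inside a word.
    text = re.sub(r"(?<=\w)-\n(?=\w)", "", text)
    # Pass 2: replace a line break with a space unless it touches another
    # line break, so blank lines between paragraphs survive.
    return re.sub(r"(?<!\n)\n(?!\n)", " ", text)


sample = (
    "performance gains realized. Long-\n"
    "term (more than six months) pro-\n"
    "grams produce impressive gains.\n"
    "\n"
    "It is not pos-\n"
    "sible from the studies we reviewed\n"
    "to explain this trend."
)
print(unwrap_lookaround(sample))
```

This avoids the placeholder step, since the second pattern never matches a line break that is part of a blank line. It still turns "Long-term" into "Longterm" when the break falls on the hyphen.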

Dead clever. For some reason, I hadn’t thought of regex. Funny 10 years of work on Unix in my early career (SGI Iris, HP/UX and SunOS) and I didn’t automatically think of regex.

@evensupposing I’ve tried LiquidText. There are many reasons it doesn’t work for me, chief among them being stuck in a proprietary format.

@sebastienkb from https://www.researchgate.net/profile/Steven-Condly/publication/227519897_The_Effects_of_Incentives_on_Workplace_Performance_A_Meta-analytic_Review_of_Research_Studies_1/links/5a8ff2e80f7e9ba4296a5edb/The-Effects-of-Incentives-on-Workplace-Performance-A-Meta-analytic-Review-of-Research-Studies-1.pdf?origin=publication_detail

This text:
This analysis indicates that the
longer the implementation of an
incentive program, the greater the
performance gains realized. Long-
term (more than six months) pro-
grams produce impressive gains of
approximately 44%. Intermediate
programs (one month to six months)
realize gains of approximately 30%,
and short-term programs (less than
a month), about 20%. It is not pos-
sible from the studies we reviewed
to explain this trend but there are
many possible reasons.

My attempt at correction:
This analysis indicates that the longer the implementation of an incentive program, the greater the performance gains realized. Long-term (more than six months) programs produce impressive gains of approximately 44%. Intermediate programs (one month to six months) realize gains of approximately 30%, and short-term programs (less than a month), about 20%. It is not possible from the studies we reviewed to explain this trend but there are many possible reasons.

I’m guessing you’ll want at least a two-paragraph example.

Even if you don’t often excerpt more than one paragraph at a time, on the occasions that you do, you want to make sure the line-break fix preserves paragraph breaks.

Good point about liquid text.

Don’t be alarmed but now I know you’re using Firefox :face_with_hand_over_mouth: Basically I tried opening the PDF in different ways (data below). Bottom line is that Apple Preview respects your original paragraphs, while browsers don’t. Once you have the copied version from Preview, you could define a non-regex replace of "- " to "" and you should be good to go.

Important note: While attempting this exercise I learned that most text replacement tools destroy rich text. Unfortunately, when the result is plain text, you lose all the titles/subtitles in your articles as they become hidden in the text (because technically there are no newlines in front of them).
If the loss of rich text is an issue, then we could hack it with TextEdit to preserve rich text, but only using Keyboard Maestro, BetterTouchTool or the like. Do you have one of these, @mlevison?

If you’d like to ask Firefox to open all PDFs directly in Apple Preview, you can follow the steps in the Firefox Help article “View PDF files in Firefox or choose another viewer”.

If opening in Brave Browser:

This analysis indicates that the
longer the implementation of an
incentive program, the greater the
performance gains realized. Longterm (more than six months) programs produce impressive gains of
approximately 44%. Intermediate
programs (one month to six months)
realize gains of approximately 30%,
and short-term programs (less than
a month), about 20%. It is not possible from the studies we reviewed
to explain this trend but there are
many possible reasons

If opening in Firefox (same as your sample):

This analysis indicates that the
longer the implementation of an
incentive program, the greater the
performance gains realized. Long-
term (more than six months) pro-
grams produce impressive gains of
approximately 44%. Intermediate
programs (one month to six months)
realize gains of approximately 30%,
and short-term programs (less than
a month), about 20%. It is not pos-
sible from the studies we reviewed
to explain this trend but there are
many possible reasons.

If opening in Preview:

This analysis indicates that the longer the implementation of an incentive program, the greater the performance gains realized. Long- term (more than six months) pro- grams produce impressive gains of approximately 44%. Intermediate programs (one month to six months) realize gains of approximately 30%, and short-term programs (less than a month), about 20%. It is not pos- sible from the studies we reviewed to explain this trend but there are many possible reasons.
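For text copied from Preview, the non-regex replace of "- " with nothing can be tried out like this. It is a sketch only; `join_preview_hyphens` is a made-up name and the sample is a shortened piece of the Preview output above.

```python
def join_preview_hyphens(text):
    # Plain (non-regex) replace: remove every hyphen that is followed by a space.
    return text.replace("- ", "")


preview = "It is not pos- sible from the studies we reviewed. Long- term pro- grams"
print(join_preview_hyphens(preview))
# It is not possible from the studies we reviewed. Longterm programs

# Caveat: a spaced hyphen used as a dash is removed as well.
print(join_preview_hyphens("pages 10 - 12"))
# pages 10 12
```

So it is worth skimming the result for ranges or dashes written with spaces, as well as for terms like "Long-term" that lose their hyphen.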

Clever gameplay. The surface area examined was too small :-). I’m using the DevonThink PDF tool on both macOS and iPadOS. I sometimes work for a whole day or more on the iPad. (Fredrick would think I’m mad).

I don’t use Firefox at all, since its performance isn’t great.

I’m trying to use Shortcuts so I have a universal action on both iPad and Mac. I accept that I will lose formatting this way and that’s ok. Having to retype ## occasionally isn’t a hard problem.

FWIW, if I just do this on my Mac then yes, I have Keyboard Maestro and more.

On the Mac, I use TextSoap, which is designed for this. I believe it’s in Setapp and it allows you to customize your cleaning process. (It’s probably just a pretty UI for regex.)

I would add a +1 for TextSoap