Text Manipulation: Eliminating Hyphenated Word Breaks

For research notes, I frequently copy quotations from OCRed PDFs into a Scrivener page, so I set about trying to make a Keyboard Maestro macro that would clean up the quotes in my clipboard and then paste to match the Scrivener formatting style (see example text below).

The main action here is a shell script, which converts the extra newlines to spaces, and changes double quotes to single. That gets passed back to the system clipboard and is pasted to match destination style.

pbpaste | tr '\n' ' ' | tr \" \'

This gets me most of what I’m looking for, but if I copy a line that ends with a hyphenated word break, the result has a word broken by a hyphen and a space:

Before

Lorem ipsum “re-
lay dot fm slash
st jude”

After

Lorem ipsum ‘re- lay dot fm slash st jude’

So, I’d like to modify this to produce something like below, though I’m not sure of the best tool to: (a) match a hyphen only before a newline and (b) delete both the hyphen and the newline. Thoughts or any help appreciated!

Lorem ipsum ‘relay dot fm slash st jude’

I was going to do this all in Perl, but then I thought about shell quoting rules and chickened out on the " to ' translation, keeping your tr command in the pipeline to handle that.

pbpaste | perl -0777 -pe 's/-\n//g;s/\n/ /g' | tr \" \'

The -0777 switch is an old trick for slurping in the entire input at once instead of reading it line by line, which is Perl’s default. The -p switch tells Perl to print the result when it’s done, and the -e switch tells it that the next argument holds the command(s) to run. The rest is just a pair of substitute commands: the first deletes every hyphen that sits directly before a newline (along with the newline), and the second turns each remaining newline into a space. The g modifier has Perl apply them globally instead of just at the first instance.

Be aware that getting rid of hyphens like this can sometimes mess up your text.

Meet my father-
in-law, Tim Cook.

becomes

Meet my fatherin-law, Tim Cook.

which isn’t what you want. Similarly, some line breaks are valuable, but you’re going to lose all of them.


Thanks, Dr. Drang! That’s a handy trick for multi-line pattern searching; my short search for something similar with sed came up empty.

Yes, since these are for notes at an early stage in my workflow, I’m OK with those trade-offs to cover the most common cases. Thanks for the help.
