General text munging script for use in Quick Action (Service)

I have long used John Gruber’s TitleCase.pl script to convert text strings to a proper title case.

The way I use it works like this:

  1. Select some text.
  2. Hit a keyboard shortcut.
  3. Selected text is passed through John’s script, and the replacement text is pasted in place, replacing the selected version.

All of the hard work (special cases, etc.) is in the Perl script. The Mac automation is extremely simple: just a Quick Action created in Automator that runs the Perl script on standard input and then replaces the selected text with the script’s output.

I also add a keyboard shortcut for the quick action, in System Preferences > Keyboard > Shortcuts > Services.
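
For what it’s worth, the whole Quick Action is just “Workflow receives current text in any application”, the “Output replaces selected text” box checked, and a single Run Shell Script action with “Pass input” set to “to stdin”. The script body is roughly this one line (assuming TitleCase.pl lives in ~/bin; adjust for wherever you keep it):

perl ~/bin/TitleCase.pl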

I’d like to create a similar script, for different text processing. The basic idea would be the following:

  1. Select some text.
  2. Fire the quick action on it.
  3. The script takes the text, and uses regular expressions to match specific possible patterns.
    • If it doesn’t match the first pattern, it tries the next, and the next, in a loop.
    • If it finds one of those patterns, it uses a substitution pattern to alter the text in some way, returns that text, and stops trying to match patterns. (Think break in a loop.)

My specific use case is that I’m cleaning up a lot of old content that used multiple different formats for creating links. I want to select an existing link, run the quick action, and output a modernized version of the link.

Here are two examples of substitutions I want to do:

Textile link to Markdown link
"([^"]+)":([a-z0-9/-_.,?]+[a-z0-9])(?i) ➜ [\1](\2)

"Regular" HTML link to Markdown link
<a href="([^"]+)" title="([^"]+)">(.+)</a> ➜ [\3](\1 "\2")

I would love to just re-use the TitleCase.pl script as a shell, with some changes to handle the find/replace pairs, but that Perl is … I’m not a Perl guy.

This seems like such a straightforward and generally usable tool that I was hoping to find something ready-made here on Automators. But I didn’t find anything that looks likely.

I have no particular attachment to any specific scripting language for this. Perl is great, Ruby is great, COBOL, whatever works.

I don’t really need help with the regular expressions part. My examples aren’t particularly solid; I just knocked them out to have something to write the rest of the script around. They can definitely be improved, but the problem I have is putting them into a script that can use them.
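
To make it more concrete, the shape I’m imagining is roughly this (my Perl is rusty enough that this is a sketch of the idea rather than code I trust, using the rough patterns from above):

while (my $line = <>) {
	if ($line =~ s{"([^"]+)":([a-z0-9/-_.,?]+[a-z0-9])(?i)}{[$1]($2)}) {
		# the Textile pattern matched and was replaced; don't try the others
	}
	elsif ($line =~ s{<a href="([^"]+)" title="([^"]+)">(.+)</a>}{[$3]($1 "$2")}) {
		# the HTML pattern matched and was replaced
	}
	print $line;
}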

Anybody know of a good example of doing this?

Curious: why not use Find/Replace or another solution which can process all occurrences at once? Seems nicer than having to manually select each occurrence.

Also, can you post two examples of links in Textile and HTML?

Reasonable question. Oh, how I wish I could dump the database to SQL and run a search-and-replace. I would not be here, I would be in BBEdit! (Or <your_favorite_editor>, I admit to using Sublime and VS Code and TextMate, too…)

For this particular “project”, I’m working my way through ~15 years of blog posts (personal), across several sites. And (because reasons) they’ve been neglected for many years, and there are a lot of other problems. I can’t solve all of them by search-and-replace. Every post is getting visited and revised, one at a time. Multiple passes! Fun! I wish it were not true, but I’ve arrived here after quite a few experiments.

But, totally separate from that, in my day job (tech writer) I regularly have occasion to edit other people’s material, or even my own. I’ve become very accustomed to being able to select, keyboard shortcut, title text is fixed. (You would not believe how many engineers cannot master Title Case, when they can master indentation and termination and other syntax rules. And honestly, why should they? They give me enough information, I take it from there, we’re both doing our jobs.)

And so, as I was working my way through these old posts, it occurred to me that many of the edits I was making were formulaic. One of them is fixing titles from the years when I thought sentence case “worked” for post titles on my blog, for which I have an easy keystroke shortcut. Another is the link format, which was subject to whim, apparently. I can see the value of an easy keyboard shortcut for those, too. And there might be others.

But, totally separate from my personal use cases, I really do think that there’s a reusable pattern here:

  1. Select text.
  2. Keyboard shortcut.
  3. Selected text is filtered through a programmatic transformation.
  4. Selected text is replaced by the transformed version.

And so, I’m here, looking for the right shell into which I can pour my specific transformations.

(The TitleCase.pl script is a great example, and is so close to being usable as a great shell. But my Perl skills are ancient, and that Perl is so idiomatic that I can’t figure it out. Worst case, that’s where I’ll go; I can (re-)learn. But I’d rather not.)

And, here are examples of specific text for each of my two transformations:

Textile

"Boulangerie Bay Bread":/blog/26

Transformation:
"([^"]+)":([a-z0-9/-_.,?]+[a-z0-9])(?i) ➜ [\1](\2)

Output:
[Boulangerie Bay Bread](/blog/26)

Plain HTML with title Attribute

<a href="http://sfgate.com/cgi-bin/article.cgi?f=/c/a/2002/10/27/LV146808.DTL" 
title="SF Gate: No Snooze, You Lose">some depressing article</a>

Transformation:
<a href="([^"]+)" title="([^"]+)">(.+)</a> ➜ [\3](\1 "\2")

Output:
[some depressing article](http://sfgate.com/cgi-bin/article.cgi?f=/c/a/2002/10/27/LV146808.DTL "SF Gate: No Snooze, You Lose")

While I appreciate your need to edit each file individually, @k.a.ll.e makes a good point about not needing to select each link within each file. Here’s a quick action that will change all the links in the selected text (provided they fit the regexes you gave):

(Screenshot of the Quick Action in Automator.)

The text of the script is

while (<>) {
	s{<a href="([^"]+)" title="([^"]+)">(.+)</a>}{[$3]($1 "$2")}g;
	s{"([^"]+)":([a-z0-9/-_.,?]+[a-z0-9])(?i)}{[$1]($2)}g;
	print;
}

(You could also save this script to a file and call it the same way you call Gruber’s titlecase.pl.)
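
If you do save it to a file (say linkfix.pl; the name is just a placeholder), you can test it in Terminal before wiring it into a Quick Action:

echo '"Boulangerie Bay Bread":/blog/26' | perl linkfix.pl

which should print [Boulangerie Bay Bread](/blog/26). You can also pipe an entire post through it with pbpaste if the text is on the clipboard.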

A few notes on Perl:

  1. The while (<>) {…} construct loops through all the lines of the input and applies whatever commands are in the braces to each line in turn. Within the loop, the current line is put in a special variable, $_, which is the default argument for the substitution and print commands within the braces. This is why you don’t see any arguments within the braces: they are implicit. (There’s a spelled-out version after these notes if that helps.)
  2. There’s no need to test for a match before applying the substitution. If there’s no match, there’s no substitution. This is why we’re able to put the substitutions one after the other.
  3. Although Perl typically uses slashes to delimit the find and replace parts of a substitution, you can use other characters. This allows you to avoid escaping slashes in the regex pattern. I used braces, s{…}{…}, because there’s a slash in your Textile regex, and did the same in the HTML substitution for consistency.
  4. Perl uses $1, $2, etc. in the replacement part, not \1, \2, etc.
  5. The g at the end of the substitution command means “global.” It tells Perl to apply the substitution to every pattern in the line, not just the first one it finds.
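
If it helps to see those implicit arguments spelled out, here’s the same loop written with an explicit variable. It behaves identically; it’s just more verbose:

while (my $line = <>) {
	$line =~ s{<a href="([^"]+)" title="([^"]+)">(.+)</a>}{[$3]($1 "$2")}g;
	$line =~ s{"([^"]+)":([a-z0-9/-_.,?]+[a-z0-9])(?i)}{[$1]($2)}g;
	print $line;
}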

The upshot is that although you can use this to change links one at a time, you can also select the entire text of a blog post and change all the links with one command.

Dr. Drang! What a delight, and an honor, to have you help me out on this. I should have remembered that you often have interesting discourses on regular expressions (Numbers!), and checked for ideas on your blog.

Thank you so much for the solution, which, after testing on ~40 articles or so, is working very well. I did have one incorrect transformation when there were two links on one line, where the expression greedily matched from the start of the first to the end of the second.

I think the right answer is to make the matching of text between <a> and </a> be any character that’s not a “<”:

s{<a href="([^"]+)" title="([^"]+)">([^<]+)</a>}{[$3]($1 "$2")}g;

Of course, I really should google for a more complete expression for matching link tags; I’m sure there are many better than what I started with. I’ll do that … well, another time.

Can I ask one (more ;–) question? In the Textile link matcher, what does (?i) do?

s{"([^"]+)":([a-z0-9/-_.,?]+[a-z0-9])(?i)}{[$1]($2)}g;

That is, a quick google tells me it turns on case-insensitive mode. But it’s at the end of the matcher expression. And as I read the linked explanation, it should only affect matching after it. So if it’s at the end…what’s it doing?

Again, thanks so much for providing this great solution to my original question!

I confess I didn’t check the regexes with any care. I just put them into a Perl script and did some quick (and inadequate) tests.

The problem with the HTML link is the (.+) between the opening and closing tags. As you’ve seen, the + is “greedy,” matching everything until the last </a> in the line. Your change will work. You could also use the non-greedy form of +, which is +?:

<a href="([^"]+)" title="([^"]+?)">(.+?)</a>

And you could do the same thing between the double quotes:

s{<a href="(.+?)" title="(.+?)">(.+?)</a>}{[$3]($1 "$2")}g;
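
Here’s a contrived two-links-in-one-line example (the URLs are made up) that shows the difference between the greedy and non-greedy versions:

my $line = '<a href="/a" title="A">one</a> and <a href="/b" title="B">two</a>';
(my $greedy = $line)    =~ s{<a href="([^"]+)" title="([^"]+)">(.+)</a>}{[$3]($1 "$2")}g;
(my $nongreedy = $line) =~ s{<a href="(.+?)" title="(.+?)">(.+?)</a>}{[$3]($1 "$2")}g;
print "$greedy\n";      # prints [one</a> and <a href="/b" title="B">two](/a "A"): one mangled link
print "$nongreedy\n";   # prints [one](/a "A") and [two](/b "B"): two correct links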

As for the (?i), my understanding is that in some flavors of regex, it turns on case-insensitive matching for the whole pattern, even when it’s at the end of the pattern. Certainly you want to match upper-case letters in the URL, so it looks like the (?i) is an attempt to do that. It does nothing in Perl, though, where we’d be better off deleting it and using the i flag at the end:

s{"(.+?)":([a-z0-9/-_.,?]+[a-z0-9])}{[$1]($2)}gi;

There’s another weird thing in the Textile link regex. The

[a-z0-9/-_.,?]+

part includes /-_, which I suspect was intended to match those three individual characters but actually matches the range of ASCII characters from slash to underscore. This range includes all the digits, several punctuation marks, and all the upper-case letters. It doesn’t include several valid URL characters (e.g., the hyphen) and may have worked for you in the past only because the links you were dealing with didn’t include those characters.
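
If you want to convince yourself of that, a couple of throwaway lines like these (testing single characters against just that character class) show what it actually matches:

print "A" =~ m{[a-z0-9/-_.,?]} ? "match\n" : "no match\n";   # match: "A" falls inside the /-_ range
print "-" =~ m{[a-z0-9/-_.,?]} ? "match\n" : "no match\n";   # no match: the hyphen isn't in the class at all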

I looked at the specification, and it’s pretty complicated. But maybe your regex is good enough to handle the kinds of links you’re facing.

A ha! That’s where I was having problems with Textile links “stopping short” at the first hyphen, and leaving the rest outside the Markdown delimiters. I looked at that multiple times last night, and couldn’t see the problem. Now that you’ve pointed it out, it’s obvious.

I added a couple of other missing characters, and (with your bug fix) I currently have this for the Textile replacement:

s{"([^"]+)":([a-z0-9/_.,?#=&-]+[a-z0-9/])}{[$1]($2)}gi;

I definitely should finish the work to perfect the regular expressions, but this quick-and-dirty version worked well enough when I cranked through the last ~100 articles last night. I’m pretty sure this one script saved me at least a couple hours over the 300-400 articles I needed to process. I don’t see a way to leave a tip at leancrew.com, but I certainly owe you at least a beer, coffee, or beverage of your choice! LMK if there’s a way to do that.

Thanks again!

As far as the background of this project goes, I started a blog when my dot.bomb went under in 2001. It’s moved through two different blogging systems (monaural jerk ➜ WordPress 1.2), and many versions of WordPress up to 3.something. It changed sites several times, requiring different URL paths, before I wised up on not including the FQDN. It’s been unceremoniously shoved from one hosting provider to another, twice. (Once with no notice of shutdown.) It’s gone through my various whims with regard to markup, and markup systems like Textile and Markdown. Evolving improvements to WordPress image handling. And the expansion of what I was trying to do, from a personal journal not intended for others to (self-indulgent) photoblog to technology instructions (mostly around iTunes and audiobooks).

I don’t think I found any posts that mixed both Textile and Markdown (because the formatter plugin I used forced you to choose one or the other), but I certainly had one or the other plus plain HTML, plus a weird “temporary” [span] thing I did to avoid having to fix the way I listed links in the monaural jerk days.

In all, while the one author (me) was consistent in my quirks for periods at a time, I was anything but consistent over time: I was learning, I kept changing the focus of the blog, and the software was improving so rapidly.

And then when a bunch of things broke with the last forced hosting provider move (2014?), I just lost my motivation, until recently. Processing the ~600 posts and pages into 100% Markdown is the first of several reformatting steps. I have to recover my screenshots and other images, and re-do the markup for all of them. And so on.

Not sure it’s all worth it, especially after reading my earliest posts. Not even interesting to me anymore, when they’re not cringe-worthy.

But, gotta do something while self-isolating at home. Netflix, etc. is great, my wife and I still like each other, and we have a very large supply of booze in our closet. But sometimes you want to feel “productive”, and I am not a handy person, so… I decided to get this project off my list!

Thank you all again!

I’m curious, did you work your way through all of the posts?

@k.a.ll.e Well, I did finish processing all of the posts with this script, converting 98% of my posts and links to Markdown format, along with other associated cleanup. Thanks to everyone who assisted with improving the script!

But I stalled there. After looking at every single post, while they’re all technically valid Markdown, I have a lot of hideous HTML markup for the different ways I manually included photos and graphics into my posts. I need to basically delete all of that markup, add the graphics back into the WordPress media library, and include them from there. It’s…a big task.

Which is complicated by the fact that (a) I’m not sure I have a local copy of all of the images; (b) images on the live blog are mostly broken, due to misconfiguration done by my previous hosting provider that carried over to the new one (forced migration #2 for the site, which was itself caused by their need to resolve problems from forced migration #1…); (c) I don’t have [S]FTP access, and attempts at fixing that have failed; (d) many of the screenshots really need to be retaken for today’s modern displays and resolutions; and (e) I got distracted by other things.

So, while I still need and want to finish rebuilding the site, so I can move it to a new hosting provider in a graceful way (for once!), I’m currently just letting things drift along until I can find some more extended time to work on things in one go. (I am a lucky person that my job can be done 100% remote / work from home, given 2020. But it’s also been a bit of a curse, in terms of time taken from Real Life. With a little luck, and some effort at work, I should be able to take a vacation soon…)

Wow! Thanks for sharing. I hope you’ll find the time somehow.