Extracting a reference from a link URL - 2 different formats, either individually or from all links on a page

So the issue I’m trying to address is how best to be able to extract a reference number from a link URL.

Currently I’m right clicking, copying the link and pasting it into a TextEdit document and manually selecting the part of the URL I want.

https://www.domain.com/serviceX/reference/**this-is-what-I-want**/description

Though another site has similar content but with a different URL format

https://www.domain2.com/reference/serviceY/**this-is-what-I-want**

A single webpage may have 2-40 different links that conform to one or other of these formats.

What’s the easiest way to allow me to right click either on each link individually or ideally once on each page to compile either a TextEdit file or a Note with all the unique this-is-what-I-want values that are on the page?

The Mac that I’m most likely to run it on runs Catalina and so doesn’t have Shortcuts, but if Shortcuts ends up as the best way to collate the info I could use a Mac that has Shortcuts on it.

OK, so a little bit of a brute force ‘improvement’ to my current workflow.

Instead of copying the link and then pasting into a TextEdit file, I instead right click, add link to reading list, blitz through the links I need, then use the Reading List Exporter shortcut to save my expanded Reading List into a text document. Copy/paste the links I need into Numbers, then use the MID function to extract the 8 character reference that I need. This works for one of the URL formats. When I need to parse the other format I’ll need to use the RIGHT function instead.

Big improvement in time as the add link to reading list allows me to stay in Safari instead of going Safari - TextEdit - Safari
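The MID/RIGHT step can also be done without a spreadsheet. This is only a minimal sketch, assuming the exported links sit one per line and follow exactly the two placeholder formats shown earlier. Splitting each URL on `/` puts the reference in the sixth field for both formats, so one `cut` covers both. The function name `extract_refs` and the sample references are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the 6th "/"-separated field of each URL on stdin.
# Field layout: https: / (empty) / host / segment / segment / REFERENCE / ...
extract_refs() {
    cut -d/ -f6
}

# Made-up sample input using the placeholder formats from the question.
printf '%s\n' \
    'https://www.domain.com/serviceX/reference/abc12345/description' \
    'https://www.domain2.com/reference/serviceY/xyz67890' |
    extract_refs
```

This depends on the number of path segments before the reference staying the same, so real URLs with a different depth would need a different field number.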

Shortcuts would definitely give you a good set of tools to take advantage of for parsing the page, but on the machine without Shortcuts, do you have access to any third party automation tools (e.g. Keyboard Maestro), or are you running a vanilla system in that respect? This will simply help clarify the baseline for your options.


No, I don’t have Keyboard Maestro on any machine.

Hazel and TypeIt4Me are the 2 ‘power tools’ that I have.

Cheers

Okay, given the requirement, the constraints, and the information provided to this point, here is an Automator workflow you can use as your starting point to be tailored to your actual needs.

The workflow is set as a service (/quick action … Apple still seem to mix their terminology on this), so should be installed as such. It is set to receive text - though it does not really do anything with it, except to replace it.

There are three parts to the automation. The first gets the front most page’s URL from Safari. The next gets all of the URLs from that page. The third is a script that strips down the links to only what you want.

cat | egrep "domain.com|domain2.com" | sed -E "s/https:\/\/www.domain.com\/serviceX\/reference\/(.*)\/.*/\1/g" | sed -E "s/https:\/\/www.domain2.com\/reference\/serviceY\///g"

cat takes the output of the previous step and passes it on to egrep for processing.

egrep filters out URLs for the domains you don’t want. If necessary, you could expand these matches out to be even more specific. You didn’t provide any actual example pages to work with, so I simplified this part to make it a bit easier to explain.

egrep then passes on to the first of two sed commands. The first searches for the first format of URL and extracts the part you want using a regular expression pattern matched substitution. This output is then passed to the second sed command, which does the same for the second format of URL.

The result of the second sed is output by the service to replace the selected text.
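To see what each stage contributes, the pipeline can be run on a few sample lines outside Automator. The following is a sketch rather than the exact script from the workflow: it uses `grep -E` in place of egrep, `#` as the sed delimiter so the slashes need no escaping, and `[^/]+` instead of `.*` so the capture stops at the next slash. The function names and sample links are hypothetical.

```shell
#!/bin/sh
# Made-up sample links: two match the placeholder formats, one does not.
sample_links() {
    printf '%s\n' \
        'https://www.domain.com/serviceX/reference/abc12345/description' \
        'https://www.domain2.com/reference/serviceY/xyz67890' \
        'https://www.example.org/about'
}

# Stage 1: keep only lines that mention either domain.
keep_wanted() {
    grep -E 'domain\.com|domain2\.com'
}

# Stage 2 (format 1): keep the segment between /reference/ and the next slash.
strip_format1() {
    sed -E 's#https://www\.domain\.com/serviceX/reference/([^/]+)/.*#\1#'
}

# Stage 3 (format 2): delete everything up to and including /serviceY/.
strip_format2() {
    sed -E 's#https://www\.domain2\.com/reference/serviceY/##'
}

sample_links | keep_wanted | strip_format1 | strip_format2
```

Running any prefix of the final pipeline (for example `sample_links | keep_wanted`) shows the intermediate result, which makes it easier to adjust one pattern at a time.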

If you are unfamiliar with cat, sed, or egrep, there are hundreds if not thousands of articles, tutorials and man pages just a search engine’s use away, though it is useful to know that egrep is effectively equivalent to grep -E.

Episode 34 is about regular expressions if you need to dive into those for the first time, and again there are many tutorials online and lots of useful references in this forum if you search for them.

With the above in place, if I select a suitable sample page (like this one I created with four links matching your formats and another that does not) in Safari, and highlight some text in TextEdit and select the example service provided above …

… I get an Automation pop-up, which I OK …

… and then the parts of the links you described as being required are output into TextEdit.

[Screenshot: 2023-08-26-18.52.08@2x]

You could assign this a keyboard shortcut and then just trigger it in TextEdit to automatically insert the links you want from Safari.

There are many other ways you could tailor the automation too, but as noted above, the intention is to give you a viable starting point you can tailor to your needs.

Once you have modified the Automator workflow/service/quick action to work with the actual URLs you want to work with (rather than domain.com and domain2.com with serviceX and serviceY), you should be good to go.

Hopefully, everything is now laid out for you so that with just a small amount of effort you should be able to amend what has been provided to your needs as set out in your post(s).

Enjoy.

Wow, thanks.

I’ll need to break down the elements of the pattern matching you’ve done for the sed command to be sure I understand it.

I’ll let you know how I get on.

OK, so I was able to adjust the workflow to match my requirements. Rather than just grabbing all of the links from a page, by expanding the match terms in the initial egrep command I was able to cut the number of results down dramatically to a much more usable number.
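For anyone following along, a tightened filter might look something like this. It is a sketch only, assuming the placeholder formats from the original question: the pattern requires the full path prefix rather than just the domain, and `sort -u` collapses repeats so each reference appears once. The function names and sample links are hypothetical, and `cut -d/ -f6` is used here simply as a compact way to pull out the sixth slash-separated field.

```shell
#!/bin/sh
# Hypothetical tightened filter: require the full path prefix of either format,
# not just the domain name.
keep_reference_links() {
    grep -E '^https://www\.domain\.com/serviceX/reference/[^/]+/|^https://www\.domain2\.com/reference/serviceY/[^/]+'
}

# Collapse repeated references; sort -u also sorts the output.
unique_refs() {
    sort -u
}

# Made-up sample: a duplicate reference and an unrelated link on the same domain.
printf '%s\n' \
    'https://www.domain.com/serviceX/reference/abc12345/description' \
    'https://www.domain.com/serviceX/reference/abc12345/other-page' \
    'https://www.domain.com/serviceX/help' \
    'https://www.domain2.com/reference/serviceY/xyz67890' |
    keep_reference_links | cut -d/ -f6 | unique_refs
```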

I’m currently having to open the workflow in Automator and run it manually. Even though I have it copied into my ~/Library/Services/ folder, it doesn’t seem to show up in any of my Services contextual menus, and it doesn’t appear as an option to enable in the System Preferences pane either. But this is certainly a workable solution for me - Thanks

Services appear under Keyboard shortcuts in the keyboard settings.

It should appear under Text for enabling/disabling.

If you select some text and then the services menu you should hopefully see it. This should match the selections I had in the screenshot I shared above of using the automation.

OK, so I added an action to append the results of the workflow to the current text document.

If I have the webpage I want processed in front and go to Safari, Services, then it says no services apply. If I select some random text on the page, right-click and select Services, then I get a selection of options but not this workflow.

If I go over to TextEdit, with a file open but nothing selected, then TextEdit, Services shows no services. If I select some text, right-click and choose Services, the workflow is one of the options, but if I select it, it only runs to the end of the shell script and never carries out the Set Contents of TextEdit Document action which I added. If I don’t select anything and right-click, then the Services option doesn’t show up.

However if I have Safari open at the page I want, and a TextEdit file open and run the workflow from within Automator, it adds the results to the end of the text file.

What’s different between running it from Automator and running it as a Service from TextEdit?

Fundamentally you’re mixing up two different ends of the service approach (either pull data from Safari or push data to TextEdit). Choose one approach or the other and don’t try to merge them.

In terms of service vs workflow, services are generally more accessible as you don’t need to open a file in Automator to trigger them.
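One way to keep to a single direction, assuming a plain text file is an acceptable destination in place of an open TextEdit document, is to let the shell step itself append the references to a file, so no separate TextEdit action is needed. This is a sketch under that assumption, not something tested inside Automator; the file path and the function name `append_new_refs` are hypothetical.

```shell
#!/bin/sh
# Hypothetical destination; any writable path would do.
REF_FILE="${REF_FILE:-$HOME/Documents/references.txt}"

# Append each reference read from stdin to the file named in $1,
# skipping blank lines and references the file already contains.
append_new_refs() {
    touch "$1"
    while IFS= read -r ref; do
        [ -n "$ref" ] || continue
        # -q quiet, -x whole-line match, -F fixed string (no regex)
        grep -qxF -- "$ref" "$1" || printf '%s\n' "$ref" >> "$1"
    done
}

# In the workflow this would sit at the end of the extraction pipeline, e.g.
#   ... | sed -E "..." | append_new_refs "$REF_FILE"
```

Because duplicates are skipped, running it on several pages in a row builds up one list of unique references.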