How to process web site content presented as PDF “blob”?

I’m trying to do some basic web page processing in Shortcuts. Essentially, convert web page to PDF and then do other downstream file management. No problems for typical sites or PDFs.

I am getting tripped up on one site that renders a PDF “blob” version of the article. When I run the shortcut from the share sheet, I’m not able to get to the content. If I choose Print from the share sheet, pinch out to get the PDF, and then run the shortcut from the print share sheet, everything works fine. I would like to avoid the extra steps of creating the PDF from the print sheet. Instead, I would like to run the shortcut against the original “blob” PDF created by the site.

I’ve tried all of the methods that I can think of to coerce a usable document (get article, make PDF, get contents, etc.). I’m not familiar with this method of presenting PDF content on a website.

Any insights?

Unfortunately, I can’t share an example, as the site I’m using to generate the PDFs is behind a uid/password (Harvard Business Publishing for Education).

I imagine it’s obvious to others, but this format is new to me, so I hadn’t realized it until now: looking at the page source on my Mac, I see the output is being rendered via JavaScript.

I’m assuming I’m out of luck with being able to get to the content other than through the Print path I describe above.

I would be interested to know if there are other options I could consider. Thanks

Yes, it is possible to automate. Have a look here:

You should be able to take the script inside the string from the last script I wrote there and use it in the Execute JavaScript in Web Page action, followed by a Get Contents from Web Page action to convert the data URL into the PDF.

Let me know if it works!

Thanks for the tip. I’m in way over my head in terms of JavaScript understanding, though I’m trying to puzzle through it. One clarifying question…

What URL construct should I use in the shortcut? The site is returning URL in the format of:

blob:https://hbsp.harvard.edu/6a3e4d1a-2403-4e

I’ve tried a variety of formats (with/without “blob”) but nothing is working. Depending on format I’m either getting a (No identifiers allowed directly after numeric literal) error or a (Unexpected token ‘:’. Expected ‘;’ after variable declaration) error.

Take the full URL that you get. And don’t forget the quotation marks " around it in the JavaScript:

let url = "<shortcuts variable containing the url>";
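
To illustrate why the quotation marks matter, here is a minimal sketch you can run in any JavaScript console (it is not part of the shortcut itself). Shortcuts pastes the variable's text into the script before it runs, so an unquoted blob URL gets parsed as code. The `parses` helper is a made-up name for this demo, the sample URL is the one from your post, and the exact error wording differs between JavaScript engines.

```javascript
// Hypothetical helper for this demo: returns true if `source` is valid JavaScript.
// new Function() only compiles the text here; nothing is executed.
function parses(source) {
    try {
        new Function(source);
        return true;
    } catch (e) {
        if (e instanceof SyntaxError) return false;
        throw e;
    }
}

// The blob URL text exactly as it would be pasted into the script.
const pasted = "blob:https://hbsp.harvard.edu/6a3e4d1a-2403-4e";

const withoutQuotes = "let url = " + pasted + ";";
const withQuotes = 'let url = "' + pasted + '";';

console.log(parses(withoutQuotes)); // false
console.log(parses(withQuotes));    // true
```

Without the quotes, the parser reads `blob` as a variable name and then trips over the `:`, which matches the second error you got.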

@schl3ck thanks again for trying to help here. Still not working for me. See the following error.

If it helps, the current construct of the URL is the following (result of the copy to clipboard step):

blob:https://hbsp.harvard.edu/2b03c983-fd58-4cf4-ba28-44f37b9db9ce

How could I forget that? :man_facepalming: Silly me trying to use await outside of an async function… (works only in Scriptable).

Thanks for not giving up!

The URL is exactly how it should be. And nobody can use it to get your PDF, because it references an object in your website session. If you close the website, you too have lost the ability to get the PDF. That’s how these blob URLs work.
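
To make the structure of such a URL a bit more concrete, here is a small sketch using the standard `URL` class, with the URL you posted as sample input. It only takes the string apart; nothing is fetched.

```javascript
// The blob URL from the post above (sample input only).
const blobUrl = new URL("blob:https://hbsp.harvard.edu/2b03c983-fd58-4cf4-ba28-44f37b9db9ce");

console.log(blobUrl.protocol); // "blob:"
console.log(blobUrl.origin);   // "https://hbsp.harvard.edu"

// Everything after the last slash is just a random ID. It is the key under
// which the browser keeps the PDF data in memory for this page session;
// it is not an address on the server.
const id = blobUrl.pathname.split("/").pop();
console.log(id); // "2b03c983-fd58-4cf4-ba28-44f37b9db9ce"
```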

Try this code:

let url = "<shortcuts variable>";

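fetch(url)
    .then(r => r.blob()) // r.blob() returns a Promise, so wait for it to resolve first
    .then(blob => {
        blobToDataURL(blob, completion); // pass the result directly to the completion function
    });

// Reads the blob and calls callback with a data URL (data:<type>;base64,<data>)
function blobToDataURL(blob, callback) {
    var a = new FileReader();
    a.onload = function(e) {callback(e.target.result);}
    a.readAsDataURL(blob);
}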
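
For reference, `FileReader.readAsDataURL` hands the callback one long string of the form `data:<type>;base64,<data>`. Here is a small sketch for sanity-checking such a string before passing it on. `looksLikePdfDataUrl` is a name I made up for this example, and the sample strings are tiny stand-ins, not real documents.

```javascript
// Hypothetical helper: true if `dataUrl` is a base64 data URL whose decoded
// content starts with the PDF magic bytes "%PDF-".
function looksLikePdfDataUrl(dataUrl) {
    const marker = ";base64,";
    const start = dataUrl.indexOf(marker);
    if (!dataUrl.startsWith("data:") || start === -1) return false;
    const dataStart = start + marker.length;
    try {
        // Decode only the first 8 base64 characters (6 bytes); enough for the header.
        const head = atob(dataUrl.slice(dataStart, dataStart + 8));
        return head.startsWith("%PDF-");
    } catch (e) {
        return false; // not valid base64
    }
}

// "JVBERi0xLjQ=" is the base64 encoding of "%PDF-1.4".
console.log(looksLikePdfDataUrl("data:application/pdf;base64,JVBERi0xLjQ=")); // true
console.log(looksLikePdfDataUrl("data:text/plain;base64,aGVsbG8="));         // false
```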

@schl3ck I do appreciate your help. Frustrating for me to do this blindly. I simply don’t have the expertise nor have I been able to fall back on Google as teacher. :roll_eyes:

Here’s latest. At least the error message is changing :rofl:

Thanks for sticking around and not giving up! I’m happy to help and will be even happier when it works :wink:

Do you run the action against the webpage containing the PDF from the share sheet? I ask because the variable is empty in every screenshot.

Each time, I assumed you had removed the variable for privacy reasons, so I didn’t ask about it.
If it really is empty in your shortcut, then that’s the source of the last error. It should be the shortcut input when running the shortcut from the share sheet in Safari.

Well, that was silly on my part. Fixed that.

Progress, in that there are no more errors; however, there is now no obvious output when I run it from the share sheet. See the linked example (though there’s no interesting debugging info there).

I’ve added both a View Content Graph and a Quick Look step immediately following the Run JavaScript action. Neither is invoked.

It’s worth noting I’ve scoured Google results, Stack Overflow, and many other sites trying to figure this out on my own. A lot of time spent but, obviously, to no avail. I feel like I need to come clean… I am trying… :smile:

Looks like I’ll have to try it on my own… I’ll write here when I’ve got something.

I have bad news. I’ve tried several different ideas, but it doesn’t work. The shortcut crashes every time it tries to access the fetch function (even when I don’t call it, just reference it). It looks exactly like in your video. I don’t think it will work using XMLHttpRequest either, because both Firefox and Chrome gave me a CORS error on calling fetch. Basically, this means that the code is not allowed to access the URL.
I’ve also tried to prevent the browser from navigating to the blob URL, saving the URL instead and then reading it out with a different shortcut, but the latter one crashes too… And the best part was that Shortcuts crashed for me every time I wanted to delete selected text. Maybe because I use a third-party keyboard called nintype, maybe a bug in Shortcuts, maybe because I’ve only 1.5 GB of free space, I don’t know. But that’s not your fault!

I’m sorry.

Well that’s too bad. Thanks again for trying to assist.

Clearly the problem relates to your shortage of free space :wink:

Cheers! — jay

…that feeling when you check messages hoping to see a response from @sylumer :rofl:

I’m sure @schl3ck is way more familiar and skilled with JavaScript than I am. But I honestly haven’t taken a look because I have no way to reliably test any solution. Working blind like that, it can take a lot of time to work out a solution, if one can be found at all.

Not an issue through anything you’ve done, just the nature of the scenario.