Converting HTML to plain text in a Scriptable script

Hello. I am writing a script that uses WebView to fetch the HTML of some links. I want to search that content for certain regular expressions, and converting the HTML to plain text first would make the search easier. Does anyone know how I might do this in a Scriptable script?

The first couple of options here look viable for straightforward use in Scriptable.

Okay. Thanks so much.

I wanted to use the second method from the page suggested in the first reply above, so I wrote the module and test script below. When I run the test script, I get the following error:

2021-12-26 22:37:29: Error on line 15:30: ReferenceError: Can't find variable: document

Line 15 is the following line in the module:

tempDivElement = document.createElement("div");

Anyone have any idea what I need to do to make the error message go away? Any help would be greatly appreciated. Thanks so much.

HTMLToPlainText.js:

/* Module which exports a single function to convert HTML to plain text
 */
let tempDivElement

module.exports.convertToPlainText = (html) => {
  // Check to see if the div element variable has been initialized
  if ( typeof tempDivElement != 'undefined' )
  {
    // It has been... remove the previously created element
    log('Removing previously created <div> element from document tree')
    tempDivElement.remove();
  }
  // Create a new div element
  log('Inserting a new <div> element into document tree')
  tempDivElement = document.createElement("div");
    
  // Set the HTML content with the given value
  tempDivElement.innerHTML = html;
    
  // Retrieve the text property of the element 
  return tempDivElement.textContent || tempDivElement.innerText || "";
}

TestHTMLToPlainText.js:

/* Test the HTMLToPlainText module
 */
var URLAlert, URLText = "";
var webView, html;

let HTMLToPlainText = importModule('HTMLToPlainText')

URLAlert = new Alert();
URLAlert.title = "Test URL";
URLAlert.message = "Please enter a test URL:";
URLAlert.addAction("OK");
URLAlert.addCancelAction("Cancel");
URLAlert.addTextField("Test URL");
let alertIndex = await URLAlert.presentAlert();
if (alertIndex >= 0) {
  URLText = URLAlert.textFieldValue(0);
}

if (URLText != "") {
  webView = new WebView();
  await webView.loadURL(URLText);
  let html = await webView.getHTML();
  let text = HTMLToPlainText.convertToPlainText(html)
  QuickLook.present(text)
}

Scriptable does not have a document object - it is not itself a web browser (hence no DOM).

You would need to get the WebView to do the document evaluation by injecting the code you want to run into it (or by building your own page containing the HTML and your additional code). Scriptable would then need to get the result back from the WebView.
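
A minimal sketch of that approach, assuming Scriptable's `WebView.loadHTML()` and `WebView.evaluateJavaScript()`, both of which return promises. The function name `htmlToPlainText` is hypothetical, and the WebView is passed in as a parameter so the function depends on nothing else:

```javascript
// Sketch: let the WebView, which does have a DOM, do the conversion.
// `htmlToPlainText` is a hypothetical name, not part of Scriptable.
async function htmlToPlainText(webView, html) {
  // Load the HTML string into the WebView so it is parsed into a DOM.
  await webView.loadHTML(html);
  // The value of the evaluated expression is handed back to the caller.
  return webView.evaluateJavaScript("document.body.innerText");
}
```

In a Scriptable script this could be called as `let text = await htmlToPlainText(new WebView(), html)`.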

I’m sorry. I’m kind of new to Scriptable and JavaScript (I am more familiar with C, C++ and Java). Maybe you can help me with a compile error that I am now encountering. I modified the module to feed the JavaScript to the WebView (the code is pasted below). Now, when I try to compile the module, I get the following error:

2021-12-27 11:51:35: Error on line 8: SyntaxError: Unexpected identifier 'webView'. Expected ';' after variable declaration.

Line 8 is the following line:

let text = await webView.evaluateJavaScript(javaScript);

Might you (or anyone else for that matter) know what I am doing wrong?

HTMLToPlainText.js:

/* Module which exports a single function to convert HTML to plain text
 */

module.exports.convertToPlainText = (webView, html) => {
  javaScript = "let parser = new DOMParser(); " +
   "let doc = parser.parseFromString(\"" + html + "\", 'text/html'); " +
   "doc.innerText";
  let text = await webView.evaluateJavaScript(javaScript);
  return text;
}
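
The syntax error arises because `await` is only a keyword inside a function declared `async` (or at the top level of a Scriptable script). In a plain arrow function it is parsed as an ordinary identifier, so `await webView` looks like two identifiers in a row. A sketch of a corrected version follows, with two further changes that are assumptions on my part rather than anything confirmed in this thread: the HTML is embedded with `JSON.stringify` so that quotes and newlines in the page cannot break the injected script, and the text is read from `doc.body.textContent` because a `Document` object has no `innerText` property of its own.

```javascript
// `async` makes `await` legal inside the function body.
async function convertToPlainText(webView, html) {
  // JSON.stringify yields a fully escaped JavaScript string literal.
  // A single expression is used so that no variables are left behind
  // in the page between calls.
  const script =
    "new DOMParser().parseFromString(" +
    JSON.stringify(html) +
    ", 'text/html').body.textContent";
  return await webView.evaluateJavaScript(script);
}
// In the Scriptable module, export it with:
// module.exports.convertToPlainText = convertToPlainText;
```

Note that `textContent` also includes the text of any `<script>` or `<style>` elements inside the body, so the result can be noisier than `innerText` on a rendered page.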

Somehow, by the grace of God, I was able to find a solution. I am posting the module and the test script that ended up working for me below in case it might help someone. Blessings from above to all.

HTMLToPlainText.js:

/* Module which exports a single function to return the plain text of the
 * HTML in a WebView.
 */
module.exports.getPlainText = (webView) => {
  return webView.evaluateJavaScript("document.documentElement.innerText");
}

TestHTMLToPlainText.js:

/* Test the HTMLToPlainText module
 */
var URLAlert, URLText = "";
var webView;

let HTMLToPlainText = importModule('HTMLToPlainText');

URLAlert = new Alert();
URLAlert.title = "Test URL";
URLAlert.message = "Please enter a test URL:";
URLAlert.addAction("OK");
URLAlert.addCancelAction("Cancel");
URLAlert.addTextField("Test URL");
let alertIndex = await URLAlert.presentAlert();
if (alertIndex >= 0) {
  URLText = URLAlert.textFieldValue(0);
}

if (URLText != "") {
  webView = new WebView();
  await webView.loadURL(URLText);
  let text = await HTMLToPlainText.getPlainText(webView);
  QuickLook.present(text)
}
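
With the plain text in hand, the regular-expression search mentioned in the original question can run on it directly. A minimal sketch using the standard `String.prototype.matchAll`; the helper name `findMatches` and the sample pattern in the note below are hypothetical:

```javascript
// Hypothetical helper: return every match of `pattern` in `text`.
function findMatches(text, pattern) {
  // matchAll requires the global flag, so add it if the caller left it out.
  const flags = pattern.flags.includes("g") ? pattern.flags : pattern.flags + "g";
  const globalPattern = new RegExp(pattern.source, flags);
  // Keep only the full matched string from each match object.
  return Array.from(text.matchAll(globalPattern), (match) => match[0]);
}
```

In the test script, something like `findMatches(text, /\d{4}-\d{2}-\d{2}/)` would then list every ISO-style date found in the page text.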

I’m wondering how hard it would be to take this code and convert it to writing the output as Markdown (with HTML for the bits that won’t convert).

Quite hard in general, but not too bad for limited elements, is my guess.

I would not know offhand what converting from HTML to Markdown in Scriptable would entail. I found the following question on Stack Overflow that might help:

https://stackoverflow.com/questions/1319657/javascript-to-convert-markdown-textile-to-html-and-ideally-back-to-markdown-t

Feel free to use the code as you would like.

I’ve used turndown previously to convert HTML to Markdown, but that is a Node.js module. It has a UMD version that might work in Scriptable. There is also a browser version, so that would definitely work in the WebView if the UMD version doesn’t.
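
A rough, untested sketch of the WebView route, assuming the browser build of turndown defines a global `TurndownService` with a `turndown(html)` method, as its README describes. The function name `htmlToMarkdown` is hypothetical, and the library source is passed in as a string so the caller decides where it comes from:

```javascript
// Sketch: run turndown's browser build inside a WebView that already
// has a page loaded. `turndownSource` is the text of that build.
async function htmlToMarkdown(webView, turndownSource) {
  // Evaluate the library first, then convert the page body. The value of
  // the final expression (the Markdown string) is returned to the caller.
  const script =
    turndownSource +
    "\n;new TurndownService().turndown(document.body.innerHTML)";
  return webView.evaluateJavaScript(script);
}
```

In Scriptable, the source could be fetched with something like `await new Request("https://unpkg.com/turndown/dist/turndown.js").loadString()`. That URL is the one I recall from the turndown README, so treat it as an assumption and check it before relying on it.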