Rich Text to Markdown loses paragraphs

Calion · February 16, 2022, 6:56am

I think I’ve found a bug in the Rich Text to Markdown action.

It takes this:

<p class="text" style="background-color: rgba(227, 191, 255, 1.00); border: 1px solid rgba(34, 34, 34, 0.10); box-sizing: border-box; display: inline-block; margin: -12px 8px 0;padding: 8px 12px;border-radius: 4px; ">
<span class="_580f17fcb5303652">This is a really good question, and Mill makes a good point, I think. Let me throw in my answer: Thales. Thales started by asking, &ldquo;What is the arche of phusis?&rdquo; and later thinkers didn&rsquo;t want to be satisfied with less profound answers. However, this isn&rsquo;t the whole of the answer. Another, perhaps better way to put this is that they wanted to know what the ultimate nature of reality was, and were too naive, in their young (in science and philosophy) innocence, that to make enormous conclusions from limited data was dangerous. Think of them as unsophisticated children.</span><span class="_26efaf6a4335c49e">

</span><span class="_580f17fcb5303652">Let me add something to this: What were their prototypes? Who could they look to for examples? Only religion. That was their only predecessor. And what else does religion do but draw enormous conclusions from extremely limited data? Besides, it would seem paltry and small to draw *less* profound conclusions than religion does; it would seem like a trivial, subordinate enterprise, rather than a new and profound way of looking at the Universe.</span><span class="_26efaf6a4335c49e">

</span><span class="_580f17fcb5303652">It&rsquo;s also kind of odd in that only a physicist would ask this question; a philosopher would not think of it. This is how philosophers *think.* I mean, this is what you *do* when you reason from first principles. Yes, they took logical leaps; they didn&rsquo;t know better yet.</span>
</p>

and turns it into this:

This is a really good question, and Mill makes a good point, I think. Let me throw in my answer: Thales. Thales started by asking, "What is the arche of phusis?" and later thinkers didn't want to be satisfied with less profound answers. However, this isn't the whole of the answer. Another, perhaps better way to put this is that they wanted to know what the ultimate nature of reality was, and were too naive, in their young (in science and philosophy) innocence, that to make enormous conclusions from limited data was dangerous. Think of them as unsophisticated children. Let me add something to this: What were their prototypes? Who could they look to for examples? Only religion. That was their only predecessor. And what else does religion do but draw enormous conclusions from extremely limited data? Besides, it would seem paltry and small to draw *less* profound conclusions than religion does; it would seem like a trivial, subordinate enterprise, rather than a new and profound way of looking at the Universe. It's also kind of odd in that only a physicist would ask this question; a philosopher would not think of it. This is how philosophers *think.* I mean, this is what you *do* when you reason from first principles. Yes, they took logical leaps; they didn't know better yet.

This is really frustrating to say the least! Is there a way around this issue?

kopischke · February 16, 2022, 7:32am

The action’s result is correct. HTML does not separate paragraphs on line breaks; those only make the source easier to read. In HTML, a paragraph is the contents of a <p></p> element, of which your source contains exactly one.
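
To illustrate the point about line breaks, here is a minimal Python sketch of the whitespace collapsing that HTML rendering applies to text inside an element. It is only an approximation of the idea, not the code Shortcuts runs; the function name `collapse_whitespace` is made up for this example.

```python
import re

def collapse_whitespace(text: str) -> str:
    # Any run of spaces, tabs, or line breaks becomes a single space,
    # which is roughly how HTML treats whitespace inside a paragraph.
    return re.sub(r"[ \t\r\n\f]+", " ", text).strip()

source_text = "Think of them as unsophisticated children.\n\nLet me add something to this:"
print(collapse_whitespace(source_text))
```

The blank line between the two sentences disappears, which matches the single-paragraph output shown in the first post.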

Martin_Packer · February 16, 2022, 8:05am

And if it had been <div> elements instead of <span> the result would’ve been different. Possibly better, possibly not.

And - to that point - maybe a global find and replace before conversion would help.
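
A rough sketch of what such a find and replace might look like, written in Python for illustration. It assumes the spacer spans always look like the ones in the sample above (a span containing nothing but whitespace with at least two line breaks) and that the converter treats each <p> element as its own paragraph. The names `SPACER_SPAN` and `split_paragraphs` are invented for this example.

```python
import re

# A <span> whose entire content is whitespace containing two or more line breaks.
SPACER_SPAN = re.compile(r"<span\b[^>]*>[ \t]*(?:\r?\n[ \t]*){2,}</span>")

def split_paragraphs(html: str) -> str:
    # Close the current paragraph and open a new one where the spacer span was.
    # The new <p> does not inherit the attributes of the original one.
    return SPACER_SPAN.sub("</p>\n<p>", html)

sample = (
    '<p class="text">\n'
    '<span class="a">First paragraph.</span><span class="b">\n'
    '\n'
    '</span><span class="a">Second paragraph.</span>\n'
    '</p>'
)
print(split_paragraphs(sample))
```

The same pattern could probably be used in a Shortcuts "Replace Text" action with regular expressions turned on, though that has not been tested here.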

steve1 · February 16, 2022, 11:51am

If the ‘span’s were ‘div’s in this example, that wouldn’t be valid HTML, or did you mean in general?

sylumer · February 16, 2022, 12:52pm

Why would that not be valid HTML?

As an experiment, I tried it in a W3C validator, and swapping span for div didn’t change the results.

I think Martin’s tangential point was that unlike a div, a standard span should not change the new line-based layout.

Martin_Packer · February 16, 2022, 12:54pm

The only thing I see wrong is div’s inside of p’s - and I’m not even sure that’s wrong.

drdrang · February 16, 2022, 1:34pm

I agree with everyone above that the output you’re getting is correct for the input. That doesn’t solve your problem, but it does indicate that the problem is not with Shortcuts—it’s with the program that’s generating the rich text (HTML).

If you tell us where you’re getting the rich text and show us a screenshot of what it looks like, maybe someone with experience with that program can suggest a workaround.

steve1 · February 16, 2022, 1:59pm

A ‘P’ tag is ‘flow content’ and can contain ‘phrasing content’ (HTML Standard).

A ‘DIV’ tag is ‘flow content’ and can contain ‘flow content’ (HTML Standard).

So a ‘P’ can go in a ‘DIV’ but a ‘DIV’ can’t go in a ‘P’.
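
The nesting rule can be written down as a toy lookup in Python. This is a hand-picked subset covering only the three elements discussed in this thread, not the full content model from the standard, and the names `CATEGORIES`, `CONTENT_MODEL`, and `can_contain` are invented for this sketch.

```python
# Categories each element belongs to (subset, for illustration only).
CATEGORIES = {
    "p": {"flow"},
    "div": {"flow"},
    "span": {"flow", "phrasing"},
}

# The kind of content each element is allowed to contain.
CONTENT_MODEL = {
    "p": "phrasing",
    "div": "flow",
    "span": "phrasing",
}

def can_contain(parent: str, child: str) -> bool:
    # A child is allowed if it belongs to the category the parent accepts.
    return CONTENT_MODEL[parent] in CATEGORIES[child]

print(can_contain("div", "p"))   # a P inside a DIV
print(can_contain("p", "div"))   # a DIV inside a P
print(can_contain("p", "span"))  # a SPAN inside a P
```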

sylumer · February 16, 2022, 2:24pm

Ah okay, those are the WHATWG standard rather than the W3C specification. Is there a validator you would recommend for those standards? I’ve only come across validation services that (eventually) come back to validity based on the W3C spec.

steve1 · February 16, 2022, 3:02pm

They appear to be one and the same: W3C links through to WHATWG.

https://www.w3.org/TR/html53/

(No idea on a good validator)

Calion · February 16, 2022, 3:19pm

Thanks; this is very helpful. If I’ve got to regex the source a bit before I convert it, I’m fine with that.

These are exported highlights from PDF Expert. What’s weird is that although there are paragraph breaks in the original comments, they don’t show up in the exported highlights when viewed in PDF Expert, but do show up when the rich text is imported into Obsidian! But I’d still prefer Markdown.

I can provide screenshots, but if someone can tell me how to convince the Markdown parser to accept these as paragraph breaks, that would fix the problem, I think.

steve1 · February 16, 2022, 3:57pm

It could be an issue with ‘\r’ vs ‘\n’ (vs ‘\r\n’) being interpreted differently. Maybe do a find/replace on that and then run through the markdown conversion?
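
For reference, a minimal Python sketch of that kind of line-ending find/replace. Whether it changes anything for the Markdown conversion is untested; as noted earlier in the thread, line breaks inside HTML are not paragraph separators anyway, so this would mostly matter if the text were handled as plain text. The function name `normalize_newlines` is made up for this example.

```python
def normalize_newlines(text: str) -> str:
    # Handle CRLF pairs first so they do not turn into two line feeds,
    # then convert any remaining lone carriage returns.
    return text.replace("\r\n", "\n").replace("\r", "\n")

mixed = "first\r\n\r\nsecond\rthird\nfourth"
print(repr(normalize_newlines(mixed)))
```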

Martin_Packer · February 17, 2022, 8:06am

It’s a pet peeve of mine that applications don’t provide much flexibility in how they export. But I suppose it’s a bit much to expect PDF Expert (which I also use, so I could have the same problem) to be extremely flexible about it.