md4c

Newlines between attributes break HTML block detection

Open manowicz opened this issue 6 months ago • 3 comments

The following HTML block is not detected as HTML in a .md file:

<img src="Documentation/Images/BasicHypergraphPlot.png"
     width="478"
     alt="Out[] = ... a plot showing 3 triangles connected at vertices labeled 2 and 4 ...">

It is detected as normal text data.

When newlines are removed, this block is correctly detected as HTML:

<img src="Documentation/Images/BasicHypergraphPlot.png" width="478" alt="Out[] = ... a plot showing 3 triangles connected at vertices labeled 2 and 4 ...">

Newlines between attributes should be ok.
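
For anyone who wants to reproduce this, here is a minimal sketch. It assumes the md2html tool that ships with MD4C is on your PATH and reads Markdown from stdin; the scratch file from mktemp is only an illustration and not part of the original report. No expected output is shown because it depends on the MD4C version.

```shell
# Write the multi-line tag from the report to a scratch file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<img src="Documentation/Images/BasicHypergraphPlot.png"
     width="478"
     alt="Out[] = ... a plot showing 3 triangles connected at vertices labeled 2 and 4 ...">
EOF

# Confirm the tag really spans several lines before blaming the parser.
line_count=$(wc -l < "$sample" | tr -d ' ')
echo "lines in sample: $line_count"

# Assumption: md2html is installed and accepts Markdown on stdin.
if command -v md2html >/dev/null 2>&1; then
    md2html < "$sample"
else
    echo "md2html not found; skipping conversion"
fi

rm -f "$sample"
```

Feeding the file through stdin avoids any quoting or escaping by the calling program, so the parser sees exactly the bytes shown above.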

manowicz avatar Jun 23 '25 20:06 manowicz

I think it could be a bug, unless the ellipsized alt attribute value you quoted actually contains one or more newlines. Does it? In that case it violates the CommonMark spec, and MD4C would parse the entire tag as text instead of raw HTML.

Incidentally, the SO answer you linked does not apply to CommonMark. It applies to HTML in web browsers. For CommonMark, refer to https://spec.commonmark.org/0.31.2/#raw-html.

step- avatar Jun 23 '25 21:06 step-

Thanks for correcting the spec to reference. The alt value itself doesn't contain newlines. Without the alt attribute the situation is unchanged:

"<img src=\"Documentation/Images/BasicHypergraphPlot.png\"
     width=\"478\">"

is not detected as HTML, while

"<img src=\"Documentation/Images/BasicHypergraphPlot.png\" width=\"478\">"

is detected.

manowicz avatar Jun 24 '25 21:06 manowicz

Is that the actual raw HTML you're passing to MD4C? Both samples are invalid HTML. This is what the W3C HTML validator reports:

Error: " in an unquoted attribute value. Probable causes: Attributes running together or a URL query string in an unquoted attribute value.

At line 7, column 11

ody>↩<img src=\"Documentation/

If your markdown contains backslash-quote, MD4C will parse it as text because that's what it is, both in CommonMark and in HTML. Try replacing backslash-quote with single-quote. Here are some tests you can try (Linux shell, hopefully it's clear enough):

md2html <<< '<tag attr=\"value\">'

Parsed as text because that's what it is.

<p>&lt;tag attr=&quot;value&quot;&gt;</p>

md2html <<< '<tag attr="value">'

Parsed as raw HTML.

<tag attr="value">

md2html <<< '<tag attr="value"
attr="value">'

Also parsed as raw HTML.

<p><tag attr="value"
attr="value"></p>

Disabling the HTML feature parses everything as text.

md2html --fno-html <<< '<tag attr=\"value\">'
<p>&lt;tag attr=&quot;value&quot;&gt;</p>
md2html --fno-html <<< '<tag attr="value">'
<p>&lt;tag attr=&quot;value&quot;&gt;</p>
md2html --fno-html <<< '<tag attr="value"
attr="value">'
<p>&lt;tag attr=&quot;value&quot;
attr=&quot;value&quot;&gt;</p>
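
To check what actually reaches the parser without involving MD4C at all, here is a small sketch. It relies only on the fact that a POSIX shell keeps backslashes literally inside single quotes; the variable names are just for illustration.

```shell
# Inside single quotes the shell does not process backslashes,
# so the first string really contains \" and the second a plain ".
escaped='<tag attr=\"value\">'
plain='<tag attr="value">'

printf '%s\n' "$escaped"
printf '%s\n' "$plain"

# Count the backslash characters in each string:
# tr -cd deletes every character except the backslash.
escaped_backslashes=$(printf '%s' "$escaped" | tr -cd '\\' | wc -c | tr -d ' ')
plain_backslashes=$(printf '%s' "$plain" | tr -cd '\\' | wc -c | tr -d ' ')
echo "escaped: $escaped_backslashes, plain: $plain_backslashes"
```

If the count for your real input is not zero, the backslashes are part of the Markdown source, which would explain why the tag is parsed as text.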

step- avatar Jun 25 '25 16:06 step-