commonmark-spec
Broken HTML parsing
I have the following 2 snippets:
<script>
Foo

Bar
</script>
<div>
  <script>
    Foo

    Bar
  </script>
</div>
Which are rendered respectively as:
<script>
Foo

Bar
</script>
<div>
  <script>
    Foo
<pre><code>Bar
</code></pre>
  </script>
</div>
On how to parse these the spec says, here:
Start condition: line begins with the string <script, <pre, or <style (case-insensitive), followed by whitespace, the string >, or the end of the line. End condition: line contains an end tag </script>, </pre>, or </style> (case-insensitive; it need not match the start tag).
This makes sense to me: pre, style and script tags often contain empty lines and should be parsed correctly.
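For reference, the quoted start and end conditions can be sketched as two regular expressions. This is only an illustration, not the reference implementation: the names START_TYPE1, END_TYPE1, starts_type1 and ends_type1 are made up for this example, and the allowance of up to three leading spaces comes from the spec's general rule for HTML blocks rather than from the sentence quoted above.

```python
import re

# Start condition: "<script", "<pre" or "<style" (case-insensitive), followed by
# whitespace, ">" or the end of the line. Up to 3 leading spaces are allowed.
START_TYPE1 = re.compile(r"^ {0,3}<(?:script|pre|style)(?:\s|>|$)", re.IGNORECASE)

# End condition: the line contains "</script>", "</pre>" or "</style>" anywhere.
# The end tag does not have to match the start tag.
END_TYPE1 = re.compile(r"</(?:script|pre|style)>", re.IGNORECASE)


def starts_type1(line: str) -> bool:
    """True if this line could open a script/pre/style HTML block."""
    return START_TYPE1.match(line) is not None


def ends_type1(line: str) -> bool:
    """True if this line closes an open script/pre/style HTML block."""
    return END_TYPE1.search(line) is not None
```

Note that nothing here looks at blank lines, which is why the first snippet survives intact.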
On how to parse the second snippet, though, the spec says that since the snippet starts with <div>, the end condition for the entire block is a blank line.
How does this make any sense?
Why shouldn't script's rule take precedence for the lines wrapped between <script> and </script>?
This tracker is for bug reports. Discussion and questions like this should go to the forum at talk.commonmark.org. Feel free to open a topic there, after searching for existing relevant discussions. (But I think if you just read the whole section on HTML blocks in the spec, you'll find your question answered.)
Oh, I see you're not asking why in general a blank line ends an HTML block, but why the inner script tag's rule doesn't take precedence. Well, if you like you can bring this up in the forum. The rule is fairly simple; it won't automatically do what a human would expect in every case. We could talk about specific ways to change it (but please, nothing that requires unlimited backtracking).
@jgm How is this not a bug report, since script tags aren't parsed correctly?
The rule is fairly simple; it won't automatically do what a human would expect in every case.
I agree that it's pretty simple, but since no human would parse HTML in their head that way, I would consider it broken. Plus, there's a specific rule for parsing script, pre and style tags containing empty lines, but it breaks down pretty quickly.
I'm not familiar enough with parsers to make a detailed proposal about this, but roughly I'd say the rule for script, pre and style tags should just take precedence inside their blocks.
Can we please reopen this?
@fabiospampinato As you noted, the HTML block started by <div> requires a blank line to end it, and HTML blocks are leaf blocks (they cannot contain other blocks), so the line <script> never starts an HTML block.
@jgm & @fabiospampinato one idea for making this parse more "intuitively" might be to reassess the start condition on lines that don't meet the end condition and adjust the HTML block type accordingly in some order of precedence. E.g. if a subsequent line of a type 6 or 7 HTML block (which don't have, let's say "intuitive", end conditions in some cases) could start a type 1 through 5 HTML block then the HTML block will change to the respective type. Another way to phrase this might be that HTML blocks of type 1 through 5 may interrupt a type 6 or 7 HTML block.
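A rough sketch of how that interrupt rule might behave, under heavy simplifications: only type 1 and a handful of type 6 tag names are modelled, a closing </script> on its own line is not recognised as a type 7 start, and every non-HTML line is simply handed back as its own entry. The function split_blocks and its allow_interrupt flag are hypothetical names for this example, not anything taken from the spec or from an existing parser.

```python
import re

TYPE1_START = re.compile(r"^ {0,3}<(?:script|pre|style)(?:\s|>|$)", re.IGNORECASE)
TYPE1_END = re.compile(r"</(?:script|pre|style)>", re.IGNORECASE)
# Only a few of the type 6 tag names, for illustration.
TYPE6_START = re.compile(r"^ {0,3}</?(?:div|figure|p|table)(?:\s|/?>|$)", re.IGNORECASE)


def split_blocks(lines, allow_interrupt=False):
    """Group lines into (kind, lines) pairs; kind is 1, 6, or None for non-HTML."""
    blocks = []
    current = None  # kind of the HTML block that is currently open, if any
    for line in lines:
        if current == 1:
            # Type 1 swallows everything up to and including the end-tag line.
            blocks[-1][1].append(line)
            if TYPE1_END.search(line):
                current = None
            continue
        if current == 6:
            if line.strip() == "":
                current = None  # a blank line ends a type 6 block
            elif allow_interrupt and TYPE1_START.match(line):
                current = None  # proposed: type 1 may interrupt type 6
            else:
                blocks[-1][1].append(line)
                continue
        # No HTML block is open at this point: see whether this line starts one.
        if TYPE1_START.match(line):
            blocks.append((1, [line]))
            current = None if TYPE1_END.search(line) else 1
        elif TYPE6_START.match(line):
            blocks.append((6, [line]))
            current = 6
        else:
            blocks.append((None, [line]))
    return blocks


lines = ["<div>", "  <script>", "    Foo", "", "    Bar", "  </script>", "</div>"]
for kind, chunk in split_blocks(lines, allow_interrupt=True):
    print(kind, chunk)
```

With allow_interrupt=False the <div> block stops at the blank line and the indented Bar line is left to the rest of the Markdown parser; with allow_interrupt=True the whole <script> element ends up in one type 1 block. The check stays line-by-line, so it needs no backtracking.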
@fabiospampinato if you open something over on the forums, we can discuss pros and cons of, and ideas for an adjusted HTML block definition there that bears this example in mind?
so the line <script> never starts an HTML block
That's just a technicality of the spec. The rendered output contains a script tag, and that script tag is being output precisely because I wrote <script> in the Markdown I'm parsing, so I think it's reasonable to say that the <script> text is in fact the beginning of a script tag.
@fabiospampinato if you open something over on the forums, we can discuss pros and cons of, and ideas for an adjusted HTML block definition there that bears this example in mind?
Here you go: https://talk.commonmark.org/t/improve-detection-of-pre-style-and-script-tags-containing-an-empty-line/3216
Well, I'll open this again here. Given that we're not doing a full HTML parse, there are always going to be cases where the spec doesn't "do the right thing" as a human would judge it. But this one might be relatively simple to handle in the way @aidantwoods describes. It would make the spec more complex, and it's an issue you can work around quite easily (just put a blank line between the <div> and the <script>, so that the latter opens up a new HTML block).
just put a blank line between the <div> and the <script>
The current parser also messes up things like this:
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>t</mi><mi>e</mi><mi>s</mi><mi mathvariant="normal">_</mi><mi>t</mi></mrow><annotation encoding="application/x-tex">tes\_t</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9250799999999999em;vertical-align:-0.31em;"></span><span class="mord mathdefault">t</span><span class="mord mathdefault">e</span><span class="mord mathdefault">s</span><span class="mord" style="margin-right:0.02778em;">_</span><span class="mord mathdefault">t</span></span></span></span></span>
I'm rendering some LaTeX with KaTeX, and inside the outputted html there are two underscores, which get erroneously interpreted as regular Markdown.
The HTML parser should really get fixed.
@fabiospampinato why is that HTML, from an extension, inside Markdown? Don't extensions typically extend when compiling everything to HTML?
@wooorm It depends on the extension. The raw Markdown for that rendered HTML may look something like this: $$t_es_t$$. You need to parse that before the Markdown compiler runs, because otherwise the compiler may mess it up, e.g. by compiling _es_ to <em>es</em>. The problem here is that the Markdown compiler can still mess it up in some cases because the HTML parser isn't adequate.
get erroneously interpreted as regular Markdown.
No, get correctly interpreted as regular Markdown. This has always been the way Markdown handles inline HTML: the tags are passed through literally, but other stuff gets interpreted as Markdown. It's by design.
See https://johnmacfarlane.net/babelmark2/?normalize=1&text=%3Cspan%3Ea+b+c%3C%2Fspan%3E%0A
EDIT: HTML blocks are a different matter. They have always been distinguished from inline HTML, even in the original Markdown syntax description. Here commonmark does depart a bit from how things have usually been done, for reasons explained in the spec.
A better approach for you would be to use a parser that extends commonmark syntax with support for math.
EDIT: Alternatively, if this is display math, you can ensure that it is interpreted as a raw HTML block by surrounding it with <div> ... </div> on separate lines:
<div>
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>t</mi><mi>e</mi><mi>s</mi><mi mathvariant="normal">_</mi><mi>t</mi></mrow><annotation encoding="application/x-tex">tes\_t</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9250799999999999em;vertical-align:-0.31em;"></span><span class="mord mathdefault">t</span><span class="mord mathdefault">e</span><span class="mord mathdefault">s</span><span class="mord" style="margin-right:0.02778em;">_</span><span class="mord mathdefault">t</span></span></span></span></span>
</div>
Make sure this has blank lines around it and you'll have no trouble. Since you're generating the HTML in a preprocessing step, this should be easy to do.
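In a preprocessing step this could look roughly like the following. The helper name as_html_block is hypothetical, and the sketch assumes the generated HTML can safely lose its own blank lines (true for KaTeX-style output, not for preformatted text), since a blank line inside the <div> would end the HTML block early.

```python
def as_html_block(html: str) -> str:
    """Wrap generated HTML so that it forms a single raw HTML block.

    Blank lines inside the generated HTML are dropped, because a blank line
    would end the <div> block; blank lines are added around the wrapper.
    """
    lines = [line for line in html.splitlines() if line.strip()]
    return "\n\n<div>\n" + "\n".join(lines) + "\n</div>\n\n"


display_math = '<span class="katex-display"><span class="mord">_</span></span>'
markdown_source = "Some text." + as_html_block(display_math) + "More text."
print(markdown_source)
```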
@jgm Then I guess my point is that the spec should be amended if these sorts of situations are currently allowed.
I cannot imagine how anybody (except those writing the CommonMark spec I guess) looking at some Markdown text containing that HTML snippet, which doesn't even contain a newline, might want part of it to be parsed as anything other than raw HTML.
A better approach for you would be to use a parser that extends commonmark syntax with support for math.
Yeah I'll probably end up doing that, the current way I'm extending the language is too brittle.
EDIT: Alternatively, if this is display math, you can ensure that it is interpreted as a raw HTML block by surrounding it with <div> ... </div> on separate lines:
Interesting 🤔 I think this won't work for inline math expressions, though, if surrounding blank lines are necessary. IMO the existence of these sorts of tricks is a symptom that the current logic for HTML parsing is inadequate. I would bet money that more than 99% of Markdown users don't know about this trick.
You are right that humans looking at this will correctly guess that the underscore is meant to be a literal underscore. But the project here is to create some simple enough rules that allow for efficient parsing by a parser that doesn't incorporate general AI. Inevitably there will be cases here and there where the rules give counterintuitive results by human lights. Pointing that out doesn't mean there's a problem with the rules.
Keep in mind that the ability to mix markdown and inline HTML goes back to the original Markdown syntax spec and should be kept. My own personal opinion is that this was not the greatest idea. I'd have preferred an explicit syntax for including raw content (see here for some of my reflections). But our intent here is not to create something completely new, but to specify and rationalize existing Markdown conventions.
I don't think one needs to solve intelligence to write a passable HTML parser.
We're not parsing HTML here; we're parsing a mix of HTML and Markdown (remember what I've emphasized above -- Markdown has always allowed them to be mixed). The trick is to determine which bits are meant to be markdown and which are literal. An HTML parser won't tell you that. A human can often guess what is intended, but that's an exercise of general intelligence.
Sure, but that latest issue I stumbled on seems a tractable and somewhat common special case that should be improved upon. Trying to distill some of my general intelligence into an actionable rule:
- If during tokenization we encounter an HTML tag that is not within a code block and that is not self-closing everything from that point onwards on the same line will be considered raw HTML, until a matching closing tag is found.
Does that sound like a hypothetically implementable rule that would improve the current situation?
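To make the rule concrete, here is a rough sketch of what it might look like on a single line. It is an assumption-laden illustration, not a proposal for spec wording: the TAG regular expression is much looser than the spec's definition of a tag (for example, it breaks on a > inside an attribute value), code spans are not handled, and raw_html_spans is a name made up for this example.

```python
import re

# Very loose tag matcher: optional "/", a tag name, anything but angle
# brackets, an optional self-closing "/", then ">".
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9-]*)\b[^<>]*?(/?)>")


def raw_html_spans(line: str):
    """Return (start, end) spans of the line that the rule would keep as raw HTML."""
    spans = []
    pos = 0
    while True:
        m = TAG.search(line, pos)
        if m is None:
            break
        closing, name, self_closing = m.group(1), m.group(2).lower(), m.group(3)
        if closing or self_closing:
            pos = m.end()  # stray closing tags and self-closing tags open nothing
            continue
        depth = 1
        end = len(line)  # no matching closing tag: raw until the end of the line
        for n in TAG.finditer(line, m.end()):
            if n.group(2).lower() != name or n.group(3):
                continue  # a different tag, or a self-closing one
            depth += -1 if n.group(1) else 1
            if depth == 0:
                end = n.end()
                break
        spans.append((m.start(), end))
        pos = end
    return spans


line = '<span><span>a_b</span>_c</span> d_e_'
print([line[a:b] for a, b in raw_html_spans(line)])
```

Everything inside a returned span, including text such as *bar* in <ins>*bar*</ins>, would stop being interpreted as Markdown under this rule.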
@fabiospampinato A “feature” of Markdown is that you can write, for example, foo <ins>*bar*</ins> baz, which treats the HTML as HTML and the Markdown (*bar*) as Markdown, so as emphasis. There are many documents out there that rely on this, so it would be nontrivial to change (and, in my opinion, not a good idea).
@wooorm I don't think I've ever seen that used in the wild 🤔 It'd be interesting to know what percentage of people are actually doing that.
If this can't be changed because of backwards compatibility then I'm out of proposals.
I suppose I could hackishly wrap all the HTML I'm outputting in <x-html-shield> elements, base64 encode the inner HTML, and then do the inverse right before outputting the compiled string. Or I could implement this "properly" by extending the reference CommonMark compiler 🤔 I'll have to play with that.
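The hackish variant might look something like this. It is only a sketch: shield and unshield are hypothetical helper names, and it assumes the Markdown compiler passes the unknown <x-html-shield> element through unchanged as inline HTML. The standard base64 alphabet contains no underscores, asterisks or angle brackets, so nothing in the payload should be picked up as emphasis.

```python
import base64
import re

SHIELD = "x-html-shield"
_SHIELDED = re.compile(rf"<{SHIELD}>([A-Za-z0-9+/=]*)</{SHIELD}>")


def shield(html: str) -> str:
    """Hide generated HTML from the Markdown compiler behind a base64 payload."""
    payload = base64.b64encode(html.encode("utf-8")).decode("ascii")
    return f"<{SHIELD}>{payload}</{SHIELD}>"


def unshield(compiled: str) -> str:
    """Restore every shielded fragment in the compiled output."""
    return _SHIELDED.sub(
        lambda m: base64.b64decode(m.group(1)).decode("utf-8"), compiled
    )


fragment = '<span class="mord">_</span><span class="mord mathdefault">t</span>'
markdown_source = f"Inline math: {shield(fragment)} and more text."
# ... run the Markdown compiler on markdown_source here, then:
print(unshield(markdown_source))
```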
If this can't be changed because of backwards compatibility then I'm out of proposals.
Correct. This has been part of Markdown since the beginning.
Can we reopen this issue since the original issue reported hasn't been addressed yet and it looked like there might have been some movement there? i.e. @aidantwoods proposed something.
That is an interesting proposal, but it wouldn’t change your case of inline math. I think that proposal needs more discussion, as in, on the forums (I foresee problems with it)
@wooorm Yeah that's unrelated to the inline HTML issue I reported recently, but it will fix issues with wrapped script tags for instance.
@aidantwoods What is your view on this?
<div>
<script>
Foo
Bar
</script>
*is this emphasis*?
So if this is desired behavior, how is a person supposed to represent a block of code as a figure with a caption?
<figure>
<pre><code>
def sum(values):
    if not values:
        return 0
    # Recurse!
    return values[0] + sum(values[1:])
</code></pre>
<figcaption>A recursive Python function</figcaption>
</figure>
Because the blank lines signal the end of the <figure> HTML block, the remaining lines are interpreted as Markdown, and I end up with superfluous <p> tags and comments that are interpreted as headings.
@darthmall You can try something like this, which should work for your use case:
<figure>

```python
def sum(values):
    if not values:
        return 0
    # Recurse!
    return values[0] + sum(values[1:])
```

<figcaption>A recursive Python function</figcaption>
</figure>
If you need more control over the generated <code> and <pre> tags, you are out of luck, because despite what most Markdown users would think, you can't just write HTML inside Markdown and expect it to work.