comrak Invalid rendering of a list followed by another block

We're using comrak to build a markdown document programmatically and output it to a file (as markdown). At some point, we produce a list of items, where each item is a code snippet contained in a NodeValue::Code.

The NodeValue::Code nodes are directly added to the list item's children, which are themselves added to the NodeValue::List node. Then some other content is appended.

When rendered with format_markdown, comrak doesn't properly insert a new line as a separation between the list and the next element. If the next element turns out to be e.g. a Paragraph, then it is rendered as:

- `foo : Number`
- `foo | Even`
rest

instead of

- `foo : Number`
- `foo | Even`

rest

This is an issue, because the first one is different semantically, being equivalent under the commonmark spec to:

- `foo : Number`
- `foo | Even` rest

Here is a debug print (in comrak 0.17.0) of the AST being invalidly rendered:

test2.log

Looking at the rendering code in cm.rs, I found that strange bit: https://github.com/kivikakk/comrak/blob/26ad7543a512b1b07f7efcbcc49aca96f8a9c0ae/src/cm.rs#L407-L421

It turns out the renderer only inserts a newline after a list when the next node is code or another list. I don't really see why: we should always insert a new line I believe (excepted when the list is the last element of the children of a block).

However, the bug above doesn't occur when parsing and re-emitting markdown from a source file, e.g. with the comrak binary. After some investigation, it seemss that comrak always parse list items as NodeValue::Paragraph. Paragraph does insert a new line after itself: https://github.com/kivikakk/comrak/blob/26ad7543a512b1b07f7efcbcc49aca96f8a9c0ae/src/cm.rs#L564

However, when the list is tight, I think this bunch of code cancels out the newline inserted by paragraphs: https://github.com/kivikakk/comrak/blob/26ad7543a512b1b07f7efcbcc49aca96f8a9c0ae/src/cm.rs#L100. I'm not 100% sure, but I feel like for some reason the last item doesn't (although the in_tight_list_item doesn't seem to depend on the position of the item in the list). Or some other mechanism which let the self.need_cr to 2 and properly insert the required line.

In any case, one possible workaround is to always wrap the content of list items in a NodeValue::Paragraph. However, it feels a bit fragile, and I think the renderer should always emit a new line, instead of relying on the unspoken invariant that list items are wrapped in paragraphs to function properly.

If maintainers agree with the baseline, I can tentatively submit a patch.

Apr 03 '24 17:04 yannham

Hiya! :heart: Thanks for this comprehensive report; I'm pleased to see more and more folks using Comrak for Markdown manipulation and not just HTML rendering!

Unfortunately, I don't have time to look into it fully at the moment; I'm moving overseas in a couple days. If any others would like to chip in (or compare/contrast with cmark-gfm's behaviour), that'd be awesome!

Apr 04 '24 01:04 kivikakk

@yannham

A couple things I noticed about your AST, which you alluded to

one possible workaround is to always wrap the content of list items in a NodeValue::Paragraph.

When I look at an AST that is normally built, each item is wrapped in a paragraph, whether the list is tight nor not.

So for example,

- one
- two

generates

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "CommonMark.dtd">
<document sourcepos="1:1-2:5" xmlns="http://commonmark.org/xml/1.0">
  <list sourcepos="1:1-2:5" type="bullet" tight="true">
    <item sourcepos="1:1-1:5">
      <paragraph sourcepos="1:3-1:5">
        <text sourcepos="1:3-1:5" xml:space="preserve">one</text>
      </paragraph>
    </item>
    <item sourcepos="2:1-2:5">
      <paragraph sourcepos="2:3-2:5">
        <text sourcepos="2:3-2:5" xml:space="preserve">two</text>
      </paragraph>
    </item>
  </list>
</document>

Now I'm basing this off of the AST output at https://gitlab-org.gitlab.io/ruby/gems/gitlab-glfm-markdown/?text=-%20one%0A-%20two%0A. But that should be accurate, and in fact when I look at the html.rs code for a paragraph, it's specifically checking for tightness

https://github.com/kivikakk/comrak/blob/120a36cfd519e1490c555c4cf55f97c7a46c272e/src/html.rs#L679-L693

So I don't think that is fragile, but probably the correct way.

One other thing I noticed in the AST you provided is that you had two Document nodes, the second one wrapping the paragraph. I don't think you should need that.

wdyt?

Apr 24 '24 00:04 digitalmoksha

@digitalmoksha thanks for your input. Indeed, this is what I ended up doing - wrapping every list item in a paragraph. I guess my point is that it's not obvious when reading the common mark specification, for example their AST definition, that items must contain a paragraph. Now this is how comrak parses markdown in practice, but in theory inline text or inline code should be allowed as well. And this alternative isn't rendered properly.

Now I understand that this is more work to do to make the pretty printer compliant - as comrak is open-source and based on best effort, I think it's a reasonable answer to say "we always assume that items contain a paragraph node, and it's on you". But then maybe this invariant should be made clear somewhere; either in the documentation, or through the types themselves (by making it impossible to have an item without a paragraph somehow)? At least in my experience, it wasn't obvious at all what was the problem, from looking at the spec and at comrak types.

Yet another possibility is to wrap item content as paragraph on the fly when pretty-printing (no need to actually allocate a new node I guess, but just take the same code path as if the content was wrapped in a paragraph when it's not), so that with minimal change the pretty-printer would correctly handle a larger class of ASTs.

What do you think? I'm happy to help on either front - implementation or documentation.

Apr 24 '24 10:04 yannham

I'd make the observation that not every list item's contents necessarily shall be in a paragraph; e.g.:

- ```
  xyz
  ```

This produces the following XML AST:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "CommonMark.dtd">
<document xmlns="http://commonmark.org/xml/1.0">
  <list type="bullet" tight="true">
    <item>
      <code_block xml:space="preserve">xyz
</code_block>
    </item>
  </list>
</document>

But a code block is, like a paragraph, a block type. The spec and DTD are explicit — the spec not very clearly so, imo — that list items can only contain blocks, and therefore cannot contain e.g. a code inline. See § 5 Container blocks:

A container block is a block that has other blocks as its contents. There are two basic kinds of container blocks: block quotes and list items. Lists are meta-containers for list items.

A container block can only have other blocks as its contents, and list items are container blocks. As you've noticed, the DTD agrees:

<!ELEMENT item (%block;)*>

So I think when you say:

Now this is how comrak parses markdown in practice, but in theory inline text or inline code should be allowed as well. And this alternative isn't rendered properly.

By the theory of the spec, it should not be allowed, and it's just that Comrak's type definitions are too loose. The inline documentation is actually explicit about this too:

https://github.com/kivikakk/comrak/blob/56581d7275d8180f1a55771a2f3d41b6ebef26a6/src/nodes.rs#L42-L44

And NodeValue::contains_inlines does too, as it doesn't mention NodeValue::Item:

https://github.com/kivikakk/comrak/blob/56581d7275d8180f1a55771a2f3d41b6ebef26a6/src/nodes.rs#L419-L425

See also can_contain_type:

https://github.com/kivikakk/comrak/blob/56581d7275d8180f1a55771a2f3d41b6ebef26a6/src/nodes.rs#L614-L623

I mention these not to try to make my point loudly, but because these functions (derived from the spec) are important to how the parser ensures the correct types of things are nested (and breaks out of elements when necessary), and also how it chooses what and when to process.

Comrak parses things correctly for itself, of course, but it's not preventing users from creating a non-conformant document, with the result that its formatters then act out when supplied them.

The options I can see right now are (reworded for clarity):

Officially allow inlines-in-items and implement support for it.
Enforce no-inlines-in-items at the type level.
Verify no-inlines-in-items at runtime.
Add more documentation.
Do nothing.

My feelings on them are:

Changing what we accepted here would result in non-compliance with the spec, and would complicate the implementation for what I think isn't a good enough reason.
It would probably be quite difficult to ensure this at the type level, given how general the tree type is.
I think this fair game. We could add a completely optional function to fsck() or lint() or otherwise sanity check a Document (or any node), and we could even consider calling it by default on format in debug builds.
While it already is documented in a few places, it's obviously not clear enough, and not in the places users are looking! Specifically, we should have some documentation where you might have gone looking for it first.
I don't think this is right, given the amount of discussion we've been able to have about this already! This is too easy to become confused about for anyone.

Please let me know what you think!

Apr 24 '24 14:04 kivikakk

Thanks for the elaborate and considerate answer @kivikakk. I might have got confused by reading the AST definition, I thought custom_block was actually allowing inline elements to be part of block - as in an EBNF grammar - but in this formalism actually it's more like a constructor, and not a naked element. Thus, reading the spec again, I think you're entirely right.

I thus agree that you should then absolutely not make comrak non-compliant just for that use-case, it's not worth it (agree with feelings on 1.). I think 2. isn't that difficult, because block elements and inline elements are clearly delimited - but it would break backward compatibility (you could just have an additional layer of enum type Block and Inline classifying nodes, that is following a bit more blindly the provided official AST definition, although it's more cumbersome because it adds a level of indirection).

While it already is documented in a few places, it's obviously not clear enough, and not in the places users are looking! Specifically, we should have some documentation where you might have gone looking for it first.

This is a good point, although I don't really remember where I tried to look for this in the first place :sweat_smile: - I clearly missed the documentation that you point to, though.

I think this fair game. We could add a completely optional function to fsck() or lint() or otherwise sanity check a Document (or any node), and we could even consider calling it by default on format in debug builds.

I think this sounds good, also the same question arises: how to make users aware that they should run .check()? in the first place? So maybe the compromise of doing it by default on debug is a good one, I reckon (and I expect modern branch predictor to make the cost non-measurable for compliant ASTs).

Apr 24 '24 14:04 yannham

I think 2. isn't that difficult, because block elements and inline elements are clearly delimited - but it would break backward compatibility (you could just have an additional layer of enum type Block and Inline classifying nodes, that is following a bit more blindly the provided official AST definition, although it's more cumbersome because it adds a level of indirection).

It's a little more complicated than that. The node types you see in the nodes module don't actually carry the tree structure; they're provided by arena_tree, which doesn't know anything about the type of what it's carrying.

Even if we could solve that (which would mean changing the tree type entirely), there's still complexity in that some block types can contain only other blocks; some block types contain only inlines; some inlines can contain other inlines; other inlines may not have any children at all. There's the question of how far we'd want to go too; see can_contain_type for the full definition, but even the above is just a summary.

Without an exhaustive typed definition, there's always a bit further one could go, but it would also affect greatly how the parsing and processing goes, since the code (modeled after cmark) assumes it can traverse the whole tree with one type.

So maybe the compromise of doing it by default on debug is a good one, I reckon (and I expect modern branch predictor to make the cost non-measurable for compliant ASTs).

I think I agree!

A simple check can just run through the tree and ensure that can_contain_type(parent_node, child_node_value) is true for all children. Would you like to have a go at implementing it?

Apr 25 '24 08:04 kivikakk

It's a little more complicated than that. The node types you see in the nodes module don't actually carry the tree structure; they're provided by arena_tree, which doesn't know anything about the type of what it's carrying.

Ah, I see. I might have overlooked the complexity.

A simple check can just run through the tree and ensure that can_contain_type(parent_node, child_node_value) is true for all children. Would you like to have a go at implementing it?

Sure, I can take a stab at it!

Apr 25 '24 09:04 yannham

#425 now provides a method to detect the original offending case (validate()), and comrak does follow the common mark spec after all. so I suppose we can close this.

Jul 08 '24 21:07 yannham

Thank you!

Jul 09 '24 08:07 kivikakk

comrak comrak copied to clipboard

Invalid rendering of a list followed by another block

comrak
comrak copied to clipboard