markdown-rs Stronger types

Hi. I started using markdown-rs and noticed I had to add a panic into my code just because everything is a Node, but List can only really contain ListItems. Since 1.x is still in alpha, I figured this would be a good time to speak up.

Have you considered making the AST types stronger? I think List.children should be Vec<ListItem>, and similarly e.g. Strong cannot contain a Paragraph or a Heading..

mdast supports this line of thinking:

type MdastContent = FlowContent | ListContent | PhrasingContent
https://github.com/syntax-tree/mdast#list says children are https://github.com/syntax-tree/mdast#listcontent which is just an alias for https://github.com/syntax-tree/mdast#listitem
https://github.com/syntax-tree/mdast#strong says children are https://github.com/syntax-tree/mdast#phrasingcontent

and so on.

I should be able to contribute code too, if you're interested and breaking the API is ok.

Nov 04 '23 22:11 tv42

Yes, supporting content types of mdast more exactly would be very nice! This is actually my main blocker for v1. If you want to work on that: great!

Nov 06 '23 10:11 wooorm

Having spent a day looking at the source, I'm less enthusiastic. Changing the resulting data structure is simple enough, but the parser is heavily built on the idea that all nodes are interchangeable, their correct layout happens mostly by the incoming events not being arbitrary, and there's a lot of mystic stuff there that's not obvious, like the context stack, "virtual steps", links to other events (all of previous, next, content; also seemingly into the future), etc.

I'm still looking into it, but I don't think this change will happen without either

rewriting the parser (but not tokenizer)
adding a post-processing step that constructs the final tree out of a temporary "anything goes" tree (which seems weird and wasteful)

I'm exploring what the parser would have to look like, but the undocumented details of the events are getting in the way.

How confident are you in the test coverage for the parser? It seems the tests are mostly in terms of to_html, with roughly one AST-structure asserting test per node type.

Nov 07 '23 21:11 tv42

yeah, markdown is incredibly complex. And the extension mechanism is also very complex.

The parser is very well tested. There are few node tests, just the ones for coverage, because the AST is slated to change

Dec 08 '23 16:12 wooorm

I was looking into this for some projects I am working on. I suspect that changing the parser is really hard, and in fact may not be a good idea overall.

An alternative might be to have a "higher-level" AST that is strictly typed, and let the library user decide which one to use. The high-level, strictly typed, AST would be generated from the low-level one that exists now, without changing the parser.

In fact, that is what I am doing in my project. I am trying to model my AST similarly to the pandoc AST types, but adapted with a bit of richer types.

If there is interest in this idea, and if my project works, I would by happy to contribute code.

Dec 21 '23 10:12 xelatihy

AST already exists, and is prevalent across the JavaScript ecosystem, but needs to be better modelled here in Rust: https://github.com/DefinitelyTyped/DefinitelyTyped/blob/master/types/mdast/index.d.ts.

Or, what do you mean with “higher-level AST”?

Dec 21 '23 10:12 wooorm