commonmark-java Add roundtrip parsing/rendering

Fixes #12

Until now, this library could parse Commonmark into other formats, but not back into Commonmark format. This is understandable, as the Commonmark spec itself uses conversions to HTML to depict compliance. However, with the new CoreCommonMarkNodeRenderer it becomes possible to render Commonmark once it has been parsed - a "roundtrip" rendering. Usage is identical to HtmlRenderer.

KNOWN ISSUE:

If using IncludeSourceSpans.NONE (the default) or IncludeSourceSpans.BLOCKS, 4 test cases related to link reference definitions (193, 195, 198, and 217) will fail. In all cases it's due to line breaks, this is the only data structure where I couldn't find another way (besides SourceSpan) to track the line number if newlines occur. Use of IncludeSourceSpans.BLOCKS_AND_INLINES is recommended for 100% Commonmark roundtrip compliance.

DETAILS:

This is a relatively large PR. I have attempted to keep backwards compatibility whenever possible, but it's uncertain whether this update may break extensions to the core specification. Core specifications are confirmed to all pass.

In order to achieve this roundtrip functionality, new fields have been added to the various Commonmark nodes, and many changes to parsers have been added to populate these new fields. The commonmark-java library is geared towards conversion to HTML already, so I have chosen to keep that functionality and its optimizations intact and build extra structures on top of that approach rather than trying to rewrite what has already been done.

The one change which departs from existing functionality is the addition of a new BlankLine node type. This node represents a concept explained in the Commonmark spec (blank lines), but it is not truly defined in the spec as a Commonmark block. BlankLine should be treated as SourceSpan already is - a utility class that helps to describe the structure. (Speaking of SourceSpan, I have elected to keep them tracking the HTML wherever possible rather than the raw text in order to avoid breaking changes.)

The BlankLine class is specifically ignored in HTML rendering and is only currently used by the roundtrip rendering to fill in gaps which otherwise can't be determined from the parsed AST tree. In this way, existing plugins can render to other formats the same as they always have, but will also have the option of rendering back to Commonmark.

Aug 25 '21 18:08 seii

Hey! Just wanted to let you know that I haven't gotten around to review this yet, but it sounds promising :)!

Use of IncludeSourceSpans.BLOCKS_AND_INLINES is recommended for 100% Commonmark roundtrip compliance.

I think that's reasonable. As long as without the option, we can produce equivalent output (but not source-identital).

Aug 30 '21 06:08 robinst

I'm glad it looks promising so far, just let me know if there's anything I can help with. :)

I think that's reasonable. As long as without the option, we can produce equivalent output (but not source-identital).

Yes, that's the case right now. The 4 test cases that fail would still render all other details of the link reference definitions (including most whitespace), they would just be missing any line breaks inside the link reference definition itself.

Aug 30 '21 13:08 seii

All individual comments addressed, a couple will need your comments.

1. I might have missed it but where are the tests?

Other than updating some of the internal tests where they had broken, I hadn't added new ones because most of what I have done are edits to existing work. I do see that I'm missing a test for the new renderer, but in my mind it also happened to be thoroughly tested by making sure all CommonMark tests cases are rendered correctly. (I also figured it would be a bit overkill to include all 652 CommonMark tests.) If you have suggestions for where you'd like to see more test coverage I will be happy to write them and push an update.

2. Have you run the benchmarks yet and compared to before your changes? I wonder if it makes sense to put the code for this behind an option (which would also automatically enable source positions). I'm not sure it's worth it, just thinking.

Other than the output for how long individual existing tests took, I couldn't see an obvious benchmarking method. Is there a wiki or document for how I could achieve this? I'm open to running the benchmarks, I just want to make sure that whatever I provide is an apples to apples comparison.

3. What are your thoughts about users that would like to generate an AST using code and then render to CommonMark? In that case, with the current implementation, users would have to use `setRaw` to get the desired rendering, otherwise they'd get invalid CommonMark, right (i.e. just `setLiteral` wouldn't be enough).

I think generating an AST directly, without going through parsing, is already risky without my changes and would continue to be about the same level of risk. Unless someone uses a DocumentParser.parse method in some way, they are skipping pieces of parsing logic. Even manually generating individual type parsers for some limited sanity checks would still be messy to stitch together.

My guess is that most people attempting to generate an AST would be directly making a tree of nodes. This means they have to be very confident they know things per node like the correct literal values, when lists are tight, when Text is in a Paragraph and when it's just by itself, etc. After my changes, they would also need to have confidence in where certain whitespace is and make sure that setRaw is used in addition to setLiteral. There's a little more that could go wrong once roundtripping is available, but I think someone encountering a problem with direct AST creation is already in the Land Of Dragons to begin with. 🔥 🐉

4. In general, I feel like it's a lot of tricky code additions. If you have any ideas on how to condense some of the parsing code, that would be great. Do you think something like `Scanner` for block parsers would help?

I agree that much of what is changed in the parsers, at least, is tricky. commonmark-java is not, from my outsider observations, a library that aims to natively roundtrip and also just happens to render to HTML. Rather, I encountered a lot of places where rendering to HTML is clearly its native focus. I'm dancing on the edges of that code and trying to capture information that was often not intended to be captured by an HTML-first design. I'm taking it as a large compliment that you haven't outright called this PR an ugly hack. 😃

As for ideas, I will say that I can't think of much which would also preserve this library's original intent. It would likely be much cleaner to start with a roundtrip object and split out what the HTML needs (as HTML rendering is usually heavily distilled from the raw). That would also probably be a ton of work, starting at DocumentParser.parse and going all the way through. On the plus side, at least if that ever happened the POJOs you needed for each data structure are already done... just sayin'!

(To answer your specific question, I think a Scanner for block parsers would be only partially useful from this PR's perspective. CommonMark seems to recommend parsing in a single pass, which means a scanner would be only able to deal with a single block and wouldn't have a good grasp of any relationships between blocks. It might, however, be able to identify the blank lines between blocks in a more intelligent fashion. If a block-level scanner parsing through raw contents knew that a previous block existed there would probably be ways to capture the blank lines between "previous" and "the current block".)

Sep 06 '21 19:09 seii

I believe all initial review comments have been addressed for now. I was able to successfully run a mvn clean package -P release on my local machine, which also helped in cleaning up several items that caused some of the extensions to fail tests.

No rush, just making sure I close the loop on your original review.

Oct 05 '21 15:10 seii

is there someone active on this? I'd really like to see the roundtrip feature deployed

Jun 14 '22 10:06 ToWi87

@ToWi87 I did not keep this PR in sync with the main repository, as it seemed like the PR wasn't going to be accepted.

Jun 14 '22 18:06 seii

is there someone active on this? I'd really like to see the roundtrip feature deployed

Seconded!

Aug 01 '22 11:08 robmoffat

Hey @seii. You might have seen that I've implemented a MarkdownRenderer in https://github.com/commonmark/commonmark-java/pull/306 now. It's not perfect but it's a start and it focuses on producing equivalent Markdown, either from parsed nodes or from manually created nodes.

I'm sorry this PR didn't work out, closing it now.

I think the way forward for improving round-tripping should be a few focused smaller PRs, e.g. one for links, one for escaping, insignificant whitespace, etc. For each of those, the aim needs to be:

The data that is added to the nodes is additional, not a whole replacement of e.g. literal. But I think it also means that when a node tree is parsed from source, and then manipulated, some of that extra data needs to be reset in order not to produce an invalid output. E.g. if you parse a foo\!bar (unnecessary escape) as a Text, but then call setLiteral on it to change the text, the escape data needs to be cleared. (There is another option here, we could make nodes immutable once constructed but that's a bigger departure from the current API which is very mutable everywhere.)
If a node is constructed manually, the rendering still works and produces equivalent output. This should apply to the whole node tree or even if parts of it are processed.
It should not rely on IncludeSourceSpans being set. That doesn't mean the parser can't share some of the logic to achieve both, but from an API perspective, I think we should have good round-tripping without requiring source spans.

Mar 16 '24 03:03 robinst

commonmark-java commonmark-java copied to clipboard

Add roundtrip parsing/rendering

KNOWN ISSUE:

DETAILS:

commonmark-java
commonmark-java copied to clipboard