pulldown-cmark-to-cmark Add round-trip fuzz test

This checks that the library can accurately go from Markdown to events and back to the same Markdown. I'm trying to use the library in https://github.com/google/mdbook-i18n-helpers/issues/19.

My assumption was that I can round-trip Markdown through the library. While there is a lot of Markdown input which can produce a given sequence of pulldown_cmark::Events, I was hoping to normalize the output by running it through cmark. So I hoped that

    let round_trip_1 = round_trip(&text);
    let round_trip_2 = round_trip(&round_trip_1);
    assert_eq!(round_trip_1, round_trip_2);

would hold, there round_trip runs text through cmark(Parser::new(&text), &mut result), see round-trip.rs.

The fuzz test fails. To run it yourself, install cargo fuzz and run

cargo fuzz run round-trip -- -only_ascii=1

Input like "**" becomes "\\**" after one iteration and then becomes "\\*\\*" after a second round of normalization:

"**" -> [
  Start(Paragraph)
  Text(Borrowed("**"))
  End(Paragraph)
]

The events become "\\**", which the become these events:

"\\**" -> [
  Start(Paragraph)
  Text(Borrowed("*"))
  Text(Borrowed("*"))
  End(Paragraph)
]

These events are finally turned into "\\*\\*".

I can work around this case by merging adjacent Text events — but if I do that, I find a more complex input that fails the fuzz test.

Is there a chance to have the library support this kind of round-tripping?

May 27 '23 17:05 mgeisler

Thanks for bringing this up! I agree that round-tripping would be a great property, but believe that the markdown parser in its current incarnation at least is inherently lossy in a couple of places.

I agree that there will be a lossy step here: a list item will result in Start(Item) being emitted regardless of what kind of "bullet" you used in the Markdown (* foo vs - foo, ...).

That is expected and I'm actually talking about round-tripping after such normalization. That is why my fuzzer does this:

    let round_trip_1 = round_trip(&text);
    let round_trip_2 = round_trip(&round_trip_1);
    assert_eq!(round_trip_1, round_trip_2);

The hope would be that round_trip_1 is normalized Markdown where all lists are formatted the same (using Options::list_token. It is only when round-tripping this normalized Markdown a second time that we can expect equality.

However, some brief testing showed me that this is not the case. I've tried round-tripping more and more times in the hope of finding a fixpoint. However, it seems that I'm just generating more elaborate test cases :smile:

Here is an input which changes shape at least five times:

"~~\\~~~`"
-> "~~~~~\\`"
-> "\n`````\n````"
-> "\n````\n````\n````"
-> "\n````\n````\n\n````\n````"

I updated the PR to have this code in case it's interesting.

My disposition here is to not merge due to non-determinism that could also be fully deterministic, but wanted to hear your thoughts first. Thanks for sharing.

Yes, definitey! This PR is not ready to be merged, let me mark it as a WIP to make it more clear to others.

The fuzz test here is more of a future goal, and perhaps a way to find small low-hanging fruit which spoil the round-tripping today.

May 27 '23 22:05 mgeisler

The corresponding events to the "~~\\~~~"` input:

"~~\\~~~`" -> [
  Start(Paragraph)
  Text(Borrowed("~~"))
  Text(Borrowed("~~~"))
  Text(Borrowed("`"))
  End(Paragraph)
]
"~~~~~\\`" -> [
  Start(CodeBlock(Fenced(Inlined(InlineStr { inner: [96, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], len: 1 }))))
  End(CodeBlock(Fenced(Inlined(InlineStr { inner: [96, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], len: 1 }))))
]
"\n`````\n````" -> [
  Start(CodeBlock(Fenced(Borrowed(""))))
  Text(Borrowed("````"))
  End(CodeBlock(Fenced(Borrowed(""))))
]
"\n````\n````\n````" -> [
  Start(CodeBlock(Fenced(Borrowed(""))))
  End(CodeBlock(Fenced(Borrowed(""))))
  Start(CodeBlock(Fenced(Borrowed(""))))
  End(CodeBlock(Fenced(Borrowed(""))))
]
"\n````\n````\n\n````\n````" -> [
  Start(CodeBlock(Fenced(Borrowed(""))))
  End(CodeBlock(Fenced(Borrowed(""))))
  Start(CodeBlock(Fenced(Borrowed(""))))
  End(CodeBlock(Fenced(Borrowed(""))))
]

It looks like a bug to me that Start(Paragraph) Text(Borrowed("~~")) ... End(Paragraph) is turned into something which is parsed as a Start(CodeBlock(..) next. I suppose the underlying problem here is that Options::code_block_token_count is a fixed integer — perhaps I could pick it conservatively based on the input? Something like counting the number of ~ or backticks int the input should help.

May 27 '23 22:05 mgeisler

Thanks so much for elaborating! I understand now that after one round it could stabilize even if it isn't generally round-tripable. It would also be my expectation that issues with multi-step round-tripping are solely to be attributed to this crate rather than the parser. I see great value in fixing this and appreciate your help with this. Maybe there could be a few tests that expose problems and make a fix feasible.

May 28 '23 20:05 Byron

I see great value in fixing this and appreciate your help with this. Maybe there could be a few tests that expose problems and make a fix feasible.

Yeah, I'll add normal tests for individual cases in a new PR. I see this PR as a bit of a test-generating-machine :smile:

May 30 '23 07:05 mgeisler

The "using a fuzzer to generate test-cases for invariants" definitely goes into the tool-belt :). Thanks for pointing it out.

May 30 '23 09:05 Byron

The "using a fuzzer to generate test-cases for invariants" definitely goes into the tool-belt :). Thanks for pointing it out.

Yeah, it's a new tool for me too! :smile: It's something I want to try applying more in various projects since it super powerful.

Jun 03 '23 15:06 mgeisler

Should we close this PR :)?

Feb 19 '25 06:02 Byron