Empty lines and whitespace are not preserved
It appears that either pulldown-cmark or pulldown-cmark-to-cmark itself silently collapses any number of consecutive empty lines into a single empty line, and also strips trailing whitespace. For example, for this file test.md (with several blank lines between the two letters):

```
a



b
```

When I run cargo run --example stupicat test.md I get:

```
a

b
```
I see that in Options there are a number of tweaks to select the number of empty lines after various elements, but there is no option to preserve empty lines exactly as they were in the source document. (More generally I would like to preserve trailing whitespace at the end of each line as well: after the "a" above there are some spaces, but they are stripped in the output.)
How easy would it be to work around this? I mean, is this a limitation of pulldown-cmark itself (i.e. it throws away this information and there is no way to recover it), or just a missing feature in pulldown-cmark-to-cmark?
Thanks a lot for reporting! This is a known issue, but it might be that by now there are better ways to deal with it.
To see whether the events allow reconstructing the original document, you could run stupicat with this patch:
```diff
diff --git a/examples/stupicat.rs b/examples/stupicat.rs
index fb16e4e..4c724ef 100644
--- a/examples/stupicat.rs
+++ b/examples/stupicat.rs
@@ -26,7 +26,12 @@ fn main() -> Result<(), Box<dyn std::error::Error>> {
state.finalize(&mut buf)?;
}
} else {
- cmark(Parser::new_ext(&md, options), &mut buf)?;
+ cmark(
+ Parser::new_ext(&md, options).inspect(|item| {
+ dbg!(item);
+ }),
+ &mut buf,
+ )?;
}
stdout().write_all(buf.as_bytes())?;
```
Now it can be run like this:
cargo run --example stupicat -- file-with-linebreaks.md
It's likely that pulldown-cmark is still lossy, so it will remove information that would be needed to fully reconstruct the original.
However, by now there is the new cmark_resume_with_source_range_and_options() function that provides offsets into the original buffer, if available, which could be used to detect gaps and emit them as well. This would be a way to fix this, I think.
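To illustrate the idea (a rough sketch only; the helper name is made up and this is not the crate's API): the ranges from into_offset_iter can be used to find the bytes that no event covers, which is exactly where blank lines and trailing whitespace end up.

```rust
use std::ops::Range;

use pulldown_cmark::Parser;

/// Sketch only: collect the byte ranges of `source` that no event's
/// range covers. These gaps are where blank lines and other trivia
/// live, and a serializer could copy them back out verbatim.
fn uncovered_gaps(source: &str) -> Vec<Range<usize>> {
    let mut gaps = Vec::new();
    let mut last_end = 0;
    for (_event, range) in Parser::new(source).into_offset_iter() {
        if range.start > last_end {
            gaps.push(last_end..range.start);
        }
        last_end = last_end.max(range.end);
    }
    if last_end < source.len() {
        gaps.push(last_end..source.len());
    }
    gaps
}
```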
Yes, right now I am using offsets. I get the offsets with Parser::into_offset_iter and then I can use String::replace_range in the original string. That way I guarantee that I am changing only the specific parts I care about, and not the rest of the document. Actually I was surprised this is not the way pulldown-cmark-to-cmark already works.
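For concreteness, this is roughly what I do (simplified; the helper name is made up for this example): collect the ranges of the events I want to change, then splice replacements into the original string so every untouched byte survives.

```rust
use pulldown_cmark::{Event, Parser};

/// Simplified sketch of the offset-based editing described above.
/// Only the edited spans change; all surrounding whitespace, blank
/// lines and formatting are left exactly as they were in `source`.
fn rename_in_place(source: &str, from: &str, to: &str) -> String {
    // Collect (range, replacement) pairs first; the ranges point into
    // the original source, so we must not mutate while iterating.
    let mut edits = Vec::new();
    for (event, range) in Parser::new(source).into_offset_iter() {
        if let Event::Text(text) = &event {
            let text: &str = text; // CowStr derefs to str
            if text == from {
                edits.push((range, to.to_owned()));
            }
        }
    }

    // Apply edits back-to-front so earlier ranges stay valid even if
    // the replacement has a different length.
    let mut out = source.to_owned();
    for (range, replacement) in edits.into_iter().rev() {
        out.replace_range(range, &replacement);
    }
    out
}
```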
The problem I see with this approach is that, since I am concatenating strings in the end, if I am not careful (and escape all external input etc) I will generate invalid markdown. With pulldown-cmark-to-cmark there is no risk of generating invalid markdown, but I would need to track down every little corner case regarding whitespace.
(Another issue is that I need to keep around the original file, which may be undesirable if I am streaming from disk. But this is not a problem for me)
There's probably an abstraction that enables precise edits that don't change unrelated whitespace, but still guarantee that the generated markdown is valid. But I'm not sure if it would be a good fit for this crate (I expect it is).
But maybe the best way to fix this is for pulldown-cmark to add "trivia" to its output, which is metadata that stores the trailing whitespace of each event it produces (like described in this rust-analyzer issue). That way, I think pulldown-cmark-to-cmark wouldn't need to be restructured - it would "just work" with minimal changes.
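Just to illustrate the shape I have in mind (this is not an existing pulldown-cmark type, only a sketch):

```rust
use pulldown_cmark::Event;

/// Hypothetical shape only; pulldown-cmark does not expose this today.
/// Each event would carry the whitespace that followed it in the
/// source, so a serializer could replay the document byte-for-byte.
struct EventWithTrivia<'a> {
    event: Event<'a>,
    /// Trailing spaces, tabs and blank lines consumed after the event.
    trailing_trivia: &'a str,
}
```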
--
Also, the reason I don't want to mess with whitespace is that I am editing files that are checked out in git, and I want to minimize diffs. While editors typically already strip trailing whitespace (so I probably don't need to worry about that too much), most don't strip consecutive newlines. And handwritten markdown found in the wild will often have several empty lines in a row to better divide sections etc (or at least mine does), even though markdownlint doesn't like it. So my line of thinking is: if someone wants to use a linter to normalize markdown, they should use the linter directly (through the vscode extension etc); my tool shouldn't mess with it.
Yes, right now I am using offsets. I get the offsets with Parser::into_offset_iter and then I can use String::replace_range in the original string. That way I guarantee that I am changing only the specific parts I care about, and not the rest of the document. Actually I was surprised this is not the way pulldown-cmark-to-cmark already works.
I didn't ever think about doing that :)! The original idea was entirely based on streaming events, so there is no access to the input buffer. The latter was added later, but indeed it can be used that way. If you think that there is an abstraction that would facilitate the 'find-and-replace' workflow, please feel free to contribute it.
(Another issue is that I need to keep around the original file, which may be undesirable if I am streaming from disk. But this is not a problem for me)
I also think that in practice, all markdown files conveniently fit into memory :).
There's probably an abstraction that enables precise edits that don't change unrelated whitespace, but still guarantee that the generated markdown is valid. But I'm not sure if it would be a good fit for this crate (I expect it is).
But maybe the best way to fix this is for pulldown-cmark to add "trivia" to its output, which is metadata that stores the trailing whitespace of each event it produces (like described in this rust-analyzer issue). That way, I think pulldown-cmark-to-cmark wouldn't need to be restructured - it would "just work" with minimal changes.
Indeed, there might be no sane way to respect whitespace unless pulldown-cmark changes its posture. Nonetheless, the find-and-replace in the original string seems promising; maybe it's even possible to detect whitespace that isn't part of the event itself but is part of the range it consumed, so it can be re-added. Of course, it's more of a workaround for an issue that is better fixed where it originates.
Ok so, I was thinking about this and I think that the signature of your cmark_resume_with_source_range_and_options basically has all the information needed to work losslessly. So you don't need a new function, you just need to change the implementation of this one.
Maybe it would be nice to add a new kind of serialization option (maybe named lossless_output, lossless_formatting or something like that) that means "make the events with ranges a verbatim copy of the input, otherwise format normally", preserving whitespace, newlines, etc. Maybe it could even be set by default: that would change the default serialization, but it can kind of be viewed as a bugfix (note that you already escape markdown only if it was escaped in the source, so there's already an attempt to match the input file; doing a verbatim copy is an extension of that).
But anyway, the intended usage would be: for synthetic events (that is, for stuff you are injecting into the markdown that wasn't present in the input file), you must set the range to None. This is pretty important, because setting a range means copying from the source. So you should set the range only if you are copying something from the source.
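A sketch of what I mean by that contract (the helper name is made up, and I'm leaving out the call to the serializer itself): events copied from the input keep their range, while injected events carry None.

```rust
use std::ops::Range;

use pulldown_cmark::{Event, Parser};

/// Sketch only: pair every event parsed from `source` with its range,
/// and append one synthetic event that carries no range, signalling to
/// the serializer that it must be pretty-printed rather than copied.
fn annotate<'a>(
    source: &'a str,
    injected: Event<'a>,
) -> Vec<(Event<'a>, Option<Range<usize>>)> {
    let mut out: Vec<_> = Parser::new(source)
        .into_offset_iter()
        .map(|(event, range)| (event, Some(range)))
        .collect();
    out.push((injected, None));
    out
}
```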
There are some complications I can see, but they are not insurmountable:

1. Indentation changes. Maybe something was inside a list item in the original markdown and I am now emitting it at the top level; I should detect the indentation of where the item was placed and remove it. This is relevant for multi-line list items. Or, maybe something was multi-line text that wasn't inside a list, and now I put it in a list item: it should be indented properly. I think that no extra input is needed, and the lib can work out the indentation of items itself.
2. If I have some events in the source markdown with a Tag and a matching TagEnd, I may want to add or remove stuff from inside them. In this case I think it's good to add ranges for the Tag and TagEnd events (even though the contents aren't entirely unchanged), and for all unchanged events inside them; but not add ranges for new events (those should be pretty printed per the Options rules).
3. If someone for example changes the contents of a Text event, they should emit a None range. If they don't, the library will be confused and may print the old event. This is a pretty bad footgun, because there is no way for a compiler warning to be emitted in this case. Guarding against that is kind of difficult (maybe the library can reparse that section and raise an error if it doesn't match the intended event; but this could be slow and also, if we are reparsing anyway, what is the event for? Just the source range is enough).
Another option: rather than receiving an iterator of (Event, Option<Range>), we could receive an iterator of something like this simplified type EventOrVerbatimCopy (but with a better name):

```rust
use std::ops::Range;

use pulldown_cmark::Event;

enum EventOrVerbatimCopy<'a> {
    SyntheticEvent(Event<'a>),
    VerbatimCopy(Range<usize>),
}
```
Basically, whenever you emit an event, you either create an entirely new event (a SyntheticEvent), or you copy something from the source markdown (a VerbatimCopy). If you are copying, you just need the range; you don't need the actual event name.
This would solve complication 3 above, because it would make clear that if you are creating a new event you should emit just the event, with no range; and if you are copying from the file, the event is often redundant.
(But not always: this would need further elaboration to support the case of adding a new list item to a list etc, so the actual type would end up being more complex than that, maybe adding a variant SyntheticEventWithRange(Event<'a>, Range<usize>) or something like that; but the gist of it is that for the simple case of copying some portion of the markdown file we probably don't need the corresponding event, and requiring it could be error prone.)
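To make the intended usage concrete, a tiny hypothetical example on top of the simplified enum above (the function name and the placeholder text are made up):

```rust
use std::ops::Range;

use pulldown_cmark::Event;

/// Hypothetical usage of the simplified enum above: everything around
/// the edited span is copied verbatim by byte range, and only the new
/// content is described as a synthetic event for the serializer to
/// pretty-print.
fn replace_span(source_len: usize, edit: Range<usize>) -> Vec<EventOrVerbatimCopy<'static>> {
    vec![
        EventOrVerbatimCopy::VerbatimCopy(0..edit.start),
        EventOrVerbatimCopy::SyntheticEvent(Event::Text("replacement".into())),
        EventOrVerbatimCopy::VerbatimCopy(edit.end..source_len),
    ]
}
```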
Thanks for sharing!
I'd also think that with the range-based iterator one should be able to do better, but only if the consumed range is complete, i.e. contains all the whitespace even though some of the whitespace might have been removed in the actual event.
Besides that, I'd rather not break the world with this, so an implementation attempt would probably add a new top-level function rather than change how the current ones work. Of course, if that turns out to be a hassle and isn't done, but everything is actually working, then one could consider a breaking change; I still hope that can be cheaply avoided.