mdformat inappropriately replaces HTML entity references with the corresponding Unicode character
Describe the problem
HTML entity references are valid Markdown according to the CommonMark Spec https://spec.commonmark.org/0.30/#entity-and-numeric-character-references and are useful for specifying punctuation characters which are otherwise difficult or impossible to enter via the keyboard, and are particularly helpful when specifying special whitespace characters which would otherwise be invisible.
Unfortunately, mdformat replaces entity references with their corresponding Unicode characters, which it probably should not because they are valid Markdown under the CommonMark Spec. The replacement appears to be an artefact of the CommonMark Spec parsing requirements, and causes problems: it produces a formatted Markdown document which contains Unicode characters which are difficult to edit and may be difficult to identify. For example, the display difference between —, – and - will vary depending on context, may be minimal in some contexts, and may cause searches to fail in ways which are really confusing. Whitespace characters are, of course, invisible, which can be problematic in itself, and it's not clear to me that whitespace entity references which happened to occur at the beginning or end of a line and were replaced by Unicode whitespace characters will not then be subsequently removed by mdformat.
I would suggest that mdformat should not change entity and numeric references.
This should be easy to do, in principle, by disabling entity and numeric reference substitution in markdown-it-py when it's called... if the markdown-it#manage-rules facility has been ported? If it's not available, then perhaps porting the markdown-it-html-entities plugin will be necessary.
Link to your repository or website
No response
Steps to reproduce
$ echo 'This → `→` is an HTML entity reference.' > test.md
$ cat test.md
This → `→` is an HTML entity reference.
$ mdformat test.md && cat test.md
This → `→` is an HTML entity reference.
The version of Python you're using
No response
Your operating system
No response
Versions of your packages
No response
Additional context
The reasoning behind the CommonMark Spec requiring parsers to replace most entity references with their corresponding Unicode characters has been lost from the 3.0 spec, but can be seen in the 2.3 spec change log https://spec.commonmark.org/0.23/changes.html and in https://github.com/commonmark/commonmark-spec/issues/137#issuecomment-56757907 by John MacFarlane from 2014:
Let me say more about why the change made sense to me. I like the idea that the AST produced by the parser is not HTML specific. Having "entities" as AST elements, instead of the characters they represent, seems to go against that goal. Let's say we want to render to LaTeX. Knowing that we have as entity "ö" is much less useful than knowing that we have the character ö.
A couple of years later, John raised https://github.com/commonmark/commonmark-spec/issues/442 to reconsider this:
As noted in this thread, it might be desirable to change what the spec says about entities.
Arguably the spec should not require that entities be replaced (in the parsing phase) by unicode characters. A replacement will be necessary for some output formats, but there is no reason why an implementation that only targets HTML should do the replacement at all, and even an implementation that targets multiple formats might choose to handle entities in the renderer, or in an intermediate AST filter. And some implementations might want to preserve entities in the output.
markdown-it has a number of Issues related to entities and those which ask about disabling the entity reference replacement are typically advised to use .disable('entity') or the markdown-it-html-entities plugin:
Thanks for the issue!
Yeah HTML entities are replaced mainly because the parser does so by default so it was the path of least resistance.
I don't oppose to not doing so, but if we make the change we must test it extensively for any regressions. Having the entities as one codepoint is probably less prone to bugs because they are more likely to go through all of the processing (charater escaping, word wrapping etc.) as the same one codepoint, whereas a multi character representation is more likely to be accidentally broken.
The current behavior could also be seen as a feature of sorts: if your text accidentally contains a valid entity, you'll be able to notice it before rendering and add an escape. Also, Markdown source that is correctly "mdformatted" will more resemble rendered output.
And yeah, changing this will need changes in the parser, not only mdformat. I haven't looked into the linked extension that deeply but I assume something like that is needed. I don't think adding a .disable('entity') will be useful as I expect we won't be able to differentiate an HTML entity from an escaped HTML entity.
I appreciate your quick reply! I was testing over the weekend and I think that the current behaviour isn't as consistent as would be desired.
White space entity references show some anomalies, the most serious is the replacement of with because it now looks identical to a normal space but behaves like a non-breaking space in the rendered output and even in the editor. The only way to tell the difference will be to inspect the file with a hex editor.
A count of the number of named entity references in my Markdown documents, only the minority written by me, found the most frequently used entity reference is with 493 occurrences.
Other anomalies include the replacement of named entity references with numeric entity references. A very artificial example:
Input
Some white space ‏ ‎ ‍ ‌
     
Space is stripped from the front and back... 
Output
Some white space
 
 Space is stripped from the front and back... 
Note: What can't be seen, because white space is invisible, is that     has been removed from the second line.
Looking into the white space handling, the paragraph() function is responsible for the numeric entity reference substitutions →  and  →  above:
https://github.com/executablebooks/mdformat/blob/a856f538e2dcb81a83e9013fb073f16cd6e53972/src/mdformat/renderer/_context.py#L382-L386
As the comments indicate, this is trying to un-parse to recover the Markdown source, based on an assumption that the source specified a numeric entity reference rather than a named entity reference. That's a bit problematic, because the source may have specified the white space in a different way.
As a hack to work-around the entity replacement until markdownit-py is enhanced, I'm considering adding a configurable option which replaces certain Unicode characters with HTML entities, based on the HTML entities in my existing Markdown files (see below). Feedback on this approach would be welcome.
It's easy to generate a candidate list of HTML entities to replace Unicode characters:
$ find . -name "*.md" -exec grep --extended-regexp --only-matching "&\w+;" {} \; | sort | uniq -c | sort --reverse
40
12 μ
12 Ω
9 ×
7 °
3 ⋅
3 ±
2 →
1 ≈
So, looking into this, it rapidly became apparent that it will be much easier to disable the substitution than it will be to reverse substitutions. ~~The substitution appears to be made by a single call to html.unescape() in the block markdown-it-py/markdown_it/common/normalize_url.py#L32-L53. The block of code handles two separate cases, escaped character and entity, so they will have to be separated. I guess this needs to be raised as an Issue on markdown-it-py, but it also looks like it might be possible to implement this as a markdown-it-py plugin?~~ aaand, looking further, that's not the place.
Draft fix committed as https://github.com/jamesquilty/mdformat/commit/14f412287ddeb410a3d221e1387d7c6047db6cda, which calls .disable('entity'), as suggested above.