turndown
Span rules + br can break commonmark standard
Implementation here:
https://github.com/mixmark-io/turndown/blob/4499b5c313d30a3189a58fdd74fc4ed4b2428afd/src/commonmark-rules.js#L209
Reproducing example: turndown("<em>foo<br/></em>") == "_foo  \n_" (two trailing spaces before the newline, i.e. a hard line break)
https://spec.commonmark.org/0.30/#emphasis-and-strong-emphasis
A single _ character can close emphasis iff it is part of a right-flanking delimiter run and either (a) not part of a left-flanking delimiter run or (b) part of a left-flanking delimiter run followed by a Unicode punctuation character.
and
A right-flanking delimiter run is a delimiter run that is (1) not preceded by Unicode whitespace, and either (2a) not preceded by a Unicode punctuation character, or (2b) preceded by a Unicode punctuation character and followed by Unicode whitespace or a Unicode punctuation character. For purposes of this definition, the beginning and the end of the line count as Unicode whitespace.
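This means that commonmark2html("_foo  \n_") = "<p>_foo<br/>_</p>", i.e. the <em> is lost: the closing _ is preceded by whitespace (the line break), so it is not part of a right-flanking delimiter run and cannot close the emphasis.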
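To make the quoted rule concrete, here is a minimal sketch that checks only condition (1) of the right-flanking definition. The helper name `precededByWhitespace` is made up for illustration and is not part of turndown or of any CommonMark library.

```javascript
// CommonMark's "Unicode whitespace": any code point in category Zs,
// plus tab, line feed, form feed and carriage return.
const UNICODE_WHITESPACE = /[\p{Zs}\t\n\f\r]/u

// Hypothetical helper: true when the delimiter run starting at `index`
// is preceded by Unicode whitespace or sits at the very start of the
// input. Such a run fails condition (1) and so cannot close emphasis.
function precededByWhitespace(markdown, index) {
  return index === 0 || UNICODE_WHITESPACE.test(markdown[index - 1])
}

const broken = '_foo  \n_'
console.log(precededByWhitespace(broken, broken.length - 1)) // true
console.log(precededByWhitespace('_foo_', 4)) // false
```

The first call looks at the closing `_` of the turndown output, which directly follows the newline of the hard break. The second shows the ordinary case, where the closing `_` follows a letter.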
The same is true for the other possible span delimiters (*, __, **) and for a leading <br/> in a span element.
As far as I can tell only <br/> is affected. While <em><p>foo<p></em>bar and similar abominations do trip up the context-free replacement, they are fortunately not valid HTML.
Added a pull request that demonstrates this and other corner cases:
https://github.com/mixmark-io/turndown/pull/406
Can we avoid it somehow???
1. Zero-width space and/or non-breaking space:
<a href="https://bla-bla-bla">​​</a>text-text-text
produces:
[](https://bla-bla-bla)text-text-text
Is there any way to filter out (remove) html with zero visual content? Something like:
turndownService.addRule('al_spaces', {
  regexFilter: '<[^<>]+?>[[:space:]]<\/.+?>',
  replacement: function (content) {
    return ''
  }
})
2. Line break that breaks the Markdown markup:
<strong>bla-bla-bla<br></strong> <br>text-text-text
produces:
**bla-bla-bla
**
text-text-text
Is there any way to filter out (remove) all line breaks that precede a closing tag? Something like:
turndownService.removeAllBefore('<br>', '</*>')
https://github.com/mixmark-io/turndown/issues/423
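For the first case, turndown has no `regexFilter` option as far as I know, but a rule's `filter` may be a function that receives the DOM node. The sketch below is one possible approach, not something tested against a particular turndown version. The names `isVisuallyEmpty`, `addEmptyLinkRule` and the rule key `emptyLink` are made up for illustration.

```javascript
// Hypothetical helper: true when the text holds nothing but whitespace
// (JavaScript's \s already covers the non-breaking space) and
// zero-width characters (ZWSP, ZWNJ, ZWJ, word joiner).
function isVisuallyEmpty(text) {
  return /^[\s\u200B\u200C\u200D\u2060]*$/.test(text)
}

// Hypothetical wiring: registers a rule on an existing TurndownService
// instance that replaces such links with an empty string.
function addEmptyLinkRule(turndownService) {
  turndownService.addRule('emptyLink', {
    filter: function (node) {
      return node.nodeName === 'A' && isVisuallyEmpty(node.textContent)
    },
    replacement: function () {
      return ''
    }
  })
  return turndownService
}
```

Usage would be `addEmptyLinkRule(new TurndownService())`. Note that a link wrapping only an image also has empty text content, so a real rule would probably need to exclude links that contain an <img>.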
As far as I can tell only <br/> is affected.
Good to hear that, @zombiecalypse. Maybe this single exception could be handled in the rules by adding the span delimiters (_, *, __, **) before and after <br> or <br/>?
@Flashwalker, removing <br> is no good because the line break should be preserved in the Markdown.
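One way to read the suggestion above is to keep leading and trailing line breaks outside the delimiters, so the break survives and the delimiter stays right-flanking. This is only a sketch under that assumption. `wrapWithDelimiter`, `addSafeEmphasisRules` and the rule keys are hypothetical names; `options.emDelimiter` and `options.strongDelimiter` are the existing turndown options.

```javascript
// Hypothetical helper: wraps only the non-whitespace core of `content`
// in `delimiter`, leaving leading and trailing whitespace (including
// the "  \n" of a hard break) outside of it.
function wrapWithDelimiter(content, delimiter) {
  const [, leading, core, trailing] = /^(\s*)([\s\S]*?)(\s*)$/.exec(content)
  if (core === '') return leading + trailing // nothing to emphasise
  return leading + delimiter + core + delimiter + trailing
}

// Hypothetical wiring: overrides the emphasis and strong rules on an
// existing TurndownService instance.
function addSafeEmphasisRules(turndownService) {
  turndownService.addRule('safeEmphasis', {
    filter: ['em', 'i'],
    replacement: (content, node, options) =>
      wrapWithDelimiter(content, options.emDelimiter)
  })
  turndownService.addRule('safeStrong', {
    filter: ['strong', 'b'],
    replacement: (content, node, options) =>
      wrapWithDelimiter(content, options.strongDelimiter)
  })
  return turndownService
}

console.log(JSON.stringify(wrapWithDelimiter('foo  \n', '_'))) // "_foo_  \n"
```

A line break in the middle of the span is left alone, since emphasis may span a hard break as long as the delimiters themselves touch non-whitespace text.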
My code uses const markdown = convertToMarkdown( article.content.replaceAll('<br></em>', '</em><br>') );, but that is specific to the formatting I encountered in one article:
https://github.com/SARAsBooks/html-to-markdown/blob/04e64d6074bd95903c331d167bb6edc869977986/automationWorkflow.js#L45
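The same idea could be generalised to the other span tags and to leading breaks by pre-processing the HTML string before conversion. The sketch below is regex-based and therefore only suitable for simple markup. `hoistBreaks` is a hypothetical name, and the tag list (em, i, strong, b) is an assumption about which elements turndown turns into delimiters.

```javascript
const INLINE = 'em|i|strong|b'
const BR = '<br\\s*\\/?>'

// Hypothetical helper: moves <br> tags that sit directly before a
// closing span tag, or directly after an opening one, to the outside
// of that tag. Repeats until nothing changes so nested spans are
// handled as well.
function hoistBreaks(html) {
  const trailing = new RegExp(`((?:${BR}\\s*)+)(<\\/(?:${INLINE})>)`, 'gi')
  const leading = new RegExp(`(<(?:${INLINE})(?:\\s[^>]*)?>)((?:\\s*${BR})+)`, 'gi')
  let previous
  do {
    previous = html
    html = html.replace(trailing, '$2$1').replace(leading, '$2$1')
  } while (html !== previous)
  return html
}

console.log(hoistBreaks('<em>foo<br/></em>')) // <em>foo</em><br/>
```

The line breaks are kept, only their position relative to the span changes, which is why this differs from removing them.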