
Markdown writer creates nested emph and strong sections

Open CodeSandwich opened this issue 1 year ago • 3 comments

Explain the problem. The markdown writer doesn't catch nested emphasized and strong sections, leading to invalid formatting. Examples:

echo '<em>A<em>B</em>C</em>' | pandoc -f html -t markdown
*A*B*C*
# What the result is for the markdown reader:
echo '*A*B*C*' | pandoc -f markdown -t html
<p><em>A</em>B<em>C</em></p>
echo '<strong><strong>A</strong></strong>' | pandoc -f html -t markdown
****A****
# What the result is for the markdown reader:
echo '****A****' | pandoc -f markdown -t html
<p>****A****</p>
echo '<em><em>A</em></em>' | pandoc -f html -t markdown
**A**
# What the result is for the markdown reader:
echo '**A**' | pandoc -f markdown -t html
<p><strong>A</strong></p>
echo '<em><em>A</em>B</em>' | pandoc -f html -t markdown
**A*B*
# What the result is for the markdown reader:
echo '**A*B*' | pandoc -f markdown -t html
<p>**A<em>B</em></p>

Ideally the formatting state should be tracked and nested formatting that doesn't introduce any additional formatting should be a no-op.

Pandoc version? Linux pandoc 3.1.11.1 Features: +server +lua Scripting engine: Lua 5.4

CodeSandwich avatar Feb 27 '24 15:02 CodeSandwich

Note that these nestings will work for commonmark and derivatives (gfm etc.). And we use the same writer (with parameters) for markdown and commonmark.

We could either try to make the markdown parser smarter about these nestings... or adjust the writer so that, when it's producing pandoc markdown, it works around these issues, perhaps by using a _ for the outer emphasis.
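To make the writer-side workaround concrete, here is a minimal sketch in Python rather than in pandoc's Haskell writer. The `(kind, children)` tuples and the `render_alt` function are hypothetical names invented for illustration, not pandoc's AST or API. The idea is to alternate the emphasis delimiter with nesting depth, so the outer emphasis gets `_` and the next level gets `*`. It only addresses nested emphasis, not nested strong, and an underscore delimiter would be a poor fit for emphasis that starts or ends mid-word, since pandoc markdown by default does not treat intraword underscores as emphasis.

```python
# Hypothetical miniature inline tree (not pandoc's real AST):
# a node is either a plain string or a (kind, children) pair,
# where kind is "Emph" or "Strong".

def render_alt(inlines, emph_depth=0):
    """Render inlines, alternating the emphasis delimiter by nesting depth."""
    parts = []
    for node in inlines:
        if isinstance(node, str):
            parts.append(node)
            continue
        kind, children = node
        if kind == "Emph":
            # Outermost emphasis uses "_", the next level "*", and so on.
            mark = "_" if emph_depth % 2 == 0 else "*"
            parts.append(mark + render_alt(children, emph_depth + 1) + mark)
        else:
            # Strong is left untouched by this sketch.
            parts.append("**" + render_alt(children, emph_depth) + "**")
    return "".join(parts)


# Model of <em>A<em>B</em>C</em>
nested = [("Emph", ["A", ("Emph", ["B"]), "C"])]
print(render_alt(nested))
```

Whether the markdown reader round-trips every string produced this way is not verified here; the sketch only shows how the delimiter choice could depend on the nesting state.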

If I recall correctly, the markdown parser was changed to ignore sequences of >= 4 *s in order to avoid exponential performance issues that can arise.

jgm avatar Feb 28 '24 16:02 jgm

Note that the 3rd example (<em><em>A</em></em>, written as **A**) will cause problems for commonmark as well.

jgm avatar Feb 28 '24 16:02 jgm

I don't think this can be solved in the reader. Markdown syntax by design can't convey nested tags, and without new syntax the meaning of * and ** can only be inferred from context. For example, what does *A*B*C* mean? Should the A*B part go deeper into the nesting, or should it close the emphasized part? I think the current approach, which is to close the emphasis, is the sane one.

I think that the writer can simply drop the inner formatting information. It will be lossy, but only for the structure, not for what the user will see after rendering. If this is the desired approach, then the above examples should behave like this, which IMO seems reasonable:

echo '<em>A<em>B</em>C</em>' | pandoc -f html -t markdown
*ABC*
# What the result is for the markdown reader:
echo '*ABC*' | pandoc -f markdown -t html
<p><em>ABC</em></p>
echo '<strong><strong>A</strong></strong>' | pandoc -f html -t markdown
**A**
# What the result is for the markdown reader:
echo '**A**' | pandoc -f markdown -t html
<p><strong>A</strong></p>
echo '<em><em>A</em></em>' | pandoc -f html -t markdown
*A*
# What the result is for the markdown reader:
echo '*A*' | pandoc -f markdown -t html
<p><em>A</em></p>
echo '<em><em>A</em>B</em>' | pandoc -f html -t markdown
*AB*
# What the result is for the markdown reader:
echo '*AB*' | pandoc -f markdown -t html
<p><em>AB</em></p>
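The behaviour proposed above can be sketched as a small tree transformation. This is a rough model in Python, not pandoc's Haskell writer: the `(kind, children)` tuples and the `flatten` and `render` functions are hypothetical names made up for illustration. The sketch tracks which kinds of formatting are already active and splices in the children of any wrapper that would add nothing new.

```python
# Hypothetical miniature inline tree (not pandoc's real AST):
# a node is either a plain string or a (kind, children) pair,
# where kind is "Emph" or "Strong".

MARKS = {"Emph": "*", "Strong": "**"}


def flatten(inlines, active=frozenset()):
    """Drop any Emph/Strong wrapper whose kind is already active."""
    out = []
    for node in inlines:
        if isinstance(node, str):
            out.append(node)
            continue
        kind, children = node
        if kind in active:
            # Redundant wrapper: keep the children, drop the formatting.
            out.extend(flatten(children, active))
        else:
            out.append((kind, flatten(children, active | {kind})))
    return out


def render(inlines):
    """Render the tree with the usual * and ** delimiters."""
    parts = []
    for node in inlines:
        if isinstance(node, str):
            parts.append(node)
        else:
            kind, children = node
            parts.append(MARKS[kind] + render(children) + MARKS[kind])
    return "".join(parts)


# Model of <em>A<em>B</em>C</em>
print(render(flatten([("Emph", ["A", ("Emph", ["B"]), "C"])])))  # *ABC*
```

Because the active set is passed down through the recursion, a redundant wrapper is also dropped when another kind of formatting sits in between, for example an Emph inside a Strong inside an Emph.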

CodeSandwich avatar Feb 28 '24 17:02 CodeSandwich