pandoc Markdown writer creates nested emph and strong sections

The markdown writer doesn't detect nested emphasized and strong sections, so it emits markdown that the markdown reader parses back into different formatting. Examples:

echo '<em>A<em>B</em>C</em>' | pandoc -f html -t markdown
*A*B*C*
# What the result is for the markdown reader:
echo '*A*B*C*' | pandoc -f markdown -t html
<p><em>A</em>B<em>C</em></p>

echo '<strong><strong>A</strong></strong>' | pandoc -f html -t markdown
****A****
# What the result is for the markdown reader:
echo '****A****' | pandoc -f markdown -t html
<p>****A****</p>

echo '<em><em>A</em></em>' | pandoc -f html -t markdown
**A**
# What the result is for the markdown reader:
echo '**A**' | pandoc -f markdown -t html
<p><strong>A</strong></p>

echo '<em><em>A</em>B</em>' | pandoc -f html -t markdown
**A*B*
# What the result is for the markdown reader:
echo '**A*B*' | pandoc -f markdown -t html
<p>**A<em>B</em></p>

Ideally the formatting state should be tracked and nested formatting that doesn't introduce any additional formatting should be a no-op.
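To make the "track the formatting state" idea concrete, here is a minimal sketch in Python. It is not pandoc's writer code (pandoc is written in Haskell); the `Str`, `Emph`, and `Strong` classes and the `flatten` and `render` functions are hypothetical stand-ins for a tiny inline AST. The sketch only shows that wrappers repeating formatting already in effect can be dropped before rendering.

```python
from dataclasses import dataclass
from typing import List, Union


# Hypothetical miniature inline AST, not pandoc's real types.
@dataclass
class Str:
    text: str


@dataclass
class Emph:
    children: List["Inline"]


@dataclass
class Strong:
    children: List["Inline"]


Inline = Union[Str, Emph, Strong]


def merge_strs(inlines):
    """Join adjacent Str nodes so that 'A', 'B', 'C' become 'ABC'."""
    out = []
    for node in inlines:
        if isinstance(node, Str) and out and isinstance(out[-1], Str):
            out[-1] = Str(out[-1].text + node.text)
        else:
            out.append(node)
    return out


def flatten(inlines, in_emph=False, in_strong=False):
    """Drop Emph/Strong wrappers that repeat formatting already in effect."""
    out = []
    for node in inlines:
        if isinstance(node, Emph):
            inner = flatten(node.children, True, in_strong)
            # Already emphasized: splice the children in, skip the wrapper.
            if in_emph:
                out.extend(inner)
            else:
                out.append(Emph(inner))
        elif isinstance(node, Strong):
            inner = flatten(node.children, in_emph, True)
            if in_strong:
                out.extend(inner)
            else:
                out.append(Strong(inner))
        else:
            out.append(node)
    return merge_strs(out)


def render(inlines):
    """Render the miniature AST with '*' and '**' delimiters."""
    parts = []
    for node in inlines:
        if isinstance(node, Emph):
            parts.append("*" + render(node.children) + "*")
        elif isinstance(node, Strong):
            parts.append("**" + render(node.children) + "**")
        else:
            parts.append(node.text)
    return "".join(parts)
```

With this sketch, the four inputs above render as `*ABC*`, `**A**`, `*A*`, and `*AB*`. It does not handle other cases, such as two separate emphasized spans placed directly next to each other.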

Pandoc version: pandoc 3.1.11.1 on Linux. Features: +server +lua. Scripting engine: Lua 5.4.

Feb 27 '24 15:02 CodeSandwich

Note that these nestings will work for commonmark and derivatives (gfm etc.). And we use the same writer (with parameters) for markdown and commonmark.

We could either try to make the markdown parser smarter about these nestings... or adjust the writer so that, when it's producing pandoc markdown, it works around these issues, perhaps by using a _ for the outer emphasis.
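As a rough illustration of the second option, the sketch below picks `_` for an emphasis that directly contains another emphasis and alternates the delimiter at each nesting level. It is written in Python against a hypothetical miniature AST (plain strings and `("emph", children)` tuples), not pandoc's Haskell writer, and `render_emph` is an invented name. Whether the markdown reader parses every such output back as nested emphasis is an assumption that would need to be checked against pandoc itself.

```python
def render_emph(inlines, parent_delim=None):
    """Render nested emphasis, alternating '_' and '*' between levels.

    Nodes are either plain strings or ("emph", children) tuples.
    """
    parts = []
    for node in inlines:
        if isinstance(node, str):
            parts.append(node)
            continue
        kind, children = node
        if kind != "emph":
            raise ValueError(f"unsupported node kind: {kind}")
        if parent_delim is None:
            # Outermost emphasis: keep '*' unless it wraps another emphasis.
            has_nested = any(not isinstance(c, str) for c in children)
            delim = "_" if has_nested else "*"
        else:
            # Nested emphasis: use the delimiter the parent did not use.
            delim = "*" if parent_delim == "_" else "_"
        parts.append(delim + render_emph(children, delim) + delim)
    return "".join(parts)
```

For the first example in the report this produces `_A*B*C_`, while an emphasis without nesting still comes out as `*A*`. Strong sections are left out of the sketch; the `****A****` case would need separate handling.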

If I recall correctly, the markdown parser was changed to ignore sequences of >= 4 *s in order to avoid exponential performance issues that can arise.

Feb 28 '24 16:02 jgm

Note that the third example also causes problems for commonmark.

Feb 28 '24 16:02 jgm

I don't think that it can be solved in the reader. The markdown syntax by design can't convey nested tags, and without a new syntax the meaning of * and ** can only be inferred from the context. For example, what does *A*B*C* mean? Should the A*B part go deeper into the nesting, or should it close the emphasized part? I think that the current approach, which is to close the emphasis, is the sane one.

I think that the writer can simply drop the inner formatting information. It will be lossy, but only for the structure, not for what the user will see after rendering. If this is the desired approach, then the above examples should behave like this, which IMO seems reasonable:

echo '<em>A<em>B</em>C</em>' | pandoc -f html -t markdown
*ABC*
# What the result is for the markdown reader:
echo '*ABC*' | pandoc -f markdown -t html
<p><em>ABC</em></p>

echo '<strong><strong>A</strong></strong>' | pandoc -f html -t markdown
**A**
# What the result is for the markdown reader:
echo '**A**' | pandoc -f markdown -t html
<p><strong>A</strong></p>

echo '<em><em>A</em></em>' | pandoc -f html -t markdown
*A*
# What the result is for the markdown reader:
echo '*A*' | pandoc -f markdown -t html
<p><em>A</em></p>

echo '<em><em>A</em>B</em>' | pandoc -f html -t markdown
*AB*
# What the result is for the markdown reader:
echo '*AB*' | pandoc -f markdown -t html
<p><em>AB</em></p>

Feb 28 '24 17:02 CodeSandwich
