un-escaped characters for asciidoc output
Pandoc version 1.15.0.6 doesn't correctly escape asciidoc output
$ echo '<a href="http://example.com">][</a>' | pandoc -f html -t asciidoc
http://example.com[][]
which asciidoc would render back as ...
$ echo '<a href="http://example.com">][</a>' | pandoc -f html -t asciidoc|asciidoc - |grep example\.com
<div class="paragraph"><p><a href="http://example.com">http://example.com</a>[]</p></div>
Unfortunately, the rules for escaping asciidoc special chars are complex and I cannot point to a single place in the asciidoc documentation that describes them all. The general rule is that the '\' character is used to escape. So with correct quoting/escaping ...
$ echo '<a href="http://example.com">\][</a>' | pandoc -f html -t asciidoc|asciidoc - |grep example\.com
<div class="paragraph"><p><a href="http://example.com">][</a></p></div>
References
- http://www.methods.co.nz/asciidoc/userguide.html#X51
- http://www.methods.co.nz/asciidoc/faq.html#_how_can_i_escape_asciidoc_markup
- http://www.methods.co.nz/asciidoc/faq.html#_some_elements_can_8217_t_be_escaped_with_a_single_backslash (weird cases!)
- http://www.methods.co.nz/asciidoc/faq.html#_how_can_i_escape_a_list
- https://github.com/jgm/pandoc/issues/2334
It would be easy to escape all these special characters, but the output would likely be ugly. Not sure it's worth it if these cases are rare...
Escaping with backslashes is not that easy in asciidoc, because it is very picky about accepting a backslash escape only in exactly those cases where it would recognize a command (with exceptions); otherwise it will render a literal backslash. (I'm using asciidoctor as the reference here; I haven't tried the original implementation.)
E.g. escaping <<...>> to make asciidoc not render them as in-document references
\<<not a proper reference>
\<<proper reference>>
will render a backslash for the first line:
\<<not a proper reference> <<proper reference>>
Edit: Apparently, there is a much more reliable way to do this with passthroughs: ++<<++proper reference>> will work just fine. This is the unconstrained version of the +...+ passthrough markers. Here is the relevant section of the documentation: Escaping unconstrained quotes
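For illustration, here is a minimal Python sketch of that passthrough idea. It assumes only the << opener needs neutralizing, as in the example above. The helper name escape_xref_open is hypothetical and is not part of pandoc or asciidoctor.

```python
def escape_xref_open(text: str) -> str:
    """Wrap every literal '<<' in unconstrained '++' passthrough markers."""
    # Assumption: neutralizing the opener is enough, as reported above.
    return text.replace("<<", "++<<++")


print(escape_xref_open("<<proper reference>>"))
```

This only rewrites the source text; whether the result renders as intended still depends on the asciidoc processor in use.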
I believe I have two more instances of this, but with this mediawiki input:
# pandoc-mediawiki-asciidoc-bug.mediawiki file
Syntax defect begin <code>[a-zA-Z_][a-zA-Z0-9_]*</code> (syntax defect middle <code>__</code>) syntax defect near-end <code>[a-zA-Z_:][a-zA-Z0-9_:]*</code>. syntax defect end.
I have used variations of the phrase "syntax defect" as a way to sanitize and minimize the real-life source, and to illustrate the defect. Converting the file with pandoc -s -f mediawiki pandoc-mediawiki-asciidoc-bug.mediawiki -t asciidoc provides this output:
Syntax defect begin `[a-zA-Z_][a-zA-Z0-9_]*` (syntax defect middle `__`)
syntax defect near-end `[a-zA-Z_:][a-zA-Z0-9_:]*`. syntax defect end.
There are two escape issues with the output identified below with ^ characters:
Syntax defect begin `[a-zA-Z_][a-zA-Z0-9_]*` (syntax defect middle `__`)
                                          ^                         ^
syntax defect near-end `[a-zA-Z_:][a-zA-Z0-9_:]*`. syntax defect end.
Before the characters indicated by the ^ should be a literal \ to escape them, as in:
Syntax defect begin `[a-zA-Z_][a-zA-Z0-9_]\*` (syntax defect middle `\__`)
syntax defect near-end `[a-zA-Z_:][a-zA-Z0-9_:]*`. syntax defect end.
To summarize: I believe there are two separate defects broadly related to unescaped characters:
- The two regular expressions appear to interact with one another, with the * character in the first regex appearing to act as a bold start and the * in the second regex acting as the bold end.
- The __ appears to act as a single _ when it should be treated as a literal __ because it is between <code></code> mediawiki tags.
Version information:
pandoc 2.5
Compiled with pandoc-types 1.17.5.4, texmath 0.11.1.2, skylighting 0.7.4
MacOS 10.13.6, pandoc installed via homebrew.
Asciidoc is crazy!! With this input
`[0-9]*`
`[0-9]*`
asciidoctor gives you
<code><strong class="0-9"></code>
<code>[0-9]</strong></code>
which isn't even well-formed HTML. But with
`0-9*`
`0-9*`
you get
<code>0-9*</code>
<code>0-9*</code>
I have to believe this is a bug in asciidoctor and not the intended behavior. I'm not going to try to work around all these quirks.
Even worse, if you try to escape the *s in the first example above
`[0-9]\*`
`[0-9]\*`
you get
<code>[0-9]*</code>
<code>[0-9]\*</code>
The first backslash acts as an escape and the second one doesn't! If this is intentional, it's an insane design decision. How are users supposed to keep track of what a backslash does in these contexts??
@jgm Would it help if we opened an issue about this with the upstream project, or supported you (as the owner of this repo) in that endeavour?
@lisa If you'd like to inquire upstream about whether this is intended behavior, and ask them to clarify the escaping rules, that would be great.
Passthrough quotes fix this as well:
`++[0-9]*++`
`++[0-9]*++`
will produce the intended output.
Still also a bug in asciidoctor, as the output isn't proper html.
Escaping is also missing for these relatively simple cases:
$ pandoc --version
pandoc.exe 2.7.3
Compiled with pandoc-types 1.17.5.4, texmath 0.11.2.2, skylighting 0.8.1
[...]
$ echo "*Foo*" | pandoc -f html -t asciidoctor
*Foo*
$ echo "_Foo_" | pandoc -f html -t asciidoctor
_Foo_
Unfortunately I'm not sure what the "correct" output is in cases like this. According to https://asciidoctor.org/docs/asciidoc-syntax-quick-reference/#escaping-text I guess it would be \*Foo* and \_Foo_, but the handling of backslash escaping seems quite complex and this might break down in more complicated cases. There's also plus escaping, the pass macro, character replacement ({asterisk} for *, doesn't seem to be one for _) and possibly more options ...
Unfortunately escaping in asciidoc is not well designed.
+1 for escaping, at least in URLs
Passthrough quotes fix this as well:
`++[0-9]*++` `++[0-9]*++` will produce the intended output.
OK...but what if you want to have ++[0-9]*++, quoting the plus signs too? I tried
`\+\+[0-9]*\+\+`
`\+\+[0-9]*\+\+`
which yields
<code>+\+<strong class="0-9">\+\+</code>
<code>+\+[0-9]</strong>\+\+</code>
in which the first backslash acts as an escape but the others don't. Argh! Asciidoc needs some clear, consistent escaping rules.
I think this works
`pass:[++[0-9\]*++]`
You can use backslashes here to escape the ] in this context.
I have a local patch similar to what @mako4 suggests, but with one modification: it specifies that special character substitutions still apply; otherwise asciidoctor will pass special HTML characters through into the final document:
pass:specialcharacters[++[0-9\]*++]
asciidoctor allows "c" as an abbreviation for "specialcharacters". I haven't implemented that yet, but it makes things very slightly less ugly. If this is applied in escapeString only to texts that contain the special characters, the output isn't too ugly for prose.
The alternative is to define attributes for each special character:
:plus: +
:rbracket: ]
:lbracket: [
:star: *
{plus}{plus}{lbracket}0-9{rbracket}{star}{plus}{plus}
When there are relatively few special characters the latter looks better, when there are many the former looks better.
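A rough Python sketch of the attribute approach, in case it helps compare the two. The attribute names are the ones defined above. The function name to_attribute_refs is hypothetical. The sketch assumes the matching attribute entries are present in the document header, and it does not handle literal { or } in the input.

```python
# Attribute names taken from the definitions above; the document header
# would need the matching ":plus: +" style entries for these to resolve.
ATTRIBUTE_NAMES = {
    "+": "plus",
    "]": "rbracket",
    "[": "lbracket",
    "*": "star",
}


def to_attribute_refs(text: str) -> str:
    """Replace each mapped special character with an {attribute} reference."""
    return "".join(
        "{" + ATTRIBUTE_NAMES[ch] + "}" if ch in ATTRIBUTE_NAMES else ch
        for ch in text
    )


print(to_attribute_refs("++[0-9]*++"))
```

For the input ++[0-9]*++ this reproduces the attribute line shown above.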
So to summarize: for asciidoctor, at least, we can do
`pass:c[CODE]`
where CODE is the raw code with all ] characters backslash-escaped.
(Question: what about backslashes in the code, should they all be backslash-escaped too?)
@jgm no, you can't escape backslashes, which means the CODE part of pass:c[CODE] may not end with a backslash. Sigh.
Hm, it also means that if code contains \] already, the backslash will disappear.
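To make those constraints concrete, here is a small Python sketch. The function name pass_macro is hypothetical, and the choice to give up and return None is an assumption rather than what any patch does. It builds the pass:c[...] form and refuses the two inputs identified above as unsafe.

```python
from typing import Optional


def pass_macro(code: str) -> Optional[str]:
    """Return pass:c[...] for code, or None when the caveats above apply."""
    # A trailing backslash would escape the closing bracket of the macro.
    if code.endswith("\\"):
        return None
    # An existing "\]" would reportedly lose its backslash.
    if "\\]" in code:
        return None
    # Backslash-escape every "]" so it does not end the macro early.
    return "pass:c[" + code.replace("]", "\\]") + "]"


print(pass_macro("++[0-9]*++"))
```

A caller would still wrap the result in backticks for inline code, and would need some other fallback whenever None comes back.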
Just did an experiment: it looks like you can use numeric entities to escape special characters inside ...
Example
`[0-9]*`
output from asciidoc (original):
<p><code>&#x5b;&#x30;&#x2d;&#x39;&#x5d;&#x2a;</code></p>
output from asciidoctor:
<p><code>[0-9]*</code></p>
So that's an interesting behavior change!
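For reference, the entity string above can be produced mechanically. This is a minimal Python sketch with a hypothetical helper name, to_hex_entities. It encodes every character, which is an assumption made for simplicity; encoding only the special characters would presumably also work.

```python
def to_hex_entities(text: str) -> str:
    """Encode every character as a hexadecimal numeric character reference."""
    return "".join(f"&#x{ord(ch):x};" for ch in text)


print(to_hex_entities("[0-9]*"))
```

This only generates the entities. How each processor renders them is the behavior difference reported above.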
Also, I somehow missed this (maybe it's a new addition?), but this works in asciidoctor only:
{blank}{empty}{sp}{nbsp}{zwsp}{wj}{apos}{quot}{lsquo}{rsquo}{ldquo}{rdquo}{deg}{plus}{brvbar}{vbar}{amp}{lt}{gt}{startsb}{endsb}{caret}{asterisk}{tilde}{backslash}{backtick}{two-colons}{two-semicolons}{cpp}{pp}
Output from asciidoctor:
<p>  ​⁠'"‘’“”°+¦|&<>[]^*~\`::;;C++++</p>
Still an issue in latest pandoc:
(venv)kbroch@penguin:~ $ pandoc --version
pandoc 3.1.4
Features: +server +lua
Scripting engine: Lua 5.4
User data directory: /home/kbroch/.local/share/pandoc
Copyright (C) 2006-2023 John MacFarlane. Web: https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.
(venv)kbroch@penguin:~ $ pandoc -o debug.adoc debug-escaping-asterisk.docx
(venv)kbroch@penguin:~ $ cat debug.adoc
*There should be asterisks on either side of this*
(venv)kbroch@penguin:~ $
But I expected to see: \*There should be asterisks on either side of this\*
@kbroch-rivosinc Please see the comments above. The suggested output you give isn't correct. Given this input, asciidoctor yields the following HTML:
<p>*There should be asterisks on either side of this\*</p>
If it were just a matter of backslash-escaping all * signs, we could easily do that. But that won't work. The interpretation of backslashes is highly non-regular, and that's why this issue is still open...
@jgm: here's what I see from asciidoctor (sorry, I should have put this in the original comment):
(venv)kbroch@penguin:~ $ asciidoctor --version
Asciidoctor 2.0.20 [https://asciidoctor.org]
Runtime Environment (ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]) (lc:UTF-8 fs:UTF-8 in:UTF-8 ex:UTF-8)
(venv)kbroch@penguin:~ $ asciidoctor debug.adoc
(venv)kbroch@penguin:~ $ grep "should be asterisks" debug.html
<p><strong>There should be asterisks on either side of this</strong></p>
Well yes, asciidoc(tor) will turn
*hello*
into strong emphasis. But we're concerned here with how to represent literal asterisk characters. And asciidoctor turns
\*hello\*
into
<p>*hello\*</p>
So we can't simply backslash-escape all the literal asterisks as you suggested.
Thanks for the explanation. I see it was all explained above: https://github.com/jgm/pandoc/issues/2337#issuecomment-502476384. Sorry I didn't catch it the first time. I appreciate you taking the time to help.