pandoc un-escaped characters for asciidoc output


Pandoc version 1.15.0.6 doesn't correctly escape asciidoc output

$ echo '<a href="http://example.com">][</a>' | pandoc -f html -t asciidoc
http://example.com[][]

which asciidoc would render back as ...

$ echo '<a href="http://example.com">][</a>' | pandoc -f html -t asciidoc|asciidoc - |grep example\.com
<div class="paragraph"><p><a href="http://example.com">http://example.com</a>[]</p></div>

Unfortunately, the rules for escaping asciidoc special characters are complex, and I cannot point to a single place in the asciidoc documentation that covers them. The general rule is that the '\' character is used to escape. So with correct quoting/escaping ...

$ echo '<a href="http://example.com">\][</a>' | pandoc -f html -t asciidoc|asciidoc - |grep example\.com
<div class="paragraph"><p><a href="http://example.com">][</a></p></div>

References

http://www.methods.co.nz/asciidoc/userguide.html#X51
http://www.methods.co.nz/asciidoc/faq.html#_how_can_i_escape_asciidoc_markup
http://www.methods.co.nz/asciidoc/faq.html#_some_elements_can_8217_t_be_escaped_with_a_single_backslash (weird cases!)
http://www.methods.co.nz/asciidoc/faq.html#_how_can_i_escape_a_list
https://github.com/jgm/pandoc/issues/2334

Jul 30 '15 07:07 tsagkase

It would be easy to escape all these special characters, but the output would likely be ugly. Not sure it's worth it if these cases are rare...

May 06 '17 19:05 jgm

Escaping with backslashes is not that easy in asciidoc, because it is very picky: it only accepts a backslash escape in exactly those cases where it would otherwise recognize a command (with exceptions). In all other cases it renders the backslash literally. (I'm using asciidoctor as the reference here; I haven't tried the original implementation.)

E.g. escaping <<...>> to make asciidoc not render them as in-document references

\<<not a proper reference>

\<<proper reference>>

will render a backslash for the first line:

\<<not a proper reference> <<proper reference>>

Edit: Apparently, there is a much more reliable way to do this with passthroughs: ++<<++proper reference>> will work just fine. This is the unconstrained version of the +...+ passthrough markers. Here is the relevant section of the documentation: Escaping unconstrained quotes

Apr 10 '18 18:04 mako4

I believe I have two more instances of this, this time with mediawiki input:

# pandoc-mediawiki-asciidoc-bug.mediawiki file
Syntax defect begin <code>[a-zA-Z_][a-zA-Z0-9_]*</code> (syntax defect middle <code>__</code>) syntax defect near-end <code>[a-zA-Z_:][a-zA-Z0-9_:]*</code>. syntax defect end.

I have used variations of the phrase "syntax defect" as a way to sanitize and minimize the real-life source, and to illustrate the defect. Converting the file with pandoc -s -f mediawiki pandoc-mediawiki-asciidoc-bug.mediawiki -t asciidoc provides this output:

Syntax defect begin `[a-zA-Z_][a-zA-Z0-9_]*` (syntax defect middle `__`)
syntax defect near-end `[a-zA-Z_:][a-zA-Z0-9_:]*`. syntax defect end.

There are two escape issues with the output identified below with ^ characters:

Syntax defect begin `[a-zA-Z_][a-zA-Z0-9_]*` (syntax defect middle `__`)
                                          ^                         ^
syntax defect near-end `[a-zA-Z_:][a-zA-Z0-9_:]*`. syntax defect end.

Before the characters indicated by the ^ should be a literal \ to escape them, as in:

Syntax defect begin `[a-zA-Z_][a-zA-Z0-9_]\*` (syntax defect middle `\__`)
syntax defect near-end `[a-zA-Z_:][a-zA-Z0-9_:]*`. syntax defect end.

To summarize: I believe there are two separate defects broadly related to unescaped characters:

1. The two regular expressions appear to interact with one another, with the * character in the first regex appearing to act as a bold start and the * in the second regex acting as the bold end.
2. The __ appears to act as a single _ when it should be treated as a literal __ because it is between <code></code> mediawiki tags.

Version information:

pandoc 2.5
Compiled with pandoc-types 1.17.5.4, texmath 0.11.1.2, skylighting 0.7.4

MacOS 10.13.6, pandoc installed via homebrew.

Jan 09 '19 14:01 lisa

Asciidoc is crazy!! With this input

`[0-9]*`
`[0-9]*`

asciidoctor gives you

<code><strong class="0-9"></code>
<code>[0-9]</strong></code>

which isn't even well-formed HTML. But with

`0-9*`
`0-9*`

you get

<code>0-9*</code>
<code>0-9*</code>

I have to believe this is a bug in asciidoctor and not the intended behavior. I'm not going to try to work around all these quirks.

Jan 09 '19 17:01 jgm

Even worse, if you try to escape the *s in the first example above

`[0-9]\*`
`[0-9]\*`

you get

<code>[0-9]*</code>
<code>[0-9]\*</code>

The first backslash acts as an escape and the second one doesn't! If this is intentional, it's an insane design decision. How are users supposed to keep track of what a backslash does in these contexts??

Jan 09 '19 17:01 jgm

@jgm Would it help if we opened an issue about this with the upstream project, or supported you (as the owner of this repo) in that endeavour?

Jan 09 '19 18:01 lisa

@lisa If you'd like to inquire upstream about whether this is intended behavior, and ask them to clarify the escaping rules, that would be great.

Jan 09 '19 18:01 jgm

Passthrough quotes fix this as well:

`++[0-9]*++`
`++[0-9]*++`

will produce the intended output.

This is still a bug in asciidoctor as well, since the output isn't proper HTML.
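The passthrough approach described in this comment could be sketched in Python roughly as follows. This is only an illustration, not pandoc's code: the function name `inline_code_passthrough` is hypothetical, and the decision to reject any input that contains `++` or that starts or ends with `+` is a conservative assumption, since such input could close the passthrough early.

```python
def inline_code_passthrough(code):
    """Wrap code as AsciiDoc inline monospace with a ++ passthrough inside.

    Hypothetical helper, not part of pandoc. Raises ValueError for input
    that the ++...++ markers cannot safely delimit (an assumption made
    here to stay on the safe side).
    """
    if "++" in code or code.startswith("+") or code.endswith("+"):
        raise ValueError("code cannot be wrapped in a ++ passthrough")
    return "`++" + code + "++`"


print(inline_code_passthrough("[0-9]*"))
```

For `[0-9]*` this builds the same markup as the example above. Anything the helper rejects would need a different strategy.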

Feb 04 '19 09:02 mako4

Escaping is also missing for these relatively simple cases:

$ pandoc --version
pandoc.exe 2.7.3
Compiled with pandoc-types 1.17.5.4, texmath 0.11.2.2, skylighting 0.8.1
[...]
$ echo "*Foo*" | pandoc -f html -t asciidoctor
*Foo*
$ echo "_Foo_" | pandoc -f html -t asciidoctor
_Foo_

Unfortunately I'm not sure what the "correct" output is in cases like this. According to https://asciidoctor.org/docs/asciidoc-syntax-quick-reference/#escaping-text I guess it would be \*Foo* and \_Foo_, but the handling of backslash escaping seems quite complex and this might break down in more complicated cases. There's also plus escaping, the pass macro, character replacement ({asterisk} for *, doesn't seem to be one for _) and possibly more options ...

Jun 16 '19 18:06 henribru

Unfortunately escaping in asciidoc is not well designed.

Jun 21 '19 17:06 jgm

+1 for escaping, at least in URLs

Jul 02 '19 21:07 grv87

Passthrough quotes fix this as well:
`++[0-9]*++`
`++[0-9]*++`
will produce the intended output.

OK...but what if you want to have ++[0-9]*++, quoting the plus signs too? I tried

`\+\+[0-9]*\+\+`
`\+\+[0-9]*\+\+`

which yields

<code>+\+<strong class="0-9">\+\+</code>
<code>+\+[0-9]</strong>\+\+</code>

in which the first backslash acts as an escape but the others don't. Argh! Asciidoc needs some clear, consistent escaping rules.

Sep 02 '19 16:09 jgm

I think this works

`pass:[++[0-9\]*++]`

You can use backslashes here to escape the ] in this context.

Nov 04 '19 15:11 mako4

I have a local patch similar to what @mako4 suggests, but with one modification: it specifies that special character substitutions still apply. Otherwise asciidoctor will pass special HTML characters through into the final document:

pass:specialcharacters[++[0-9\]*++]

asciidoctor allows "c" as an abbreviation for "specialcharacters". I haven't implemented that yet, but it makes things very slightly less ugly. If this is applied in escapeString only to texts that contain special characters, the output isn't too ugly for prose.

The alternative is to define attributes for each special character:

:plus: +
:rbracket: ]
:lbracket: [
:star: *

{plus}{plus}{lbracket}0-9{rbracket}{star}{plus}{plus}

When there are relatively few special characters the latter looks better; when there are many the former looks better.
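A minimal Python sketch of the attribute-based alternative, assuming the four user-defined attribute names from this comment (`plus`, `rbracket`, `lbracket`, `star`). The function `escape_with_attributes` is hypothetical and is not the local patch mentioned above. It returns the attribute definitions that the document header would need along with the rewritten text.

```python
# User-defined attribute names from the comment above; each one used must be
# declared in the document header before it is referenced.
ATTRIBUTE_NAMES = {"+": "plus", "]": "rbracket", "[": "lbracket", "*": "star"}


def escape_with_attributes(text):
    """Return (header_lines, body) with special characters replaced by {name}.

    Hypothetical helper, not pandoc code. It does not handle text that
    already contains "{...}" sequences, which AsciiDoc may treat as
    attribute references.
    """
    used = []  # special characters seen, in order of first use
    body = []
    for ch in text:
        name = ATTRIBUTE_NAMES.get(ch)
        if name is None:
            body.append(ch)
            continue
        if ch not in used:
            used.append(ch)
        body.append("{" + name + "}")
    header = [":{}: {}".format(ATTRIBUTE_NAMES[ch], ch) for ch in used]
    return header, "".join(body)


header, body = escape_with_attributes("++[0-9]*++")
print("\n".join(header))
print()
print(body)
```

Only the attributes that are actually used get declared, which keeps the header short when a document contains few special characters.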

Apr 29 '21 20:04 jasom

So to summarize: for asciidoctor, at least, we can do

`pass:c[CODE]`

where CODE is the raw code with all ] characters backslash-escaped. (Question: what about backslashes in the code, should they all be backslash-escaped too?)

May 17 '22 02:05 jgm

@jgm no, you can't escape backslashes, which means the CODE part of pass:c[CODE] may not end with a backslash (sigh).

May 17 '22 03:05 jasom

Hm, it also means that if code contains \] already, the backslash will disappear.
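Putting the last few comments together, a writer could emit `pass:c[...]` only when it is safe to do so. The following Python sketch is an assumption about how that might look, not pandoc's implementation. The name `pass_c_inline_code` is hypothetical, and raising an error in the two problem cases (a trailing backslash, or an existing `\]` in the code) is just one possible policy.

```python
def pass_c_inline_code(code):
    """Build `pass:c[CODE]` with every ] in CODE backslash-escaped.

    Hypothetical helper. It refuses the two cases discussed above, because
    backslashes themselves cannot be escaped inside pass:c[...].
    """
    if code.endswith("\\"):
        # A trailing backslash would escape the closing bracket of the macro.
        raise ValueError("code ends with a backslash")
    if "\\]" in code:
        # An existing \] would lose its backslash when rendered.
        raise ValueError("code already contains a backslash before ]")
    return "`pass:c[" + code.replace("]", "\\]") + "]`"


print(pass_c_inline_code("++[0-9]*++"))
```

Input that triggers either error would have to fall back to another mechanism, such as attribute references.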

May 17 '22 04:05 jgm

Just did an experiment: it looks like you can use numeric entities to escape special characters inside ... Example

`&#x5b;&#x30;&#x2d;&#x39;&#x5d;&#x2a;`

output from asciidoc (original):

<p><code>&amp;#x5b;&amp;#x30;&amp;#x2d;&amp;#x39;&amp;#x5d;&amp;#x2a;</code></p>

output from asciidoctor:

<p><code>&#x5b;&#x30;&#x2d;&#x39;&#x5d;&#x2a;</code></p>

So that's an interesting behavior change!
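For reference, the entity encoding used in this experiment can be generated mechanically. The Python sketch below is an illustration only, and the function name `to_numeric_entities` is hypothetical. It encodes every character as a hexadecimal numeric entity, which, according to the output above, only asciidoctor renders as intended.

```python
def to_numeric_entities(code):
    """Encode each character of code as a hexadecimal numeric entity.

    Hypothetical helper. Encoding every character is the bluntest option;
    a real writer would probably encode only the special characters.
    """
    return "".join("&#x{:x};".format(ord(ch)) for ch in code)


# Wrap the encoded text in backticks to form inline monospace markup.
print("`" + to_numeric_entities("[0-9]*") + "`")
```

The original asciidoc implementation escapes the ampersands, so this encoding would show up as literal text there.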

May 17 '22 04:05 jgm

Also, I somehow missed this (maybe it's a new addition?), but this works in asciidoctor only:

{blank}{empty}{sp}{nbsp}{zwsp}{wj}{apos}{quot}{lsquo}{rsquo}{ldquo}{rdquo}{deg}{plus}{brvbar}{vbar}{amp}{lt}{gt}{startsb}{endsb}{caret}{asterisk}{tilde}{backslash}{backtick}{two-colons}{two-semicolons}{cpp}{pp}

Output from asciidoctor:

<p> &#160;&#8203;&#8288;&#39;&#34;&#8216;&#8217;&#8220;&#8221;&#176;&#43;&#166;|&<>[]^*~\`::;;C&#43;&#43;&#43;&#43;</p>

May 17 '22 04:05 jasom

Still an issue in latest pandoc:

(venv)kbroch@penguin:~ $ pandoc --version
pandoc 3.1.4
Features: +server +lua
Scripting engine: Lua 5.4
User data directory: /home/kbroch/.local/share/pandoc
Copyright (C) 2006-2023 John MacFarlane. Web: https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.
(venv)kbroch@penguin:~ $ pandoc -o debug.adoc debug-escaping-asterisk.docx 
(venv)kbroch@penguin:~ $ cat debug.adoc 
*There should be asterisks on either side of this*
(venv)kbroch@penguin:~ $

But should see: \*There should be asterisks on either side of this\*

debug-escaping-asterisk.docx

Jul 05 '23 06:07 kbroch-rivosinc

@kbroch-rivosinc Please see the comments above. The suggested output you give isn't correct. Given this input, asciidoctor yields the following HTML:

<p>*There should be asterisks on either side of this\*</p>

If it were just a matter of backslash-escaping all * signs, we could easily do that. But that won't work. The interpretation of backslashes is highly non-regular, and that's why this issue is still open...

Jul 05 '23 15:07 jgm

@jgm : here's what I see from asciidoctor (sorry I should have put this in original comment):

(venv)kbroch@penguin:~ $ asciidoctor --version
Asciidoctor 2.0.20 [https://asciidoctor.org]
Runtime Environment (ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]) (lc:UTF-8 fs:UTF-8 in:UTF-8 ex:UTF-8)
(venv)kbroch@penguin:~ $ asciidoctor debug.adoc 
(venv)kbroch@penguin:~ $ grep "should be asterisks" debug.html 
<p><strong>There should be asterisks on either side of this</strong></p>

Jul 06 '23 06:07 kbroch-rivosinc

Well yes, asciidoc(tor) will turn

*hello*

into strong emphasis. But we're concerned here with how to represent literal asterisk characters. And asciidoctor turns

\*hello\*

into

<p>*hello\*</p>

So we can't simply backslash-escape all the literal asterisks as you suggested.

Jul 06 '23 15:07 jgm

Thanks for the explanation. I see that all of this was explained above: https://github.com/jgm/pandoc/issues/2337#issuecomment-502476384. Sorry I didn't catch it the first time. I appreciate you taking the time to help.

Jul 06 '23 17:07 kbroch-rivosinc
