pandoc
Pandoc does not convert a URL from HTML to asciidoc correctly
Explain the problem. Include the exact command line you used and all inputs necessary to reproduce the issue. Please create as minimal an example as possible, to help the maintainers isolate the problem. Explain the output you received and how it differs from what you expected.
Pandoc version? What version of pandoc are you using, on what OS?
Fedora 36. Output of pandoc --version: pandoc 2.14.0.3
Problem:
The URL/link is not converted correctly from HTML to adoc by Pandoc (reverse_adoc does not have this problem).
I have provided the sample files (HTML, converted file by Pandoc and Reverse_adoc) as a zip file in this ticket.
Pandoc command used:
pandoc -t asciidoctor -f html original.html -o pandoc.adoc
Samples: test.zip
Can you say what needs to be changed in the adoc output?
Is the problem the line break after [ on line 30?
Firstly, to answer your question, it is line 27 and line 28 in the pandoc.adoc file.
I got an update from Asciidoctor discussion forum (by Dan Allen). Here's more information:
- The source HTML is exported from Atlassian Confluence
- Dan points out that there are <u> tags in the HTML that wrap the URL (<a>). This causes a problem in the HTML to adoc conversion.
<u>foo</u>
Produces:
+++foo+++
After conversion to adoc, the URL is not displayed correctly when the adoc is viewed in HTML format.
- He suggested that the <u> tags should be removed in the source HTML. I tested it and pandoc performed the correct conversion.
- So Pandoc may consider ignoring the <u> tags when doing the conversion. BTW, I just spotted another problem when I manually removed the <u> tags in the HTML (the first link inside the Reference section of the HTML):
https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724[12.1 manual, Database Backup and Recovery User's Guide: [.enumeration_chapter]#Chapter 28# Transporting Data Across Platforms, Steps to Transport a Database to a Different Platform Using Backup Sets]
becomes
12.1 manual
instead of
12.1 manual, Database Backup and Recovery User's Guide: [.enumeration_chapter]#Chapter 28# Transporting Data Across Platforms, Steps to Transport a Database to a Different Platform Using Backup Sets
To be clear, then, is there no way to represent a link inside underline in adoc? That seems unfortunate if so.
With the other issue you found, please give the HTML of the link, the asciidoc output produced by pandoc, and the asciidoc you think it should have produced instead.
I am no expert on AsciiDoc, but I think Confluence may be producing extra/wrong output by using <u> tags. Here's the original layout in Confluence; it is just URL links.
OK, the information below is referring to point 4 in my last reply (the link for 12.1 manual).
I have extracted the part from the source HTML to narrow down the problem.
I used the command cat -vet to display the line breaks.
$ cat -vet another-url-with-u-tags.html
<!DOCTYPE html>$
<html>$
<body>$
<p>$
<a href="https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724" class="external-link" rel="nofollow"><u>12.1 manual, Database Backup and Recovery User's Guide: <span class="enumeration_chapter">Chapter 28</span> Transporting Data Across Platforms, Steps to Transport a Database to a Different Platform Using Backup Sets</u>$
</a>$
</p>$
</body>$
</html>$
The conversion is performed on the HTML with <u> and </u> removed.
$ cat -vet another-url-without-u-tags.html
<!DOCTYPE html>$
<html>$
<body>$
<p>$
<a href="https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724" class="external-link" rel="nofollow">12.1 manual, Database Backup and Recovery User's Guide: <span class="enumeration_chapter">Chapter 28</span> Transporting Data Across Platforms, Steps to Transport a Database to a Different Platform Using Backup Sets$
</a>$
</p>$
</body>$
</html>$
1. Normal conversion by Pandoc
$ pandoc -t asciidoctor -f html -o pandoc.adoc another-url-without-u-tags.html
$ cat -vet pandoc.adoc
https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724[12.1$
manual, Database Backup and Recovery User's Guide:$
[.enumeration_chapter]#Chapter 28# Transporting Data Across Platforms,$
Steps to Transport a Database to a Different Platform Using Backup Sets]$
2. Pandoc conversion with --wrap=preserve
$ pandoc --wrap=preserve -t asciidoctor -f html -o pandoc-preserve-wrap.adoc another-url-without-u-tags.html
$ cat -vet pandoc-preserve-wrap.adoc
https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724[12.1 manual, Database Backup and Recovery User's Guide: [.enumeration_chapter]#Chapter 28# Transporting Data Across Platforms, Steps to Transport a Database to a Different Platform Using Backup Sets]$
3. Desired output
$ cat -vet desired.adoc
https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724[12.1 manual, Database Backup and Recovery User's Guide: Chapter 28 Transporting Data Across Platforms, Steps to Transport a Database to a Different Platform Using Backup Sets]$
$
Here's the preview on the combined output (in order):
This really looks to me like a bug in asciidoctor. This asciidoc
https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724[12.1
manual, Database Backup and Recovery User's Guide:
[.enumeration_chapter]#Chapter 28# Transporting Data Across Platforms,
Steps to Transport a Database to a Different Platform Using Backup Sets]
gets converted by asciidoc to
<p><a href="https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724">12.1
manual, Database Backup and Recovery User’s Guide:
<span class=".enumeration_chapter">Chapter 28</span> Transporting Data Across Platforms,
Steps to Transport a Database to a Different Platform Using Backup Sets</a></p>
which is fine. But asciidoctor converts it to
<p><a href="https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724">12.1 manual</a></p>
and just drops the rest. That is a bug in asciidoctor, no? If not, can someone point to something in asciidoctor's documentation that explains why we get this output?
Looks like it's a bug in AsciiDoctor. Let me check with AsciiDoctor for this part. (the first URL in Reference section)
It's not a bug in Asciidoctor. The AsciiDoc language now supports attributes in a link macro. The parsing rules are clearly described here: https://docs.asciidoctor.org/asciidoc/latest/macros/link-macro-attribute-parsing/#linked-text-alongside-named-attributes (It's the phrase with the role inside the link text that's introducing the = sign).
The only way this can be expressed in modern AsciiDoc is as follows:
https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724[12.1
manual, Database Backup and Recovery User's Guide:
pass:n[[.enumeration_chapter\]#Chapter 28#] Transporting Data Across Platforms,
Steps to Transport a Database to a Different Platform Using Backup Sets]
To be clear, then, is there no way to represent a link inside underline in adoc?
There is. The way to represent underline text is as follows:
[.underline]#text#
<u> is a formatting element, not a semantic element. Therefore the AsciiDoc language does not provide a direct translation for it. Instead, it correctly maps it to a phrase role (as it does for other formatting roles). See https://docs.asciidoctor.org/asciidoc/latest/text/text-span-built-in-roles/#built-in-roles-for-text
One thing I had noticed is that Confluence produced the HTML code with <span class="enumeration_chapter">Chapter 28</span> inside the link description.
For the screenshot mentioned before, the desired output is actually generated by a program called reverse_adoc. They removed the enumeration_chapter class and just placed the text "Chapter 28" into the text description. I cannot tell if it is semantically correct.
So based on the discussion in AsciiDoctor forum, I think this might be the desired output?
https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724["12.1 manual, Database Backup and Recovery User's Guide: \[.enumeration_chapter\]#Chapter 28# Transporting Data Across Platforms, Steps to Transport a Database to a Different Platform Using Backup Sets"]
So based on the discussion in AsciiDoctor forum, I think this might be the desired output?
That's not correct. The parsable output would be:
https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724[12.1
manual, Database Backup and Recovery User's Guide:
pass:n[[.enumeration_chapter\]#Chapter 28#] Transporting Data Across Platforms,
Steps to Transport a Database to a Different Platform Using Backup Sets]
However, I seriously question whether pandoc should be producing this output. It would be better to remove the formatting in the link text (or at least phrases with roles). We don't want to encourage this kind of complex markup as the whole point of AsciiDoc is to keep the markup as simple as possible. Thus, I agree with this suggestion:
The desired output is actually generated by a program called reverse_adoc. They removed the enumeration_chapter class and just place the text "Chapter 28" into the text description.
(It's the phrase with the role inside the link text that's introducing the = sign).
Sorry, you lost me there. In the original text
https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724[12.1
manual, Database Backup and Recovery User's Guide:
[.enumeration_chapter]#Chapter 28# Transporting Data Across Platforms,
Steps to Transport a Database to a Different Platform Using Backup Sets]
there is no = sign. And if you take out the [.enumeration_chapter]#Chapter 28#, the whole thing becomes part of the link text. How does [.enumeration_chapter]#Chapter 28# introduce an = sign?
In any case, seems like a fragile syntax -- language designers may want to reconsider it, if link attributes can be accidentally triggered so easily.
How does [.enumeration_chapter]#Chapter 28# introduce an = sign?
It's added when the phrase with role is converted. What the parser sees is:
<span class="enumeration_chapter">Chapter 28</span>
That's where you get the equal sign.
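A toy model may help make this concrete. The sketch below is not Asciidoctor's parser; it only mimics the two steps described in this thread, under the assumption that link text containing an = sign is read as an attribute list whose first comma-separated entry becomes the link text. The names substitute_role_phrases and toy_link_text are hypothetical.

```python
import re

# Toy model only: NOT Asciidoctor's implementation.
# Step 1: a phrase with a role, [.role]#text#, becomes <span class="role">text</span>.
# Step 2 (assumed simplification): if the converted link text contains "=",
# it is treated as an attribute list and only the part before the first
# comma is kept as the link text.
ROLE_PHRASE = re.compile(r'\[\.([\w-]+)\]#([^#]*)#')


def substitute_role_phrases(text: str) -> str:
    """Convert role phrases to the HTML span the parser would see."""
    return ROLE_PHRASE.sub(r'<span class="\1">\2</span>', text)


def toy_link_text(text: str) -> str:
    """Return the link text this toy model would keep."""
    converted = substitute_role_phrases(text)
    if "=" in converted and "," in converted:
        return converted.split(",", 1)[0].strip()
    return converted


plain = "12.1 manual, Database Backup and Recovery User's Guide"
with_role = "12.1 manual, Guide: [.enumeration_chapter]#Chapter 28# Transporting Data"

print(toy_link_text(plain))      # whole text is kept: no "=" was introduced
print(toy_link_text(with_role))  # 12.1 manual
```

In this model the = sign never appears in the AsciiDoc source; it only appears after the role phrase has been converted, which is why removing the phrase makes the whole text part of the link again.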
In any case, seems like a fragile syntax
Perhaps. Lightweight markup languages are not designed to be perfectly robust. They are designed to be concise. And I don't believe link text should have formatting in it. So I consider this to be a reasonable tradeoff. I'm open to discussions about it. That's just where I currently stand.
With the current situation, I'm really at a loss as to how to handle this better in pandoc. I tried putting the whole link text in quotes, as suggested in the manual when it contains commas, but this results in malformed HTML, I guess because of the quotes introduced by the interpolation of the span...
<a href="https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724">12.1 manual, Database Backup and Recovery User’s Guide: <span class=</a>
We could remove all spans, links, images from link text, I suppose. Or we could try to use the complex escaping method you illustrate above (which seems to require some delicate backslash-escaping of ]).
@mojavelinux Here are two suggestions (assuming you have something to do with the language spec).
- If the comma is escaped, it should not introduce attributes. That would provide a simple workaround, consistent with the general principle that backslash-escaping special characters defeats their usual special meanings. Unfortunately backslash-escaping in asciidoc is a complete mess, but maybe that can be cleaned up. Looks like currently the comma cannot be backslash-escaped.
- Don't treat the comma as introducing attributes if what follows is not of the proper form to be an attribute, e.g. role=...
First, we can continue this discussion without the accusations. I have read "this is a bug in Asciidoctor", "seems like a fragile syntax", and "backslash escaping in AsciiDoc is a complete mess". There's just no need for that kind of attitude and it makes me want to walk away from this situation. If you want my input, please be respectful of the immense time, effort, and dedication I have put into this language.
We recognize that there's room for improvement in the syntax, just as there are with all things in life. That's a key part of why we formed the AsciiDoc Language project to specify and evolve the language. While I lead Asciidoctor and helped launch the effort for the language specification, changes to the language and the parsing rules have to be done through that project.
Until that project starts to move forward, the language is what it is right now. I can't make changes in Asciidoctor that change the parsing rules. Therefore, pandoc should position itself to work with the AsciiDoc language as it currently stands (based on the initial contribution, which is https://docs.asciidoctor.org/asciidoc/latest/).
There are two options I can suggest:
- Drop the role on an inline phrase inside of link text; such syntax is essentially forbidden in the AsciiDoc language right now, so you aren't doing the wrong thing
- Enclose the inline phrase in an inline passthrough, as I showed above
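The first option can be sketched as a small text transformation. This is only an illustration on AsciiDoc strings, not how pandoc's writer works (pandoc would drop the class on its internal span before rendering); drop_roles and ROLE_PHRASE are hypothetical names, and the pattern assumes simple, non-nested role phrases.

```python
import re

# Hypothetical helper, not part of pandoc or Asciidoctor.
# Matches [.role]#text# (one or more dotted roles, no nested '#').
ROLE_PHRASE = re.compile(r'\[(?:\.[\w-]+)+\]#([^#]*)#')


def drop_roles(link_text: str) -> str:
    """Replace each role phrase with its bare text."""
    return ROLE_PHRASE.sub(r'\1', link_text)


print(drop_roles("Guide: [.enumeration_chapter]#Chapter 28# Transporting Data"))
# Guide: Chapter 28 Transporting Data
```

This gives the same link text that reverse_adoc produced in the sample: the role is gone and only "Chapter 28" remains.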
On a related note, there is no requirement for an AsciiDoc converter to generate <span class="underline">underline me</span> from [.underline]#underline me#. It could just as well produce <u>underline me</u>. The built-in converter just happens to produce the former for the reason I already cited about the use of a <u> tag. But that decision is downstream from what pandoc (or a writer) produces.
Of course, I respect the amount of time and dedication it takes to work on a light markup language. The comments were intended as constructive suggestions, but their tone probably reflects the frustration I've had over the years trying to get pandoc to do the right thing in its asciidoc output. Let me avoid the negative tone of "complete mess" and just say that I have no understanding of how escaping works in asciidoc. Because of that, I'm hesitant to go with option 2, since it requires escaping things and I'm not sure I'd get it right in full generality. But maybe you can explain it. In this case, the desired output is:
pass:n[[.enumeration_chapter\]#Chapter 28#]
As a general method, would it be sufficient to follow this recipe?
1. render the element as it would be rendered outside of a link
2. add a backslash in front of every ] in the result of 1
3. enclose the result of 2 in pass:n[...]
Will this work even when the element contains verbatim ] characters, e.g., HTML <span class="foo"><code>]</code></span>?
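The mechanical part of the proposed recipe can be sketched as follows. This is a minimal illustration, assuming the element has already been rendered to an AsciiDoc string (step 1); wrap_in_passthrough is a hypothetical name, and the sketch makes no claim about elements that must render a literal ] character, which is exactly the open question here.

```python
def wrap_in_passthrough(rendered: str) -> str:
    """Apply steps 2 and 3 of the recipe to an already-rendered element."""
    # Step 2: backslash-escape every closing square bracket.
    escaped = rendered.replace("]", "\\]")
    # Step 3: enclose the result in pass:n[...].
    return "pass:n[" + escaped + "]"


print(wrap_in_passthrough("[.enumeration_chapter]#Chapter 28#"))
# pass:n[[.enumeration_chapter\]#Chapter 28#]
```

For the span in the original example this reproduces the desired output quoted above.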
As for option 1: what, exactly, would we need to worry about inside link text? Do we have to avoid anything that renders in HTML with an =? If so, that includes images. Since a lot of people put images in the link text, this could be a big limitation.
Thank you for acknowledging my concern. I will now continue to engage in this thread.
I have no understanding of how escaping works in asciidoc.
The rules have been documented the best way we can document them in the following two places:
- https://docs.asciidoctor.org/asciidoc/latest/subs/prevent/
- https://docs.asciidoctor.org/asciidoc/latest/pass/pass-macro/
It's well known that the escaping in AsciiDoc is not universal; nor is it intended to be. And while it may (perhaps even likely) be something the language project considers adding, the language tries not to enable the writer to use a heavy amount of formatting because it goes against our tenets. If a writer needs that amount of complexity, then HTML, DocBook, or LaTeX is what the writer should be using.
Having said that, the passthrough macro provides closer to the universal escaping that you're looking for. It takes everything from the left square bracket to the next right square bracket not preceded by a backslash. It then un-escapes any escaped right square brackets. So you can escape all right square brackets within the enclosed text and it will do the right thing. However, keep in mind that the passthrough macro cannot be nested.
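The matching rule just described can be modeled in a few lines. This is a simplified sketch of the stated rule only, not Asciidoctor's code: read_passthrough is a hypothetical name, and it hard-codes the pass:n[ prefix instead of handling other substitution lists.

```python
def read_passthrough(source: str, start: int = 0):
    """Return (body, end_index) for a pass:n[...] macro at `start`, else None.

    Simplified model: scan to the first ']' that is not preceded by a
    backslash, then un-escape any '\\]' inside the captured body.
    """
    prefix = "pass:n["
    if not source.startswith(prefix, start):
        return None
    body_start = start + len(prefix)
    i = body_start
    while i < len(source):
        if source[i] == "]" and source[i - 1] != "\\":
            body = source[body_start:i].replace("\\]", "]")
            return body, i + 1
        i += 1
    return None  # no unescaped closing bracket found


text = r"pass:n[[.enumeration_chapter\]#Chapter 28#] Transporting Data"
body, end = read_passthrough(text)
print(body)        # [.enumeration_chapter]#Chapter 28#
print(text[end:])  #  Transporting Data

# Nesting is not supported: the first unescaped ']' ends the macro.
print(read_passthrough("pass:n[a pass:n[b] c]")[0])  # a pass:n[b
```

The last call shows why the macro cannot be nested: the inner closing bracket ends the outer passthrough.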
Will this work even when the element contains verbatim ] characters
No, it will not. But these characters could be escaped using &#93; (as we do in Asciidoctor PDF). When trying to neutralize characters which have meaning in the syntax, using character references is often a workable strategy.
Do we have to avoid anything that renders in HTML with an =?
No. Substitution order matters a lot here. Inline images are substituted after the link macro. So it's safe to put an image in the link text (as long as its close square bracket is escaped). The problematic markup lies almost entirely with text formatting (which in AsciiDoc is currently called the "quotes substitution"). In other words, this markup: https://docs.asciidoctor.org/asciidoc/latest/text/#inline-text-and-punctuation-styles.
So we'd need to remove Strong, Emph, Code, and all other inline formatting for option 1?
And also backslash-escape any closing square brackets?
And also do something with any literal = signs that happen to be there, perhaps using entities?
But we can avoid doing any of this as long as the link text doesn't contain a comma?
So we'd need to remove Strong, Emph, Code, and all other inline formatting for option 1?
You'd need to remove any roles (i.e., CSS classes). The formatting itself is fine. It's the introduction of what's indistinguishable from an attribute on an inline macro that's the problem (e.g., key="value").
You'd need to remove any roles (i.e., CSS classes). The formatting itself is fine. It's the introduction of what's indistinguishable from an attribute on an inline macro that's the problem (e.g., key="value").
But how do I know which things will get substituted by something with key="value" in your toolchain? You linked to https://docs.asciidoctor.org/asciidoc/latest/text/#inline-text-and-punctuation-styles , so I was assuming all of those...
I can offer what we've written about the language, but I can't do all the work for you. It's necessary to study and understand the language and its processor to know what decisions to make. From my viewpoint, that's part of the work of making a language translator. I'm happy to answer questions as they come up, but that's all that I can offer to do.
I would have thought that as a promoter of the language, it would be in your interest to have good tools for converting to it from other formats. I'm just not interested enough in asciidoc to spend more time on this, so I'm going to drop this thread. Maybe someone else will be interested enough to figure out how to handle these cases.
I'm happy to answer questions as they come up, but that's all that I can offer to do.
I did ask a question, above. So if you are really happy to answer questions, what is the answer?
Again with that attitude. I don't understand why you have to come at me like that when I'm offering my time to help you with your project. It's your project that's offering to translate to AsciiDoc, so I don't see why you are acting put upon that you actually have to learn the rules of the language. As I've said before, I'm very happy to answer your questions (and I go out of my way to do so), but ultimately this is not my project. I don't appreciate you trying to guilt me into making it my responsibility.
I don't see why you are acting put upon that you actually have to learn the rules of the language.
I have never used asciidoc, nor did I write this part of the code. The writer was contributed long ago by a third party. I'm happy to improve it in response to requests from asciidoc users, but I don't have time to become an expert in this format, so I need to rely on those who are.
I found a very simple solution. When there are commas in the link text, I convert them to numeric entities. That works well for the original case, above, and avoids the complexity of passthrough syntax. It could be that it has other unforeseen consequences; if so, please open a new issue.
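The idea can be illustrated with a short sketch. It is not pandoc's actual Haskell code; escape_link_text_commas is a hypothetical name, and the use of the decimal reference &#44; (the comma is code point 44) is an assumption about the exact entity form.

```python
def escape_link_text_commas(link_text: str) -> str:
    """Replace commas in link text with a numeric character reference.

    Assumption: the decimal form &#44; is used. With no literal comma left,
    the link text can no longer be split into comma-separated attributes.
    """
    return link_text.replace(",", "&#44;")


print(escape_link_text_commas("12.1 manual, Database Backup and Recovery User's Guide"))
# 12.1 manual&#44; Database Backup and Recovery User's Guide
```

The reference is rendered as an ordinary comma in the output, so the visible link text is unchanged.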
That seems like a very reasonable approach. Nice thinking.
@jgm
I've got the Pandoc nightly version:
pandoc 2.18-nightly-2022-05-17
Compiled with pandoc-types 1.22.2, texmath 0.12.5, skylighting 0.12.3,
citeproc 0.7, ipynb 0.2, hslua 2.2.0
Scripting engine: Lua 5.4
Copyright (C) 2006-2022 John MacFarlane. Web: https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.
- For the first URL in the reference section, the adoc created by Pandoc is:
https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724[[.underline]#12.1 manual, Database Backup and Recovery User's Guide: [.enumeration_chapter]#Chapter 28# Transporting Data Across Platforms, Steps to Transport a Database to a Different Platform Using Backup Sets#]
Edit: The link text stops at the closing square bracket of [.enumeration_chapter]; the remaining text is displayed as plain text.
<p><a href="https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724"><span class="underline">12.1 manual, Database Backup and Recovery User’s Guide: [.enumeration_chapter</a>#Chapter 28</span> Transporting Data Across Platforms, Steps to Transport a Database to a Different Platform Using Backup Sets#]</p>
- The second URL looks OK now (the <u> tags / [.underline] / https://github.com/jgm/pandoc/commit/1906ae05488c7ffcc4cdd5c8f5b4fb1a2c527127)
The difference between this and the case I tested is that here the whole link is underlined.
So here we have nested spans delimited by # characters:
[.underline]#.... [.enumeration_chapter]#....# ...#
I suspect that's the problem. Asciidoctor is closing the underline span at the third #. It may be possible to escape the third # or something; perhaps @mojavelinux can illuminate this.