Character references in autolinks

Open xiaq opened this issue 3 years ago • 12 comments

The spec doesn't specify whether character references are supported inside autolinks. The following Markdown:

<aa:&#65;>

is rendered as the following by cmark:

<p><a href="aa:A">aa:A</a></p>

but as the following by commonmark.js:

<p><a href="aa:&amp;#65;">aa:&amp;#65;</a></p>
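
The two outputs above differ only in whether the character reference is decoded before the autolink is rendered. Here is a minimal Python sketch of both behaviors, not either project's actual code. `render_autolink` and its `decode_references` flag are hypothetical names. `html.unescape` stands in for CommonMark's character-reference decoding (it follows HTML5 rules, which are somewhat more lenient). Percent-encoding of the href is left out because it does not change this input.

```python
import html


def render_autolink(content, decode_references):
    """Render the text between < and > as a link (illustrative only)."""
    # Optionally resolve character references such as &#65; first.
    target = html.unescape(content) if decode_references else content
    # Escape for HTML output; a literal & becomes &amp;.
    escaped = html.escape(target, quote=True)
    return f'<p><a href="{escaped}">{escaped}</a></p>'


# Decoding on: the behavior reported for cmark.
print(render_autolink("aa:&#65;", decode_references=True))
# Decoding off: the behavior reported for commonmark.js.
print(render_autolink("aa:&#65;", decode_references=False))
```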

xiaq avatar Nov 05 '22 11:11 xiaq

Ah, I filed an issue about exactly the same problem in https://github.com/commonmark/commonmark.js/issues/263. So it seems that the intention is to support character references inside autolinks.

Maybe we can add an example to the spec with a character reference in an autolink?

xiaq avatar Nov 05 '22 11:11 xiaq

I’m pretty strongly in the camp that character references should not work in autolinks. Except for this, they work in the same places where (backslash) character escapes work. Character escapes are in the same (preliminaries) section of the spec, which has an example: https://spec.commonmark.org/0.30/#example-20.

I don’t think there should be one edge case where backslashes don’t work but character references do?

wooorm avatar Nov 05 '22 11:11 wooorm

I think the motivation was that autolinks can be URLs that you just copy from some other source, and these might contain character references.

jgm avatar Nov 06 '22 00:11 jgm

I’m not sure about that reasoning: they might as well be fine unicode, particularly when coming from an address bar. I could see problems with double decoding. But, most important for me: it has to be consistent with character escapes.

wooorm avatar Nov 06 '22 00:11 wooorm

On motivation: do you mean cmark is more in line with your motivation? That the absence in cmjs was because it was forgotten? That no test for it in the spec was intended? What do you think about the test on character escapes but no test of character references?

wooorm avatar Nov 06 '22 00:11 wooorm

Yes, in the linked issue, I said I thought that cmark was getting it right. It could be worth adding a spec example for this.

jgm avatar Nov 06 '22 00:11 jgm

I see why it would be nice if entities got resolved in exactly the places backslash escapes do -- but again, this is motivated by a desire to support URL copy-pasting.

jgm avatar Nov 06 '22 00:11 jgm

Consistency with character escapes is most important to me. If the character escapes are allowed too I am open to it. I still see a lot of inconsistency for character references in Babelmark (so good to specify whatever the choice is). Here’s a test case of several normal cases and edge cases:

a <https://example&period;com>

b <https:&sol;&sol;example.com>

c <https&colon;//example.com>

d <&#104;ttps://example.com>

e <some&period;user@example.com>

f <some.user@example&period;com>

Note that C and D are not allowed per CommonMark as the protocol (part before and including :) does not allow &, ;, #. And that E and F are not allowed per CM because neither the part before @ (ASCII atext) nor after (domain) allow ;.
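
The validity claims for cases a to f can be checked with a rough Python sketch of the autolink grammar in the 0.30 spec. The regular expressions are transcribed from the spec's definitions of absolute URIs and email addresses, and `classify` is a hypothetical helper. Case e is written here as `some&period;user@example.com`, an assumed form in which the reference sits in the part before @.

```python
import re

# Scheme: an ASCII letter, then 1-31 letters, digits, "+", "." or "-".
# After ":", any characters except ASCII control characters, space, "<", ">".
URI_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]{1,31}:[^\x00-\x20\x7f<>]*")

# Email autolink pattern: atext-like local part, "@", then dot-separated labels.
EMAIL_RE = re.compile(
    r"[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"
    r"@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*"
)


def classify(content):
    """Return "uri", "email", or None for the text between < and >."""
    if URI_RE.fullmatch(content):
        return "uri"
    if EMAIL_RE.fullmatch(content):
        return "email"
    return None


CASES = {
    "a": "https://example&period;com",
    "b": "https:&sol;&sol;example.com",
    "c": "https&colon;//example.com",
    "d": "&#104;ttps://example.com",
    "e": "some&period;user@example.com",  # assumed form of case e
    "f": "some.user@example&period;com",
}

for label, content in CASES.items():
    print(label, classify(content))
```

Under these assumptions only a and b match as autolinks, so the question of whether to decode references only arises after the scheme's colon in a URI autolink.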

wooorm avatar Nov 06 '22 10:11 wooorm

@jgm IMO there is an equally valid argument against character references if we are talking about copy-pasting: one could also copy-paste from a place that doesn't interpret character references, like the browser's URL bar, or a displayed webpage (as opposed to the HTML source).

xiaq avatar Nov 06 '22 12:11 xiaq

@xiaq - granted.

jgm avatar Nov 06 '22 17:11 jgm

Granting that there are these two possible sources for copy/paste, I think my reasoning was that if a valid character reference occurs in a copied URL, it's by far likeliest that its source is raw HTML rather than the browser's URL bar or a displayed web page. How often does one want to display something like &amp; in a URL?

jgm avatar Nov 06 '22 17:11 jgm

I mostly care about consistency, so then I’d also ask: how often does one want to display something like \?, where ? is any ASCII punctuation. If it’s consistent: I’m fine with it.

But thinking some more about this, while the motivation of “allow copy/paste” is a good one, to get there I believe we should then also allow unicode letters/punctuation in email atext, and unicode letters + at likely & + \ in email domains?

wooorm avatar Dec 26 '22 07:12 wooorm