Backtick inside "(...)" part of an inline link

Open mity opened this issue 9 years ago • 22 comments

I find the handling of inline links problematic in the following respect. Unlike for [...], the specification does not say anything about how tightly the (...) part of an inline link binds in comparison to other marks.

I would naturally expect it to have the same priority as [...]. But I noticed that at least Cmark does not handle it that way.

The issue is visible here:

[maybe link and maybe not](/url 'title with backtick`')`

(For more examples, see also posts below.)

Is this a link or not?

Cmark thinks it is a link.

My natural expectation is that it is not, as the ( and ) should have the same priority as [ and ].

Furthermore, my implementation uses three independent passes, each for a different set of marks, ordered by their precedence level, and a Cmark-like approach is very incompatible with such a design.

I see two possible solutions:

  1. The specification explicitly allows both cases. The rationale may be that mixing backticks with things inside the (...) is quite uncommon: a backtick rarely appears in a URL, and rarely in a title, as titles are usually rendered without any markup anyway; or

  2. The specification could forbid (unescaped) backticks inside the (...) portion of an inline link altogether.

mity avatar Nov 25 '16 14:11 mity

I also stumbled upon this problem and I think that the spec needs some modifications to clarify this situation.

I also didn't understand how the different parts of an inline link (namely the [...] and the (...)) can have different precedence although they are part of the same structural element. It just didn't make sense to me.

In the meantime, I think I have understood how this is supposed to work ... at least I have a theory ...

The spec somewhat implies that inline links have a lower priority than code spans, autolinks and html tags. AFAICT, that's actually not true. Inline links have the same priority as code spans, autolinks and html tags! However, there's a catch: This kind of priority doesn't concern the elements as a whole, it rather concerns the element itself plus the input character that "triggers" the creation of this element.

In case of code spans this is the (initial) backtick string, in case of autolinks and html tags, this is the opening angle bracket <. The interesting thing about the inline link (same holds for reference links and images BTW) is that it isn't triggered by the opening bracket [ but rather by the closing bracket ]! I didn't find anything about this in the main part of the spec, but the description of the parsing strategy at the end of the spec suggests this.

@mity You should parse code spans and autolinks and html tags in the same pass anyway, because whichever starts first in the input string will "win" against others of the group which come later (if they are overlapping). For the same reason, you should parse links in the same pass, just looking for the closing bracket instead of the opening one.

mgeier avatar Jun 03 '17 13:06 mgeier

You should parse code spans and autolinks and html tags in the same pass anyway

MD4C parses those at the same time. It uses three steps as follows:

  1. Entities, code spans, autolinks, inline raw HTML.
  2. Links.
  3. Emphasis and strong emphasis.

(Well, if tables or other extensions are enabled, it is a bit more complicated, but let's set that aside.)

For the same reason, you should parse links in the same pass, just looking for the closing bracket instead of the opening one.

Well. Interesting idea. But as far as I can see it is almost equivalent to parsing it in the 2nd step. The only difference is when the 2nd part of the link ((...), or [...] in the case of full reference links) overlaps with something from group 1.

I believe that can happen with backticks (which is why this issue was created) and also with raw HTML if the tag contains a potential link end, e.g. in an attribute value. I see both cases as quite a corner case.

@jgm So maybe the crucial question behind all of this is this: A. Is the ambition of the CommonMark specification to define and cover the exact behavior of any Markdown document (i.e. of any sequence of Unicode characters)? B. Or is it to define enough rules to allow reasonable authoring of documents, leaving very strange, rarely or never used details simply undefined?

In case A, I'm afraid the spec will never really be finished: it would be order(s) of magnitude longer, most of it would be about specifying deep, strange combinations of characters no sane person would use together, following it to the letter would become almost impossible for any new implementation, and the only sane way to implement it would be to rewrite Cmark line by line, because the specification would be so complicated that no one would really understand it. Imho, the specification should also be easy to understand for document authors who may have no development background.

In case B, I think the handling of conflicts between code spans or raw HTML and the 2nd part of links would be a prime candidate for the category of undefined behavior. By creating this issue, I mainly wanted to get confirmation that other people share this point of view :-)

mity avatar Jun 03 '17 15:06 mity

Is the example right here?

[maybe link and maybe not](/url 'title with backtick`')

I can't see why this would be anything other than a link (there is no closing backtick so no code span should be present)?

Perhaps some interesting examples that yield strange results, inspired by the rest of the conversation here:

[maybe link and maybe not](/url 'title with `backtick`')

[maybe link and `maybe not](/url 'title with `backtick`')

[maybe link and maybe not](/url 'title with backtick`')`

`[maybe link and maybe not](/url 'title with backtick`')

<p><a href="/url" title="title with `backtick`">maybe link and maybe not</a></p>

<p>[maybe link and <code>maybe not](/url 'title with</code>backtick`')</p>

<p><a href="/url" title="title with backtick`">maybe link and maybe not</a>`</p>

<p><code>[maybe link and maybe not](/url 'title with backtick</code>')</p>

aidantwoods avatar Jun 04 '17 18:06 aidantwoods

Is the example right here?

[maybe link and maybe not](/url 'title with backtick`')

I can't see why this would be anything other than a link (there is no closing backtick so no code span should be present)?

Agree. Given the current specs, this is imho the only one from those listed which is well-defined.

mity avatar Jun 04 '17 19:06 mity

As noted in a previous post, mixing a link with raw HTML may also be unclear:

[maybe link and maybe not](/url '<maybehtmltag>')

[maybe link and maybe not](/url '<maybehtmltag attr="')">

<maybehtmltag attr="[maybe link and maybe not](/url '">')

mity avatar Jun 04 '17 19:06 mity

IMHO, since the spec states

A link text consists of a sequence of zero or more inline elements enclosed by square brackets ([ and ]).

A link destination consists of either

  • a sequence of zero or more characters between an opening < and a closing > that contains no spaces, line breaks, or unescaped < or > characters, or

  • a nonempty sequence of characters that does not include ASCII space or control characters, and includes parentheses only if (a) they are backslash-escaped or (b) they are part of a balanced pair of unescaped parentheses that is not itself inside a balanced pair of unescaped parentheses.

Notably, link text permits inline elements to be enclosed, while link destination does not. Thus IMO if an inline element A is present in the link destination of an inline link B, and A has higher precedence than B, then A should be rendered and B should not.

In particular the code span seems to hold higher precedence than all other inlines in general, so it should always be preferred even if that means breaking the interpretation of an inline link (since the link destination does not permit sub-elements).

aidantwoods avatar Jun 04 '17 19:06 aidantwoods

For reference, the parser I'm working on (and therefore my opinion of the correct output, based on my interpretation of the spec) converts:

[maybe link and maybe not](/url 'title with backtick`')

[maybe link and maybe not](/url 'title with `backtick`')

[maybe link and `maybe not](/url 'title with `backtick`')

[maybe link and maybe not](/url 'title with backtick`')`

`[maybe link and maybe not](/url 'title with backtick`')

[maybe `link` and maybe not](/url 'title with backtick`')

to

<p><a href="/url" title="title with backtick`">maybe link and maybe not</a></p>

<p>[maybe link and maybe not](/url 'title with <code>backtick</code>')</p>

<p>[maybe link and <code>maybe not](/url 'title with </code>backtick`')</p>

<p>[maybe link and maybe not](/url 'title with backtick<code>')</code></p>

<p><code>[maybe link and maybe not](/url 'title with backtick</code>')</p>

<p><a href="/url" title="title with backtick`">maybe <code>link</code> and maybe not</a></p>

aidantwoods avatar Jun 04 '17 20:06 aidantwoods

@aidantwoods I disagree with your conclusions. The spec talks only about the priority of the link text:

  • Backtick code spans, autolinks, and raw HTML tags bind more tightly than the brackets in link text. Thus, for example, [foo`]` could not be a link text, since the second ] is part of a code span.
  • The brackets in link text bind more tightly than markers for emphasis and strong emphasis. Thus, for example, *[foo*](url) is a link.

Also most examples are not about nesting but about crossing starts/ends of elements (i.e. symbolically corresponding to this pattern: ( { ) } or in html notation <a><b></a></b>).

You quote something about link destinations, but the problematic part in most of the given examples is the link title. (Although some examples with strange destinations could also be created.)

So the priority of the (...) part of the inline link is unclear as of spec 0.27, as is the 2nd part of a full reference link:

[`x`]: /url

[a][`x`]

or

[`x]: /url

[a][`x]`

The purpose of this issue is to decide how to handle all those cases.

IMHO, the only two sane positions are either (a) to give the additional link components after the link text the same priority as the link text brackets and update the spec accordingly (and fix Cmark to follow it in all those cases); or (b) to declare it undefined behavior, so that an implementation may choose either interpretation (are those artificial examples really a problem if implementations disagree on them?).

mity avatar Jun 04 '17 21:06 mity

@aidantwoods I disagree with your conclusions. The spec talks only about the priority of the link text

You're correct there, I'm merely making an observation that code spans (mainly because of what they are meant to do) always seem to have higher priority than other inlines.

This is to say that if code spans should have higher priority than every other inline (and if they didn't, they might not be as useful), then the particular example of a code span breaking an inline link because it was contained in the link destination should follow from the rule that link destinations cannot contain inlines.

This isn't something that can be backed up by the spec at this time (because as you've pointed out, code spans only have higher precedence than the link text brackets at present).


You quote something about link destinations, but the problematic part in most of the given examples is the link title. (Although some examples with strange destinations could also be created.)

I see this as purely being an issue with the link destination, since the behaviour can be replicated with no titles:

[maybe link and maybe not](/url`backtick`)

[maybe link and `maybe not](/url`backtick`)

[maybe link and maybe not](/urlbacktick`)`

`[maybe link and maybe not](/urlbacktick`)

JS Reference parser:

<p><a href="/url%60backtick%60">maybe link and maybe not</a></p>

<p>[maybe link and <code>maybe not](/url</code>backtick`)</p>

<p><a href="/urlbacktick%60">maybe link and maybe not</a>`</p>

<p><code>[maybe link and maybe not](/urlbacktick</code>)</p>

The reference link is an interesting case too.


As said, IMO the code span should always have higher precedence than other inlines (because it's meant to be able to contain weird characters without those being interpreted as inline elements). So if the code span being recognised causes another inline to break because of overlap, or because it's in a section that can't contain any inlines then it should do just that.

(so I would vote for option (a) in your dichotomy above).

aidantwoods avatar Jun 04 '17 21:06 aidantwoods

@aidantwoods Maybe I misunderstood you. I thought you saw the described interpretation as implied by the current specification, not as a suggestion for how it should be updated.

mity avatar Jun 04 '17 22:06 mity

@mity No worries, perhaps I could have made that clearer – apologies for the ambiguity.

aidantwoods avatar Jun 05 '17 12:06 aidantwoods

@mity

For the same reason, you should parse links in the same pass, just looking for the closing bracket instead of the opening one.

Well. Interesting idea. But as far as I can see it is almost equivalent to parsing it in the 2nd step. The only difference is when the 2nd part of the link ((...), or [...] in the case of full reference links) overlaps with something from group 1.

It is not quite equivalent. If you parse in two separate steps, a code span might "hide" the closing parenthesis, e.g.

[a](b`c)d`e

In this example, the first pass would create a "code span" with the content c)d, which would have to be un-parsed in the next step, in order to "find" the closing parenthesis. Parsing the link would then "destroy" the code span. Is this how you are doing it?

It gets even worse:

[a](b`c)d`e`f`g

In this case, the first step would generate the code spans c)d and f. If you un-parse (and destroy) the first code span, you are left with f. However, an author would probably expect (and the reference implementation produces) the code span e and a stray backtick between f and g.
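A minimal Python sketch of that difference, not taken from MD4C or cmark: `find_code_spans` is a hypothetical helper that pairs each backtick string with the next backtick string of the same length and ignores escapes and every other inline construct. Running it over the whole input mimics a code-span-first pass; running it only on the text after the first closing parenthesis mimics a parser that has already consumed the link.

```python
import re


def find_code_spans(text, pos=0):
    """Return the contents of the code spans found in text[pos:].

    Simplified on purpose: no backslash escapes, no HTML, no autolinks.
    """
    spans = []
    while pos < len(text):
        if text[pos] != "`":
            pos += 1
            continue
        opener = re.match(r"`+", text[pos:]).group()
        body_start = pos + len(opener)
        # A closer is a later backtick string of exactly the same length.
        closer = next(
            (m for m in re.finditer(r"`+", text[body_start:])
             if len(m.group()) == len(opener)),
            None,
        )
        if closer is None:
            pos = body_start  # unmatched backticks stay literal
        else:
            spans.append(text[body_start:body_start + closer.start()])
            pos = body_start + closer.end()
    return spans


source = "[a](b`c)d`e`f`g"

# Code spans first, links later: the first span swallows the ")".
two_pass = find_code_spans(source)

# Link first: skip past "](...)" and only then look for code spans.
# (Hypothetical shortcut: the destination ends at the first ")".)
after_link = source.index(")", source.index("](")) + 1
same_pass = find_code_spans(source, after_link)

print(two_pass)   # ['c)d', 'f']
print(same_pass)  # ['e']
```

Simply deleting the first span from the two-pass result leaves f, not e, which is the mismatch described above.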

This problem is not limited to overlapping markup, it also occurs if a full structural element is inside the URL part of the link, e.g.:

[a](b`c`d)

Again, in this case the code span c would have to be un-parsed in order to create the link destination "b`c`d".

The same can of course happen with autolinks, that's what led me to report issue #472, which can actually be explained with my comment above (https://github.com/jgm/CommonMark/issues/439#issuecomment-305976350).

If you parse for links in the same step as HTML etc. as I suggested above, all those problems go away. I'm quite sure the spec wants to suggest exactly this behavior, but it doesn't quite manage to bring it across.

I see both cases as quite a corner case.

Yes, all those are definitely corner cases. But IMHO the spec should be able to handle them unambiguously.

In case A, I'm afraid the spec will never really be finished: it would be order(s) of magnitude longer, most of it would be about specifying deep, strange combinations of characters no sane person would use together, following it to the letter would become almost impossible for any new implementation,

I don't think so. I think it's actually easier to implement if there are clear instructions what a parser should do. Implementers are still not forced to implement it in any given way, they can do it however they please, as long as it behaves "as if" it would be implemented in the way the spec prescribes.

and the only sane way to implement it would be to rewrite Cmark line by line, because the specification would be so complicated that no one would really understand it.

I don't agree. With my suggestion above (https://github.com/jgm/CommonMark/issues/439#issuecomment-305976350) I'm actually implementing a parser without looking at Cmark or commonmark.js and not even following the suggested parsing strategy from the spec and I'm quite convinced that it shows the same behavior as the reference implementation.

@aidantwoods

Notably, link text permits inline elements to be enclosed, while link destination does not.

I think that's the wrong way to look at it (I'm talking about the second part of the sentence).

I agree that link text should allow inline elements to be enclosed, but the link destination doesn't really care if something looks like an inline element or not. It just parses the raw input characters.

That's one of the reasons why parsing links should be part of the first step of inline parsing, as I mentioned above.

In particular the code span seems to hold higher precedence than all other inlines in general, so it should always be preferred even if that means breaking the interpretation of an inline link (since the link destination does not permit sub-elements).

Well there are two conflicting objectives:

  • code spans should trump everything, in order to be able to contain any kind of less-powerful markup
  • link destinations should be able to contain any character (with a few exceptions that are not important here)

In cases where those two rules clash, there has to be some tie breaker. And I think it makes sense that whatever starts first in the input sequence, breaks the tie (same as if a code span and an HTML tag are clashing).

Now comes the crucial point: where do those things start? I would say that a code span starts with the opening backtick string and a link destination starts with the sequence ]( (plus optional whitespace). If you look at it like this, it's clear in which cases the code span wins and in which cases the link destination (and therefore the full inline/reference link) wins. And it's not ambiguous anymore.

This is also exactly the current behavior of the reference implementation. The only problem is that the spec isn't clear enough in explaining it.
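Read as an algorithm, this tie-breaker is a single left-to-right scan in which whichever trigger comes first claims the characters that follow. A rough Python sketch under heavy assumptions: `scan` is a hypothetical function, it does not check for the opening `[`, and it ignores titles, nested parentheses, escapes, HTML and autolinks.

```python
import re


def scan(text):
    """One left-to-right pass; returns (kind, content) for each construct."""
    found = []
    pos = 0
    while pos < len(text):
        if text[pos] == "`":
            run = re.match(r"`+", text[pos:]).group()
            body_start = pos + len(run)
            # Closer: a backtick string of the same length, not part of a longer one.
            closer = re.search(r"(?<!`)" + re.escape(run) + r"(?!`)",
                               text[body_start:])
            if closer:
                found.append(("code", text[body_start:body_start + closer.start()]))
                pos = body_start + closer.end()
            else:
                pos = body_start  # literal backticks
        elif text.startswith("](", pos):
            end = text.find(")", pos + 2)
            if end != -1:
                # The destination is taken as raw characters, backticks included.
                found.append(("link-destination", text[pos + 2:end]))
                pos = end + 1
            else:
                pos += 2
        else:
            pos += 1
    return found


print(scan("[a](b`c)d`e"))   # [('link-destination', 'b`c')]
print(scan("[x `y](z`)"))    # [('code', 'y](z')]
```

In the first input the `](` comes before any backtick, so the destination wins; in the second the backtick comes first, so the code span wins.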

The same rule can be applied to full reference links, except that instead of a link destination we are looking for a link label now (which can also include arbitrary characters, except unescaped brackets). Other than that, it works fine and is unambiguous.

But coming back to the un-parsing problem I mentioned above ... there is still something I find a bit strange: it hasn't been mentioned above yet, but IMHO the only strange situation appears with collapsed and shortcut reference links. There, the label (which may contain arbitrary characters) appears inside of the initial (or only) pair of brackets, e.g.:

[a`b`c]: url

text [a`b`c][] text.

Let's assume that code spans and links are parsed in the same step and links are "created" when encountering the first closing bracket (as I suggested above). This means that "`b`" is first converted to a code span containing b. Then the first closing bracket is encountered which may create a collapsed reference link. But in order to check the link label, the parser has to un-parse the code span in order to reveal the original character sequence "a`b`c", which turns out to indeed be an existing link label.

I find this un-parsing a bit awkward, but there's not really a problem implementing it.
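One way to make that un-parsing cheap, sketched in Python under the assumption that the parser keeps the source offsets of every bracket (the function names are hypothetical): slice the label out of the original input instead of reconstructing it from the parsed nodes.

```python
def normalize_label(raw):
    """Trim, collapse internal whitespace and case-fold a link label."""
    return " ".join(raw.split()).casefold()


def resolve_collapsed_reference(source, open_pos, close_pos, definitions):
    """Look up the raw text between '[' and ']' among the link definitions.

    The lookup uses the original characters, not the parsed inline nodes,
    so a code span inside the brackets does not get in the way.
    """
    raw_label = source[open_pos + 1:close_pos]
    return definitions.get(normalize_label(raw_label))


# Definition "[a`b`c]: url", stored under its normalized label.
definitions = {normalize_label("a`b`c"): "url"}

source = "text [a`b`c][] text."
open_pos = source.index("[")
close_pos = source.index("]")
print(resolve_collapsed_reference(source, open_pos, close_pos, definitions))  # url
```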

mgeier avatar Jun 06 '17 14:06 mgeier

@mgeier

[a](b`c)d`e

In this example, the first pass would create a "code span" with the content c)d, which would have to be un-parsed in the next step, in order to "find" the closing parenthesis. Parsing the link would then "destroy" the code span. Is this how you are doing it?

No. MD4C would not destroy the code span. It would not recognize the link. It translates it to this:

<p>[a](b<code>c)d</code>e</p>

AFAICS, it is not against the specs as of now, and this behavior also has its merit.

In case A, I'm afraid the spec will never really be finished: it would be order(s) of magnitude longer, most of it would be about specifying deep, strange combinations of characters no sane person would use together, following it to the letter would become almost impossible for any new implementation,

I don't think so. I think it's actually easier to implement if there are clear instructions what a parser should do. Implementers are still not forced to implement it in any given way, they can do it however they please, as long as it behaves "as if" it would be implemented in the way the spec prescribes.

Well, "make the spec leave no ambiguity or undefined behavior" is the kind of principle which sounds nice as long as it is just that: a generic principle. But show me not-overly-long and easy-to-understand wording for the specification which would fix just this link-related issue. (And keep in mind Markdown authors without any development skills are readers of the specification as well.)

Then I might happily agree with you (and if we ever meet I would buy you a beer for that ;-) )

If you parse for links in the same step as HTML etc. as I suggested above, all those problems go away.

Well, as said before, I'm more afraid of the problems of an overly complex specification for the sake of unimportant details, and of people losing long hours trying to understand the problem, than of how to deal with it once they finally understand it.

I simply think this problem is hard to describe, and all the participants in this thread can understand it only because we all seemingly spent considerable time on the implementation of link parsing.

So again: keep in mind the spec should be accessible to (almost) anybody. Try to explain to your grandma how to write a simple link in Markdown. And then try to explain this to her...

mity avatar Jun 06 '17 14:06 mity

@mity small correction

This isn't something that can be backed up by the spec at this time (because as you've pointed out, code spans only have higher precedence than the link text brackets at present).

Turns out this isn't true (ping for #438),

Just under http://spec.commonmark.org/0.27/#example-319

Code span backticks have higher precedence than any other inline constructs except HTML tags and autolinks.

Which would mean that a link destination cannot contain something that looks like a code span, because the code span has higher precedence, and link destination cannot contain inlines.

@mgeier This would indicate that the only cases where a code span should compete based on start position are when the conflicting element is either an HTML tag or an autolink?

I agree that link text should allow inline elements to be enclosed, but the link destination doesn't really care if something looks like an inline element or not. It just parses the raw input characters.

I'd agree with this in general, though in the case that the inner text that looks like an inline element also has higher precedence than the outer one, then the inner should be considered and not just treated as raw input.

aidantwoods avatar Jun 06 '17 14:06 aidantwoods

@mgeier

~~I now spent about an hour thinking about your suggestion from implementation point of view. That solution has one property which I dislike a lot.~~

~~Right now it is possible to implement parsing of various inline elements mostly locally. (MD4C proves it as it works that way). By that I mean that with, say, autolink or code span, there are simple and independent simple state machines which typically remember last seen unresolved opener or stack of unresolved openers, and when a (matching) closer is found, the range of characters since the last/matching opener is simply called to be autolink or code span.~~

~~Function handling the state machine for an inline element X does not touch anything but state machine for element X.~~

~~But with your suggestion (if I understand it well), it is not true anymore. When ] is seen, and the parser detects if really forms part of a link, it needs to kill the special meaning of preceding openers of different types (the autolink, the code span, the raw HTML) i.e. to interfere with other state machines.~~

~~This may look as a minor detail. But it makes the code worse for maintenance and much less scalable in some sense. If you ever need to support new inline element of the same priority (consider a future version of spec, or some extension), then the seemingly unrelated state machine processing the links needs to be updated to reset the new kind of opener as well.~~

~~Forgetting to do so would work on most input unless someone feeds the parser with input similar with the examples above. It can easily slip through testing unless the test suite contains all the conflicting cases. (And the CommonMark tests do not cover it even for the links vs. code spans and links vs. raw HTML and links vs. autolinks).~~

Sorry, ignore it. This essay is invalid: there are already some cases where this is needed, and MD4C already handles them, so it is nothing new.

mity avatar Jun 06 '17 16:06 mity

@mgeier

In cases where those two rules clash, there has to be some tie breaker. And I think it makes sense that whatever starts first in the input sequence, breaks the tie (same as if a code span and an HTML tag are clashing).

You'd have to be careful applying that as a general rule, some inlines have weirder intersection rules when interacting with themselves. The first starting valid element doesn't always take precedence, consider the following:

*foo *bar baz*

_foo *bar_ baz*

Where

<em>foo *bar baz</em>

does not precede

*foo <em>bar baz</em>

in the same way that

<em>foo *bar</em> baz*

precedes

_foo <em>bar_ baz</em>

Even though different inlines started first, they have different conflict resolution algorithms based on specifics about the element.

aidantwoods avatar Jun 06 '17 17:06 aidantwoods

@mity

AFAICS, it is not against the specs as of now,

It depends on whether you consider the "parsing strategy" part of the spec or not.

and this behavior also has its merit.

From an implementor's POV, probably (but not necessarily). But also from a user's POV?

The problem is that a single backtick in a link destination or link title might interact with another backtick somewhere far away (assuming quite a long paragraph).

Sure, it's an exotic error, but it's still one more error than if link destinations have the same priority as code spans.

I still think this is open for discussion and both the spec and the reference implementations might be changed, but I think we should definitely aim for an unambiguous behavior for spec 1.0.

If it is decided to favor code spans always before any part of a link, that's not a big deal for me, and it's a minor change in my implementation. It would be a much bigger simplification though, if we could also get rid of the un-parsing of link labels that I mentioned above (but I don't think that's possible).

But show me not-overly-long and easy-to-understand wording for the specification which would fix just this link-related issue.

It is already described in the parsing strategy: http://spec.commonmark.org/0.27/#an-algorithm-for-parsing-nested-emphasis-and-links

It sure can be written more concisely and in a more general way so that extensions can have well-defined behavior, too. And I will make a suggestion eventually (if nobody beats me to it), but first I have to make my own parser feature-complete (to be sure that I'm not missing anything).

@aidantwoods

You'd have to be careful applying that as a general rule, some inlines have weirder intersection rules when interacting with themselves. The first starting valid element doesn't always take precedence

That's exactly what I'm trying to say: the "active" character doesn't have to be the first character of an inline element!

For code spans, HTML tags and autolinks it happens to be the first character, but for links it is the first closing bracket and for emphasis it is any "potential closer" (which can be either a sequence of stars or underscores). Any extension that wants to take part in the priority game, can choose their "active" character(s) (or sequence(s) thereof) however they want.

mgeier avatar Jun 07 '17 14:06 mgeier

and this behavior also has its merit.

From an implementor's POV, probably (but not necessarily). But also from a user's POV?

The problem is that a single backtick in a link destination or link title might interact with another backtick somewhere far away (assuming quite a long paragraph).

The purpose of a code span is to contain code. Quite often, the code may contain characters which would otherwise be very meaningful to a Markdown parser, including what may look like the end of a link.

Also note there is no way to escape anything inside a code span, while it is possible to escape a backtick everywhere else (it may be replaced with %60 in a link destination or escaped with \ in a link title).

From this POV, even for the user, it may be good to give code spans higher priority than most other inline elements, especially ones with complicated syntax such as links.
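For illustration, both escapes can be applied mechanically. This is a small Python sketch; the helper names `escape_destination` and `escape_title` are made up here, and only the backtick is handled.

```python
def escape_destination(url):
    """Percent-encode backticks so they cannot pair up with a later backtick."""
    return url.replace("`", "%60")


def escape_title(title):
    """Backslash-escape backticks inside a link title."""
    return title.replace("`", "\\`")


url = escape_destination("/url`backtick`")
title = escape_title("title with backtick`")
print(f"[text]({url} '{title}')")
# [text](/url%60backtick%60 'title with backtick\`')
```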

mity avatar Jun 07 '17 14:06 mity

@mgeier

That's exactly what I'm trying to say: the "active" character doesn't have to be the first character of an inline element!... for emphasis it is any "potential closer" (which can be either a sequence of stars or underscores).

This won't work for emphasis because the "potential closer" can be the same character for multiple different valid emphases. The way in which emphasis is chosen is heavily dependent on how the emphases intersect.

Possibly we should steer away from talk of specific ways of parsing, though, because as soon as you make parsing details part of the spec you've picked a specific implementation. I would prefer to focus on what should take priority, and in which cases; then different implementations can decide the best way to do that in practice.

aidantwoods avatar Jun 07 '17 18:06 aidantwoods

@mity

The purpose of a code span is to contain code. Quite often, the code may contain characters which would otherwise be very meaningful to a Markdown parser, including what may look like the end of a link.

And that's not a problem, except in totally contrived examples. It's actually quite hard to come up with such a situation, did you try?

You need all of this to happen accidentally, all in the same paragraph, in this order:

  • an unmatched, unescaped opening bracket that's not in a code span: [
  • the exact sequence ]( (also unescaped and outside of a code span), possibly followed by whitespace
  • immediately following that, your code span including the closing parenthesis )
    • the code span may only contain one space immediately before the ), except in a valid link title markup that happens to be there (if the spec is changed to allow spaces in link destinations, there may be more spaces)

Note that the ]( will be very close to your code span, therefore it will be quite easy to see what's going wrong.

I could come up with this example:

text [text
text text
text text
text text
text text text ](
`this-is-the-character-at-the-end-of-a-link-----> )`

Another, very similar example:

text [text
...
text text text ](
`)` is a beautiful punctuation mark.

I consider those examples very unrealistic, can you come up with a more realistic one?

I think it's more realistic that your suggestion goes wrong, here's one example:

[backtick](http://unicode.org/cldr/utility/character.jsp?a=`)
text text
text text
text text
text text
text `code` text

It only needs, all in the same paragraph, in this order:

  • a link destination containing an unmatched backtick: `
  • a code span

Those things can be very far apart in a long paragraph, making the error hard to see.

And sure, the author could manually replace the backtick with %60, but it would still be nicer if it didn't have to be replaced. Ideally, you should be able to paste URLs from the browser without modifications.

Also note there is no way to escape anything inside a code span

And it's really not needed. You only need to violate one of the points I mentioned above, e.g. you could backslash escape the [ or the ](.

@aidantwoods

[...] the "active" character doesn't have to be the first character of an inline element! [...] for emphasis it is any "potential closer" (which can be either a sequence of stars or underscores).

This won't work for emphasis because the "potential closer" can be the same character for multiple different valid emphases. The way in which emphasis is chosen is heavily dependent on how the emphases intersect.

It will work and it does work. That's what the reference implementations do. It also works in more modular and ~exception~ extension-friendly implementations, like the one I'm currently working on.

The "potential closer" is matched with the innermost matching "potential opener", then with the next one outside of it, then with the next one and so on until there are no delimiters left.
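A heavily simplified sketch of this closer-driven matching follows. It is not the reference implementation: the name `match_emphasis` is hypothetical, only single `*` delimiters are handled, and the flanking rules are reduced to "followed by a non-space character" (potential opener) and "preceded by a non-space character" (potential closer). It shows only the first step, matching a closer with the innermost opener.

```python
def match_emphasis(text):
    """Return the emphasized substrings found by closer-driven matching.

    Simplifying assumptions (not the full spec rules): only single `*`
    delimiters are handled; a `*` followed by a non-space character is a
    potential opener; a `*` preceded by a non-space character is a
    potential closer.
    """
    openers = []  # stack of positions of unmatched potential openers
    found = []
    for pos, ch in enumerate(text):
        if ch != "*":
            continue
        before = text[pos - 1] if pos > 0 else " "
        after = text[pos + 1] if pos + 1 < len(text) else " "
        can_close = not before.isspace()
        can_open = not after.isspace()
        if can_close and openers:
            # Match the closer with the innermost (most recent) opener.
            start = openers.pop()
            found.append(text[start + 1:pos])
        elif can_open:
            openers.append(pos)
        # Otherwise the delimiter stays literal.
    return found


print(match_emphasis("*foo *bar baz*"))
print(match_emphasis("*foo bar* baz*"))
```

In the first input the closer picks the inner opener and the first star stays literal; in the second, the first closer consumes the only opener and the last star stays literal.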

Possibly we should stray away from talks on specific ways of parsing though, because as soon as you make parsing details part of the spec you've picked a specific implementation.

The spec could probably describe a "prototypical parser" and different implementations can "emulate" it however they want. But this is not the issue here, this IMHO deserves a separate issue or a discussion on https://talk.commonmark.org/.

Here we should discuss if backticks should always win or if link destinations and link titles should have the same priority. I think the arguments so far brought up here speak for the latter, but I'm eager to hear more arguments.

mgeier avatar Jun 08 '17 10:06 mgeier

@mgeier

[...] the "active" character doesn't have to be the first character of an inline element! [...] for emphasis it is any "potential closer" (which can be either a sequence of stars or underscores).

This won't work for emphasis because the "potential closer" can be the same character for multiple different valid emphases. The way in which emphasis is chosen is heavily dependent on how the emphases intersect.

It will work and it does work. That's what the reference implementations do. It works also in more modular and exception-friendly implementations, like the one I'm currently working at.

The "potential closer" is matched with the innermost matching "potential opener", then with the next one outside of it, then with the next one and so on until there are no delimiters left.

The extra bit tacked on at the end there demonstrates that it doesn't work on its own – you're using an extra condition to choose which emph to render.

i.e.

*foo *bar baz*

Whether or not you see two emphs here (start looking at the opener), and pick one, or start at the closer and work backwards to find the opener is implementation specific.

If you do the latter, then you'll have to pick one between these, in which case – sure prioritising on the closer makes sense:

*foo bar* baz*

Just for the sake of getting back on topic 😉 I don't think we're in disagreement about whether this is implementation specific, are we? That's my main point here, and really the only point of the example.


The spec could probably describe a "prototypical parser" and different implementations can "emulate" it however they want. But this is not the issue here, this IMHO deserves a separate issue or a discussion on https://talk.commonmark.org/.

It could. In some respects it might be much easier to catch problems if it was written this way, though the current way it is written also has its benefits. And yes, I would also agree that it deserves its own discussion. If the spec were adjusted in this way, then talking about implementation details like an "active character" would make sense, but only in this case (which is a separate discussion as you've said).

Here we should discuss if backticks should always win or if link destinations and link titles should have the same priority.

Yup. Agree with this 😄

I think the arguments so far brought up here speak for the latter, but I'm eager to hear more arguments.

The former (i.e. code spans are more important) is my opinion (and, as pointed out, it is the current position of the spec too):

Code span backticks have higher precedence than any other inline constructs except HTML tags and autolinks.

IMO there are two reasons for this:

  1. Code spans almost necessarily contain weird combinations of characters. I think this should hold more weight in general than the fact that the examples are contrived.
  2. Because code spans are meant to contain everything you put in them, literally, no escaping is supported within them. This means that to break an inline of equal or higher precedence that started before the code span, you can only go back and escape one of the important characters in the previous element (escaping the later characters would put backslashes in your code span). Every other type of inline supports escaping characters in its contents with no effect on output.

aidantwoods avatar Jun 08 '17 11:06 aidantwoods

@aidantwoods

We should move the discussion about emphasis parsing and how the spec should specify parsers without talking about parsing to a later time and a different place.

The sentence that you are quoting from the spec ("Code span backticks have higher precedence than any other inline constructs except HTML tags and autolinks") is IMHO erroneous. It directly contradicts the "parsing strategy" described at the end of the spec ~as well as examples 323 and 325 (and probably others?)~ [UPDATE: the examples are fine]. And it does not describe the behavior of the reference parsers (which apparently use this "parsing strategy"). As I mentioned in my very first comment on this issue (https://github.com/jgm/CommonMark/issues/439#issuecomment-305976350), the spec should say that inline links have the same priority (knowing that in reality it's the part with the link destinations and link titles that actually has the same priority). The exact wording for this is still to be discussed.

In the comments above, we were not talking about HTML tags and autolinks. But in order to take care of your (and @mity's) concerns above, we also have to talk about those. To address your concerns from above, the spec would have to be changed to say that code span backticks must have the highest priority always, even higher than HTML tags and autolinks. Do you really want that?

Following your reasoning above, examples 323 and 325 should change to code spans. Similar to my examples above, I've also prepared an example for this:

<foo bar="a`b"> text
text text
text text
text `code` text

@mity MD4C parses this like the reference implementation, which goes against what you have been saying above.
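A rough sketch of why the HTML example above comes out as a tag followed by a code span when inlines are scanned left to right and whichever construct starts first is tried first. Everything here is an assumption for illustration: `scan_left_to_right` and `OPEN_TAG` are made-up names, the tag pattern only approximates an open tag with double-quoted attributes, and no other inline constructs are handled.

```python
import re

# Rough approximation of an HTML open tag with double-quoted attributes only.
OPEN_TAG = re.compile(
    r'<[A-Za-z][A-Za-z0-9-]*'
    r'(?:\s+[A-Za-z_:][\w.:-]*(?:\s*=\s*"[^"]*")?)*'
    r'\s*/?>'
)


def scan_left_to_right(text):
    """Collect ("html", ...) and ("code", ...) tokens in source order."""
    tokens = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch == "<":
            m = OPEN_TAG.match(text, pos)
            if m:
                # The whole tag is consumed, including any backtick inside it.
                tokens.append(("html", m.group()))
                pos = m.end()
                continue
        elif ch == "`":
            run_end = pos
            while run_end < len(text) and text[run_end] == "`":
                run_end += 1
            run = text[pos:run_end]
            # A closer is a backtick run of exactly the same length.
            closer = re.compile(r"(?<!`)" + re.escape(run) + r"(?!`)")
            m = closer.search(text, run_end)
            if m:
                tokens.append(("code", text[run_end:m.start()]))
                pos = m.end()
                continue
            pos = run_end  # no closer: the run stays literal
            continue
        pos += 1
    return tokens


example = '<foo bar="a`b"> text\ntext text\ntext text\ntext `code` text'
print(scan_left_to_right(example))
```

Because the `<` is reached before the backtick in the attribute value, the tag wins and the later backticks pair up as intended. A backtick-first pass would instead pair the attribute's backtick with the opening backtick of the code span.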

@mity and @aidantwoods: If you really mean it, you should also request that backticks get higher priority than HTML tags and autolinks.

@aidantwoods

Code spans almost necessarily contain weird combinations of characters

Yes, I acknowledge that. But it's not a problem, as I showed with my examples above. It's not a problem with regards to HTML tags or autolinks, nor is it a problem for inline/reference links.

Note that URLs and HTML attributes may contain weird characters, too. I know they can be escaped, but in real life people probably don't do that all the time, and browsers don't necessarily do it if you copy a link from them.

And I hope that I could show with my examples that a strange character in a URL (or some HTML attribute) is far more likely to cause unexpected results.

Every other type of inline supports escaping characters in its contents with no effect on output.

I acknowledge that, too. But again, as I showed in my example above, it's not a problem. Those cases are extremely unlikely (very close to impossible) to occur in real texts. And as you say yourself (and as I said above: https://github.com/jgm/CommonMark/issues/439#issuecomment-307070791), escaping characters inside code spans isn't necessary in those rare cases, because you can just escape the offending characters that come before the code span (as I mentioned above, those are [ or ](. Let me add ][ here for the case of full reference links, which I forgot above, and < or : or = or " for the case of HTML tags and autolinks).

mgeier avatar Jun 09 '17 10:06 mgeier