commonmark.js icon indicating copy to clipboard operation
commonmark.js copied to clipboard

Incorrect handling of tabs in link resources compared to spaces, newlines

Open wooorm opened this issue 4 years ago • 13 comments

  • Spec currently defines whitespace can exist, and in some cases must exist around the destination and title, in the parens: https://spec.commonmark.org/0.29/#inline-link
  • I clarified the wording recently in https://github.com/commonmark/commonmark-spec/pull/618
  • But whitespace was used since the start: https://github.com/commonmark/commonmark-spec/blame/858a28941d0dd17c24b7240f21372652111bd38b/spec.txt#L7495

The bug probably stems from here: https://github.com/commonmark/commonmark.js/blob/8c698a295f7cea58597769ccfaab2219add45e44/lib/inlines.js#L651, which seems to include spaces and newlines but not tabs.


This bug can be reproduced with the following permalink to the dingus: https://spec.commonmark.org/dingus/?text=tab%3A%20%5Bx%5D(%09y)%0Aspace%3A%20%5Bx%5D(%20y)%0A%0Atab%3A%20%5Bx%5D(y%09)%0Aspace%3A%20%5Bx%5D(y%20)%0A%0Atab%3A%20%5Bx%5D(%09%3Cy%3E)%0Aspace%3A%20%5Bx%5D(%20%3Cy%3E)%0A%0Atab%3A%20%5Bx%5D(y%09%22z%22)%0Aspace%3A%20%5Bx%5D(y%20%22z%22)%0A.

The expected behavior is that tabs and spaces behave the same.

wooorm avatar May 28 '20 09:05 wooorm

It looks like all that would be required is to change one line in inlines.js: https://github.com/commonmark/commonmark.js/blob/master/lib/inlines.js#L74

var reSpnl = /^ *(?:\n *)?/;

to

var reSpnl = /^[ \t]*(?:\n[ \t]*)?/;

but I'v not built commonmark.js before, so I'm not sure how to go about testing it.

mikeando avatar Jun 23 '22 13:06 mikeando

Actually, that new regex only covers this one case, but the 0.30 spec says we should interpret "whitespace" as:

A Unicode whitespace character is any code point in the Unicode Zs general category, or a tab (U+0009), line feed (U+000A), form feed (U+000C), or carriage return (U+000D).

so while we're in here we should add those.

mikeando avatar Jun 23 '22 13:06 mikeando

It looks like all of the whitespace definitions in that part of the code could do with a cleanup.

var reWhitespaceChar = /^[ \t\n\x0b\x0c\x0d]/;
var reUnicodeWhitespaceChar = /^\s/;
var reFinalSpace = / *$/;
var reInitialSpace = /^ */;
  • reWhitespaceChar includes (U+000B), which it should not.
  • reUnicodeWhitespacChar \s includes (U+000B), which it should not - and I believe the regex is not unicode aware.
  • Final space and initial space should also allow other space characters.

I'm not even sure that we need to differentiate between unicode-white-space and white-space, the spec seems to have changed between 0.29 and 0.30 to only have the definition for unicode-white-space.

mikeando avatar Jun 23 '22 13:06 mikeando

Where did you see that this case should be covered by unicode whitespace? I don’t believe it is. I believe only emphasis/strong use unicode whitespace.

wooorm avatar Jun 23 '22 13:06 wooorm

Where did you see that this case should be covered by unicode whitespace?

The spec says "white-space" and the only definition for whitespace is now "unicode whitespace". In 0.29 there was a definition for whitespace in the same location, but it was removed.

Maybe ascii-whitespace was intended, and the definition should not have been removed... IIRC in 0.29 whitespace did include (U+000B) so the changes would be fewer.

mikeando avatar Jun 23 '22 13:06 mikeando

In 0.29 it says whitespace. In 0.30 it doesn’t. I believe I fixed that confusion. Or, where do you see whitespace?

wooorm avatar Jun 23 '22 13:06 wooorm

You're right the spec does explicitly call it out as spaces/tabs in links.

These four components may be separated by spaces, tabs, and up to one line ending.

mikeando avatar Jun 23 '22 13:06 mikeando

In that case, my initial fix should be sufficient I think.

I'm miss-remembering from some other issues I've been digging into for space handling in HTML elements... sorry.

mikeando avatar Jun 23 '22 13:06 mikeando

Yeah. I also found this confusing. That’s why I fixed it :) In markdown, only [ \t], and in some cases \r\n|\n|\r are considered as “whitespace”. In markdown, only ASCII punctuation is considered as punctuation. Except for emphasis/strong, which checks for unicode whitespace and unicode punctuation.

wooorm avatar Jun 23 '22 13:06 wooorm

This is wandering off-topic a little - but this is enlightening for me.

Does this mean the current parsing of <T\b> as an HTML entry is wrong? The \b cant be part of the tag-name (since it can only be ASCII digits, letters or hypen). It similarly can't form an attribute-name as that must start with an ascii letter, and it is also not a space or tab which is allowed between the elements.

cmark, commonmark.js and comrak (rust parser) all accept it as HTML though...

mikeando avatar Jun 23 '22 14:06 mikeando

I believe that cmark, commonmark.js, and comrak, are incorrect for accepting it as HTML according to the CommonMark spec. Note though, that Unicode is allowed in markdown. There are also cases where any character is allowed. That is to say, certain states check for X characters and so something, and then for Y characters to do something else, and finally allow any other character. The unquoted attribute value is such a case. Though, your case can’t be an unquoted attribute value, as there is not attribute name.

wooorm avatar Jun 23 '22 14:06 wooorm

@mikeando want to submit a PR for your simple fix above?

jgm avatar Jun 23 '22 16:06 jgm

@jgm PR here: https://github.com/commonmark/commonmark.js/pull/259

mikeando avatar Jun 23 '22 23:06 mikeando