markdown-tm-language

Syntax highlight breaks inside CommonMark list syntax (`1.`s) when surrounding render-unaffective indentation is not 0.

Open RokeJulianLockhart opened this issue 7 months ago • 12 comments

Per the downstream report: ^1

Describe the bug

  1. Source

    When nested lists are represented in source format:

    # Addition and Subtraction of Binary
    
    1.	## Questions
    
    	1.	1.	## Comprehension
    

    ...they're not syntax-highlighted consistently:

    1. Image

    2. Image

  2. Rendered

    However, it's valid; it even renders:

    1. Image

    2. Image

Expected behaviour

The `#` and `##` should remain highlighted as `<h[1-2]>`s, and the `1.`s should remain highlighted as `<li>`s.

Related discussion

vscode/issues/248017#issuecomment-2848330308

Additional notes

@RedCMD

RokeJulianLockhart avatar May 03 '25 14:05 RokeJulianLockhart

lists indented with tabs are currently broken in this grammar

1. # no-indent
    1. # Spaces
2. # no-indent
	1.	#	Tab
  1. no-indent

    1. Spaces

  2. no-indent

    1. Tab

Image

RedCMD avatar May 03 '25 23:05 RedCMD

https://github.com/wooorm/markdown-tm-language/issues/13#issuecomment-2848862087

@RedCMD, I didn't actually know that a tab could replace a space after a heading designator. Can a tab replace a space anywhere in CommonMark?

RokeJulianLockhart avatar May 04 '25 08:05 RokeJulianLockhart

both github and vscode seem to support it, so I updated the above comment

RedCMD avatar May 04 '25 08:05 RedCMD

PR welcome, but I would strongly recommend against using hard tabs in markdown. Tabs are good when whitespace does not matter. Tabs do not work well when whitespace does matter. And markdown is a whitespace sensitive language. Markdown has a hardcoded tab size of 4. The whole point of tabs is for it to be different than a hardcoded value.

wooorm avatar May 06 '25 08:05 wooorm

This may also be impossible with textmate grammars. I have super serious doubts that regexes can match the logic that is needed for markdown here.

wooorm avatar May 06 '25 08:05 wooorm

> Markdown has a hardcoded tab size of 4. The whole point of tabs is for it to be different than a hardcoded value.

@wooorm, that differs. Sometimes the length is significantly longer:

1. This is a valid 3-space list, as is: [^citation_name]

   * This 2-space cutie. [^small]

     Hey!

[^citation_name]: This is the first line.

                  This is the second line.

[^small]: This is the first line.

          This is the second line.

It is in complex markup when the tab demonstrates its worth, since I can use one tab for all of these situations, and it's entirely valid markup.

RokeJulianLockhart avatar May 06 '25 10:05 RokeJulianLockhart

@wooorm https://github.com/microsoft/vscode-markdown-tm-grammar handles it correctly

Image

is it not as simple as replacing ` ` => `[ \t]`?

vs Github:

1. # no-indent
    1. # Spaces
2. # no-indent
	1.	#	Tab

RedCMD avatar May 06 '25 10:05 RedCMD

> It is in complex markup when the tab demonstrates its worth, since I can use one tab for all of these situations, and it's entirely valid markup.

Your examples have no tabs. I do not understand them. Half of it is footnotes, which is a GFM feature very different from lists. Please read the markdown spec 2.2 on tabs: https://spec.commonmark.org/0.31.2/#tabs. Please also see 5.2 list items: https://spec.commonmark.org/0.31.2/#list-items. It is very complex.

> handles it correctly

It handles this example “correctly” because it handles many cases incorrectly.

> is it not as simple as replacing ` ` => `[ \t]`?

No, it very much is not that. See https://github.com/wooorm/markdown-tm-language/blob/c78b1e5df644d24fa76716bbe26f4b48a6fc1610/grammar.yml#L863 and the many lines under it.

wooorm avatar May 06 '25 14:05 wooorm
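
One way to see why a plain ` ` => `[ \t]` substitution may not be enough is to compare what a pattern accepts with how many columns the accepted whitespace covers. The sketch below is an illustration only: both patterns are hypothetical and are not taken from `grammar.yml`, and the helper names are made up. It assumes the rule from the CommonMark spec that tabs advance to the next multiple of 4 columns.

```python
import re

TAB_STOP = 4  # CommonMark's fixed tab stop

# Hypothetical pattern meaning "at most three columns of indentation".
SPACES_ONLY = re.compile(r" {0,3}")
# The same pattern after a naive " " => "[ \t]" replacement.
NAIVE_TABS = re.compile(r"[ \t]{0,3}")


def columns(whitespace: str) -> int:
    """Width of the whitespace in columns, with tabs expanded to tab stops."""
    return len(whitespace.expandtabs(TAB_STOP))


def naive_accepts(whitespace: str) -> bool:
    """True if the naively rewritten pattern matches the whole string."""
    return NAIVE_TABS.fullmatch(whitespace) is not None


for sample in ["   ", "\t", " \t", "\t\t\t"]:
    print(repr(sample), naive_accepts(sample), columns(sample))
```

The rewritten pattern counts characters, not columns, so it accepts three tabs (12 columns) as if they were "at most three" units of indentation. A regex has no direct way to express "up to N columns" once a single character can be 1 to 4 columns wide depending on its position.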

> Your examples have no tabs. I do not understand them.

@wooorm, with tabs, they would be:

1.	This is a valid 3-space list, as is: [^citation_name]

	*	This 2-space cutie. [^small]

		Hey!

[^citation_name]:	This is the first line.

	This is the second line.

[^small]:	This is the first line.

	This is the second line.

...rendered as:

Image

> It handles this example “correctly” because it handles many cases incorrectly.

Does this situation directly relate to those unstated examples?

RokeJulianLockhart avatar May 06 '25 15:05 RokeJulianLockhart

Thanks for providing an example with tabs. Still, half of it is about footnotes, which are different and unrelated to this issue. Please always remove every unrelated character from example cases. Secondly, I already provided all the sources for why you should stop using tabs and why this cannot be implemented correctly. But I will try to walk you through them.

It would be good to look at what that first tab means: how “big” is it? That can be visualized as such:

1.	a

        b

       c

      d

     e

    f

   g

  h

 i

j

Yields:

  1. a

     b
    
    c
    

    d

    e

    f

    g

h

i

j

Note that b becomes indented code (because 8 spaces); g/h/i/j become paragraphs (because less than 4). Now, I ask you to replace that one tab with spaces. Try 1 space. Try 2, 3, 4, 5 spaces. Also try with a tab but spaces before the `1.`. What happens then?

Importantly, also try the different syntax highlighters.

I hope this gives you a better mental model of the complexity of the whitespace-sensitive markdown parser, and the magic value of 4.

wooorm avatar May 06 '25 17:05 wooorm
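
The experiment in the comment above can be approximated in a few lines. This is a minimal sketch, not a Markdown parser: it assumes CommonMark's tab stop of 4 and the list-item rule that content starts after the marker plus 1 to 4 columns of whitespace, and it only classifies plain text lines that follow a blank line and are indented with spaces. The function names `content_offset` and `classify` are made up for illustration.

```python
import re

TAB_STOP = 4  # CommonMark expands tabs to the next multiple of 4 columns


def content_offset(first_line: str) -> int:
    """Column where the content of an ordered list item starts."""
    expanded = first_line.expandtabs(TAB_STOP)
    match = re.match(r"( {0,3})(\d{1,9}[.)])( +)", expanded)
    if match is None:
        raise ValueError("not an ordered list item with content")
    indent, marker, gap = match.groups()
    # Five or more columns after the marker count as one column;
    # the remainder then belongs to indented code.
    gap_width = len(gap) if len(gap) <= 4 else 1
    return len(indent) + len(marker) + gap_width


def classify(leading_spaces: int, offset: int) -> str:
    """Classify a text line that follows a blank line after the list item."""
    if leading_spaces < offset:
        return "paragraph outside the list"
    if leading_spaces - offset >= 4:
        return "indented code inside the item"
    return "paragraph inside the item"


offset = content_offset("1.\ta")
for letter, spaces in zip("bcdefghij", range(8, -1, -1)):
    print(letter, spaces, classify(spaces, offset))
```

For `1.\ta` the content offset is 4, so `b` (8 spaces) is classified as indented code, `c` to `f` stay inside the item, and `g` to `j` fall outside the list, which agrees with the description in the comment. Changing the first line to `1. a` moves the offset to 3 and shifts every boundary by one column, which is the sensitivity the comment asks the reader to try.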

are you saying that Markdown always treats tabs completely interchangeable with 4 spaces?

so you can mix space tab space with tab space space and space 6x etc?

RedCMD avatar May 06 '25 22:05 RedCMD
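
The mixes asked about above can be measured directly. The sketch below assumes the tab-stop behaviour described in section 2.2 of the CommonMark spec (a tab advances to the next multiple of 4 columns) and uses Python's `str.expandtabs`, which follows the same rule; `indent_columns` and `MIXES` are illustrative names.

```python
TAB_STOP = 4  # CommonMark's fixed tab stop


def indent_columns(prefix: str) -> int:
    """Width in columns of leading whitespace, with tabs advancing to tab stops."""
    return len(prefix.expandtabs(TAB_STOP))


MIXES = {
    "space tab space": " \t ",
    "tab space space": "\t  ",
    "six spaces": " " * 6,
}

widths = {name: indent_columns(prefix) for name, prefix in MIXES.items()}
for name, width in widths.items():
    print(f"{name}: {width} columns")
```

Under that assumption, "space tab space" covers 5 columns while "tab space space" and six spaces both cover 6. A tab is therefore not a fixed stand-in for 4 spaces: its width depends on the column at which it starts.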

https://github.com/wooorm/markdown-tm-language/issues/13#issuecomment-2856229462

@RedCMD, it should treat a tab as interchangeable with 4 spaces, and has in my experience.

https://github.com/wooorm/markdown-tm-language/issues/13#issuecomment-2855381232

@wooorm, I am thankful for the effort, although I can't say that I understand those examples. Since a tab should always correspond to the default indentation width (4 spaces), its width should depend upon the context. If looking at a list of `1.`s, the user would set it to 3 em.

RokeJulianLockhart avatar May 07 '25 16:05 RokeJulianLockhart