multiply applied capture groups seem to ignore some captures

Open asottile opened this issue 5 years ago • 3 comments

a bit of an edge case, and I'm not sure how this is supposed to be handled -- I don't have a concrete use case, I'm just trying to implement my own parser in Python using this as a reference

sample grammar

{
    "scopeName": "test",
    "patterns": [
        {
            "match": "((a)) ((b) c) (d (e)) ((f) )",
            "name": "matched",
            "captures": {
                "1": {"name": "g1"},
                "2": {"name": "g2"},
                "3": {"name": "g3"},
                "4": {"name": "g4"},
                "5": {"name": "g5"},
                "6": {"name": "g6"},
                "7": {
                    "patterns": [
                        {"match": "f", "name": "g7f"},
                        {"match": " ", "name": "g7space"}
                    ]
                },
                "8": {"name": "g8"}
            }
        }
    ]
}

sample file

a b c d e f z

tokenization using VS Code

$ node vsc.js cap.json f

Tokenizing line: a b c d e f z
 - token from 0 to 1 (a) with scopes test, matched, g1, g2
 - token from 1 to 2 ( ) with scopes test, matched
 - token from 2 to 3 (b) with scopes test, matched, g3, g4
 - token from 3 to 5 ( c) with scopes test, matched, g3
 - token from 5 to 6 ( ) with scopes test, matched
 - token from 6 to 8 (d ) with scopes test, matched, g5
 - token from 8 to 9 (e) with scopes test, matched, g5, g6
 - token from 9 to 10 ( ) with scopes test, matched
 - token from 10 to 11 (f) with scopes test, matched, g7f
 - token from 11 to 12 ( ) with scopes test, matched, g7space
 - token from 12 to 14 (z) with scopes test

I expect the f to have the scopes test, matched, g7f, g8:

>>> # ...
>>> state, regions = highlight_line(compiler, state, 'a b c d e f z', first_line=True)
>>> import pprint
>>> pprint.pprint(regions)
(Region(start=0, end=1, scope=('test', 'matched', 'g1', 'g2')),
 Region(start=1, end=2, scope=('test', 'matched')),
 Region(start=2, end=3, scope=('test', 'matched', 'g3', 'g4')),
 Region(start=3, end=5, scope=('test', 'matched', 'g3')),
 Region(start=5, end=6, scope=('test', 'matched')),
 Region(start=6, end=8, scope=('test', 'matched', 'g5')),
 Region(start=8, end=9, scope=('test', 'matched', 'g5', 'g6')),
 Region(start=9, end=10, scope=('test', 'matched')),
 Region(start=10, end=11, scope=('test', 'matched', 'g7f', 'g8')),
 Region(start=11, end=12, scope=('test', 'matched', 'g7space')),
 Region(start=12, end=13, scope=('test',)))
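
For reference, here is a minimal sketch of the behaviour I would expect, written against Python's `re` module rather than Oniguruma. It is not the code behind `highlight_line` above and not how vscode-textmate works internally; `nested_capture_regions` and its parameters are hypothetical names made up for this illustration. It assumes every capture group pushes its scope onto each character it covers, in group-number order, and that a capture's nested `patterns` are applied to that capture's span before deeper groups add their own scopes.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    start: int
    end: int
    scope: tuple


def nested_capture_regions(line, root, name, pattern, captures):
    """Hypothetical helper: stack scopes from every capture group, outermost first."""
    # One scope stack per character; every character starts with the root scope.
    stacks = [[root] for _ in line]
    match = re.search(pattern, line)
    if match is not None:
        for i in range(match.start(), match.end()):
            stacks[i].append(name)
        # Groups are numbered by opening paren, so outer groups come first.
        for group in sorted(captures):
            start, end = match.span(group)
            if start == -1:  # group did not participate in the match
                continue
            rule = captures[group]
            if 'name' in rule:
                for i in range(start, end):
                    stacks[i].append(rule['name'])
            # Simplification: each nested pattern is searched independently
            # inside the capture's text; a real tokenizer picks the earliest
            # match at each position instead.
            for sub in rule.get('patterns', ()):
                for m in re.finditer(sub['match'], line[start:end]):
                    for i in range(start + m.start(), start + m.end()):
                        stacks[i].append(sub['name'])
    # Merge neighbouring characters that ended up with identical stacks.
    regions = []
    for i, stack in enumerate(stacks):
        scope = tuple(stack)
        if regions and regions[-1].scope == scope:
            regions[-1] = regions[-1]._replace(end=i + 1)
        else:
            regions.append(Region(i, i + 1, scope))
    return tuple(regions)


captures = {
    1: {'name': 'g1'}, 2: {'name': 'g2'}, 3: {'name': 'g3'},
    4: {'name': 'g4'}, 5: {'name': 'g5'}, 6: {'name': 'g6'},
    7: {'patterns': [{'match': 'f', 'name': 'g7f'},
                     {'match': ' ', 'name': 'g7space'}]},
    8: {'name': 'g8'},
}
regions = nested_capture_regions(
    'a b c d e f z', 'test', 'matched',
    '((a)) ((b) c) (d (e)) ((f) )', captures,
)
```

With the sample grammar and line from this issue, the sketch gives the `f` at 10..11 the scopes test, matched, g7f, g8, because group 8 is still applied after group 7's nested patterns have run.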

Mar 11 '20 03:03 asottile

I also tried this in TextMate, and it appears to handle this case the way you expect.

Here is the grammar converted to TextMate's format:

{	patterns = (
		{	
			match = "((a)) ((b) c) (d (e)) ((f) )";
			name = "matched";
			captures = {
				1 = { name = "g1"; };
				2 = { name = "g2"; };
				3 = { name = "g3"; };
				4 = { name = "g4"; };
				5 = { name = "g5"; };
				6 = { name = "g6"; };
				7 = {
					patterns = (
						{ match = "f"; name = "g7f"; },
						{ match = " "; name = "g7space"; },
					);
				};
				8 = { name = "g8"; };
			};
		},
	);
}

Mar 11 '20 07:03 alexdima