tree-sitter-regex
Unicode escapes
Consider this JavaScript regexp, which contains Unicode code point escapes:
const RE = /[\u005D-\uFFFF]/;
It currently parses as:
(program ; [0, 0] - [1, 0]
  (lexical_declaration ; [0, 0] - [0, 29]
    (variable_declarator ; [0, 6] - [0, 28]
      name: (identifier) ; [0, 6] - [0, 8]
      value: (regex ; [0, 11] - [0, 28]
        pattern: (regex_pattern ; [0, 12] - [0, 27]
          (pattern ; [0, 12] - [0, 27]
            (term ; [0, 12] - [0, 27]
              (character_class ; [0, 12] - [0, 27]
                (identity_escape) ; [0, 13] - [0, 15]
                (class_character) ; [0, 15] - [0, 16]
                (class_character) ; [0, 16] - [0, 17]
                (class_character) ; [0, 17] - [0, 18]
                (class_range ; [0, 18] - [0, 22]
                  (class_character) ; [0, 18] - [0, 19]
                  (identity_escape)) ; [0, 20] - [0, 22]
                (class_character) ; [0, 22] - [0, 23]
                (class_character) ; [0, 23] - [0, 24]
                (class_character) ; [0, 24] - [0, 25]
                (class_character))))))))) ; [0, 25] - [0, 26]
I would expect something more like:
(program
  (lexical_declaration
    (variable_declarator
      name: (identifier)
      value: (regex
        pattern: (regex_pattern
          (pattern
            (term
              (character_class
                (unicode_codepoint_escape)
                (class_range
                  (unicode_codepoint_escape))))))))))
It may be necessary to fall back to identity escapes in cases where the unicode (u) flag is not present.
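For reference, here is a small sketch of how a JavaScript engine itself treats these escapes with and without the u flag. It assumes an engine that implements the legacy (Annex B) regexp syntax, as browsers and Node.js do; the variable names are only illustrative. As far as I can tell, the four-digit \uXXXX form is decoded either way, and only the braced \u{...} form degrades to an identity escape when the flag is missing:

```javascript
// Four-digit \uXXXX escapes are decoded with or without the `u` flag.
const RE = /[\u005D-\uFFFF]/;
const matchesBracket = RE.test("]"); // U+005D is the lower bound of the range
const matchesLetter = RE.test("a");  // U+0061 falls inside the range
const matchesDigit = RE.test("0");   // U+0030 falls below the range

// Braced \u{...} escapes are only decoded when the `u` flag is set.
const bracedWithFlag = /\u{61}/u.test("a");

// Without the flag (legacy syntax), \u is an identity escape for "u"
// and {61} or {3} is read as a quantifier on it.
const bracedWithoutFlag = /\u{61}/.test("a");
const bracedAsQuantifier = /\u{3}/.test("uuu");

console.log({
  matchesBracket,
  matchesLetter,
  matchesDigit,
  bracedWithFlag,
  bracedWithoutFlag,
  bracedAsQuantifier,
});
```

If that holds, the identity-escape fallback would mostly matter for the braced form, while the example regexp above could produce unicode_codepoint_escape nodes regardless of flags.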
See https://github.com/bennypowers/nvim-regexplainer/issues/44