tree-sitter-regex Unicode escapes

Open bennypowers opened this issue 9 months ago • 0 comments

Consider this JavaScript regexp, which contains Unicode code point escapes:

const RE = /[\u005D-\uFFFF]/;

It currently parses as:

(program ; [0, 0] - [1, 0]
  (lexical_declaration ; [0, 0] - [0, 29]
    (variable_declarator ; [0, 6] - [0, 28]
      name: (identifier) ; [0, 6] - [0, 8]
      value: (regex ; [0, 11] - [0, 28]
        pattern: (regex_pattern ; [0, 12] - [0, 27]
          (pattern ; [0, 12] - [0, 27]
            (term ; [0, 12] - [0, 27]
              (character_class ; [0, 12] - [0, 27]
                (identity_escape) ; [0, 13] - [0, 15]
                (class_character) ; [0, 15] - [0, 16]
                (class_character) ; [0, 16] - [0, 17]
                (class_character) ; [0, 17] - [0, 18]
                (class_range ; [0, 18] - [0, 22]
                  (class_character) ; [0, 18] - [0, 19]
                  (identity_escape)) ; [0, 20] - [0, 22]
                (class_character) ; [0, 22] - [0, 23]
                (class_character) ; [0, 23] - [0, 24]
                (class_character) ; [0, 24] - [0, 25]
                (class_character))))))))) ; [0, 25] - [0, 26]

I would expect something more like:

(program
  (lexical_declaration
    (variable_declarator
      name: (identifier)
      value: (regex
        pattern: (regex_pattern
          (pattern
            (term
              (character_class
                (unicode_codepoint_escape)
                (class_range
                  (unicode_codepoint_escape))))))))))

It may be necessary to fall back to identity escapes in cases where the unicode (u) flag is not present.
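For reference, here is a small sketch of how JavaScript engines themselves treat these escapes at runtime, which may help decide when the fallback applies. As far as I understand it, the four-digit \uXXXX form is a code point escape with or without the flag, while the braced \u{...} form is only an escape under the flag; without it, web-compatible engines (Annex B behavior, as in browsers and Node) read \u as an identity escape for "u". The variable names are illustrative only and are not part of tree-sitter-regex.

```javascript
// Four-digit \uXXXX escapes: the class is the range U+005D..U+FFFF
// both with and without the `u` flag.
const legacy = /[\u005D-\uFFFF]/;
const unicode = /[\u005D-\uFFFF]/u;

// Braced \u{...} escapes: only an escape under the `u` flag.
// Without it, `\u` is an identity escape for "u" and `{61}` is a quantifier.
const bracedLegacy = /^\u{61}$/;
const bracedUnicode = /^\u{61}$/u;

const results = {
  // "a" is U+0061, inside the range in both modes.
  legacyMatchesA: legacy.test("a"),
  unicodeMatchesA: unicode.test("a"),
  // "0" is U+0030, outside the range. It would match if the class were
  // really the literal characters u, 0, 0, 5 plus a D-u range.
  legacyMatchesZero: legacy.test("0"),
  // Without the flag, the braced form matches 61 copies of "u", not "a".
  bracedLegacyMatchesA: bracedLegacy.test("a"),
  bracedLegacyMatchesUs: bracedLegacy.test("u".repeat(61)),
  bracedUnicodeMatchesA: bracedUnicode.test("a"),
};

console.log(results);
```

If that reading is right, the regexp in this issue should produce code point escape nodes regardless of the flag, and only the braced form would need the identity-escape fallback.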

See https://github.com/bennypowers/nvim-regexplainer/issues/44

May 26 '24 06:05 bennypowers
