tree-sitter
Leading whitespace becomes part of tokens
Problem
I've noticed that sometimes leading whitespace becomes part of a token, even when that token's rule doesn't include any whitespace.
For example, for sample code such as
    fo
   nil
   a :   b
the minimal grammar included below will create this tree:
(expressions [0, 0] - [3, 0]
  (identifier [0, 0] - [0, 6])
  (identifier [1, 0] - [1, 6])
  (type_decl [2, 0] - [2, 10]
    (identifier [2, 0] - [2, 4])
    (identifier [2, 6] - [2, 10])))
The first identifier is parsed as (identifier [0, 0] - [0, 6]) when it should be (identifier [0, 4] - [0, 6]). The other identifiers are in a similar situation.
While narrowing this grammar down to a minimal example, I noticed that changing the rules that contain whitespace changes this behavior. If the ' :' terminal in $.type_decl is changed to ':', then leading whitespace is no longer attached to identifiers.
The grammar I'm working on does require whitespace in some terminals (to ensure separation between syntax elements), and I think a simple case like this should be supported without requiring an external scanner.
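For comparison, here is a sketch of the variant described above, where the only change from the reproducing grammar is that the ' :' terminal becomes ':'. It is shown only to make the difference concrete. It no longer enforces a space before the colon, so it is not a fix for grammars that need that separation.

```javascript
// grammar.js (variant for comparison; loaded by the tree-sitter CLI)
module.exports = grammar({
  name: 'demo',
  rules: {
    expressions: $ => optional($._statements),
    _statements: $ => repeat1(seq($._expression, '\n')),
    _expression: $ => choice($.type_decl, $.identifier),
    // Only change: ':' instead of ' :', so no terminal starts with a space.
    type_decl: $ => seq($.identifier, ':', $.identifier),
    identifier: $ => /[a-z_]+/,
  },
})
```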
Steps to reproduce
Here's a minimal grammar that reproduces the issue:
module.exports = grammar({
  name: 'demo',
  rules: {
    expressions: $ => optional($._statements),
    _statements: $ => repeat1(seq($._expression, '\n')),
    _expression: $ => choice($.type_decl, $.identifier),
    type_decl: $ => seq($.identifier, ' :', $.identifier),
    identifier: $ => /[a-z_]+/,
  },
})
Expected behavior
The parse tree I would expect the given grammar to generate, without leading whitespace in the ranges:
(expressions [0, 4] - [3, 0]
  (identifier [0, 4] - [0, 6])
  (identifier [1, 3] - [1, 6])
  (type_decl [2, 3] - [2, 10]
    (identifier [2, 3] - [2, 4])
    (identifier [2, 9] - [2, 10])))
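As a cross-check on the expected identifier ranges, here is a small plain-JavaScript sketch that does not use tree-sitter at all. It applies the same pattern as the identifier rule to each line and reports the column range of every match. The sample string is an assumption: its leading whitespace is reconstructed from the ranges quoted in this report, and identifierRanges is a hypothetical helper name.

```javascript
// Sample input; leading whitespace is reconstructed from the reported ranges (assumption).
const source = '    fo\n   nil\n   a :   b\n';

// Same pattern as the `identifier` rule in the grammar, with the global flag for matchAll.
const IDENTIFIER = /[a-z_]+/g;

// Return [row, startColumn, endColumn] for every identifier match.
// Whitespace never matches the pattern, so it is never part of a range.
function identifierRanges(text) {
  const ranges = [];
  text.split('\n').forEach((line, row) => {
    for (const match of line.matchAll(IDENTIFIER)) {
      ranges.push([row, match.index, match.index + match[0].length]);
    }
  });
  return ranges;
}

console.log(JSON.stringify(identifierRanges(source)));
// prints [[0,4,6],[1,3,6],[2,3,4],[2,9,10]]
```

These columns agree with the identifier ranges in the expected tree, which is what one would anticipate if leading whitespace were skipped rather than folded into the token.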
Tree-sitter version (tree-sitter --version)
tree-sitter 0.24.6 (21a517c423010811147b0b1aa1e7aedc39ce91aa)
Operating system/version
Arch Linux 6.12.7
Probably the same issue, #3966
@clason Could you please explain why the bug tag was removed?
Are there any grammar-writing principles that we can follow to avoid this kind of issue?
The tag was removed because the type was added.