Leading whitespace becomes part of tokens

Open keidax opened this issue 10 months ago • 3 comments

Problem

I've noticed that sometimes leading whitespace becomes part of a token, even when that token's rule doesn't include any whitespace.

For example, for sample code such as

    fo
   nil
   a :   b

the minimal grammar included below will create this tree:

(expressions [0, 0] - [3, 0]
  (identifier [0, 0] - [0, 6])
  (identifier [1, 0] - [1, 6])
  (type_decl [2, 0] - [2, 10]
    (identifier [2, 0] - [2, 4])
    (identifier [2, 6] - [2, 10])))

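The first identifier is parsed as (identifier [0, 0] - [0, 6]) when it should be (identifier [0, 4] - [0, 6]). The other identifiers have the same problem: each range starts at column 0, or immediately after the previous token, instead of at the identifier's first non-whitespace character.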
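To make the expected start columns concrete, here is a minimal sketch in plain JavaScript (no tree-sitter involved) that counts the leading whitespace on each line of the sample. The helper name `firstTokenColumn` is hypothetical and only serves this illustration; the sample string assumes the indentation shown above (4, 3, and 3 spaces).

```javascript
// Sample input from the issue; the leading spaces are significant.
const source = "    fo\n   nil\n   a :   b\n";

// Column of the first non-whitespace character on a line, i.e. where the
// first token would start if leading whitespace were not part of it.
// (Hypothetical helper, not part of tree-sitter.)
function firstTokenColumn(line) {
  const match = line.match(/\S/);
  return match ? match.index : line.length;
}

// Drop the empty piece that split() produces after the final newline.
const lines = source.split("\n").slice(0, -1);
const startColumns = lines.map(firstTokenColumn);

lines.forEach((line, row) => {
  console.log(
    `row ${row}: reported start [${row}, 0], expected start [${row}, ${startColumns[row]}]`
  );
});
```

The computed columns correspond to the start positions of the first identifier on each row in the expected tree.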

While narrowing this grammar down to a minimal example, I noticed that the behavior depends on which rules contain whitespace. If the ' :' terminal in $.type_decl is changed to ':', leading whitespace is no longer attached to identifiers.

The grammar I'm working on does require whitespace in some terminals (to ensure separation between syntax elements), and I think a simple case like this should be supported without requiring an external scanner.

Steps to reproduce

Here's a minimal grammar that reproduces the issue:

module.exports = grammar({
  name: 'demo',

  rules: {
    expressions: $ => optional($._statements),
    _statements: $ => repeat1(seq($._expression, '\n')),
    _expression: $ => choice($.type_decl, $.identifier),
    type_decl: $ => seq($.identifier, ' :', $.identifier),
    identifier: $ => /[a-z_]+/,
  },
})

Expected behavior

I would expect the given grammar to generate the following parse tree, with no leading whitespace in the ranges:

(expressions [0, 4] - [3, 0]
  (identifier [0, 4] - [0, 6])
  (identifier [1, 3] - [1, 6])
  (type_decl [2, 3] - [2, 10]
    (identifier [2, 3] - [2, 4])
    (identifier [2, 9] - [2, 10])))

Tree-sitter version (tree-sitter --version)

tree-sitter 0.24.6 (21a517c423010811147b0b1aa1e7aedc39ce91aa)

Operating system/version

Arch Linux 6.12.7

keidax, Jan 10 '25 03:01

Probably the same issue as #3966.

blindFS, Feb 03 '25 14:02

@clason Could you please explain why the bug tag was removed? Are there any grammar-writing principles we can follow to avoid this kind of issue?

blindFS, Apr 29 '25 07:04

The tag was removed because the type was added.

clason, Apr 29 '25 07:04