tree-sitter-cli icon indicating copy to clipboard operation
tree-sitter-cli copied to clipboard

tree-sitter crashes because of U+FEFF in test file

Open th-we opened this issue 7 years ago • 3 comments

When running tree-sitter test on this repo, the command will hang and eat more and more memory until it crashes.

In the test case, there is a "ZERO WIDTH NO-BREAK SPACE" (U+FEFF) character before the test code. When removing it, the test passes.

I also tried with tree-sitter-cli compiled from the git master branch instead of from the version published on npmjs.com, but results were identical.

th-we avatar Nov 20 '18 22:11 th-we

Actually, it turns out that while the U+FEFF is a hard to spot problem, it doesn't have to be something this obscure. I added another test case, which also shows the problem:

============================================
Crashing test
============================================

{}

---

(data)

th-we avatar Nov 20 '18 22:11 th-we

I haven't had a chance yet to clone this locally and reproduce, but I think you're probably hitting https://github.com/tree-sitter/tree-sitter/issues/98. I just now remembered to change the title of that issue, since it was originally pretty obscure.

I noticed that there's what looks to be an empty string token in your grammar: https://github.com/th-we/tree-sitter-crash-demo/blob/master/grammar.js#L7. Is that intentional? The usual way to write that construct with Tree-sitter would be:

data: $ => $._quoted_value,
_quoted_value: $ => seq('"', optional($.value), '"'),
value: $ => /a[^"]*/

That way, you wouldn't get nodes of zero size.

maxbrunsfeld avatar Nov 20 '18 23:11 maxbrunsfeld

While it's true that with optional() the problem does not occur, the tree would change, i.e. there would be no value node for the empty value (""). I used to include the quotes in the value nodes, but those nodes should be the injection point for another grammar.

I see the following workarounds:

  • Include the quotes in the injected grammar. But that would be a bit like including <script> tags in the JavaScript grammar.
  • Introduce a special emtpy_value node i.e. like
    data: $ => choice($._quoted_value, $.empty_value),
    _quoted_value: $ => seq('"', optional($.value), '"'),
    value: $ => /a[^"]*/,
    emtpy_value: $ => '""'
    
    It's inconsistent to include the quotes in one case and exclude them in another, but in the grammar cson file it can at least be made sure that the quotes are always highlighted the same.

There might be more tree-sitterish ways to solve this.

And by the way: Many thanks for this work of yours. It gives me basic linting for an obscure language with a number of pitfalls, some of which I can catch using tree-sitter now.

th-we avatar Nov 21 '18 09:11 th-we