tree-sitter-typst
Bug causing memory corruption
In the Helix editor, when writing any file format that supports embedded languages, like raw blocks in Markdown or Typst:
```typst
_
-
```
It causes a segfault when the embedded Typst code contains delimited context like `_` or `*` with indentation-sensitive items like `-` or `+`.
I don't know if it happens in other editors. It seems to be caused by the external scanner when it calls `lexer->get_column()`, but the segfault does not occur during this call; it occurs outside of the scanner.
Could somebody try this in Helix on their machine to check whether it is reproducible? It would also help to test with another editor, like Neovim or Emacs.
In Emacs, markdown mode can open a code block in another buffer with the correct mode (like `rust`, `c`, `typst-ts`). The opened buffer works correctly for other tree sitter modes (syntax highlighting, indentation, etc.). However, when opening `typst-ts-mode`, it produces an error on startup:
Debugger entered--Lisp error: (wrong-type-argument stringp nil)
typst-ts-mode()
It is probably not caused by the Emacs Lisp code, but by the parser. As a result, all the features (like syntax highlighting) are gone.
The tested markdown content is:
```typst-ts
```
I tested this with Neovim (v0.10.0-dev-530 g8376e8700) on macOS, which I use regularly, and it reproduced the problem. The editor itself was killed when I tried to paste the target code into a Typst or Markdown file (reproduced in both of them).
The bug seems to be broader than just indentation in embedded Typst code. I regularly get a tree sitter failure that stops syntax highlighting for arbitrary simple files. I will try to pin down the problem, but I need a way to reproduce it systematically.
Another track I am looking at is lexer simplification. I believe the current lexer is more complex than it needs to be. The problem is that it sequentially tries to match each available token instead of looking at the next available character and pinning down the corresponding token. I will attempt a rewrite with this approach.
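To make the idea concrete, here is a minimal sketch in C of dispatching on the next character instead of trying each token matcher in turn. The token kinds and the `next_token` function are hypothetical and chosen only for illustration; they are not taken from the tree-sitter-typst scanner and do not use the Tree Sitter API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token kinds, for illustration only. */
typedef enum {
    TOK_EOF,
    TOK_EMPH,   /* _ */
    TOK_STRONG, /* * */
    TOK_ITEM,   /* - */
    TOK_ENUM,   /* + */
    TOK_TEXT    /* run of any other characters */
} TokenKind;

/* Characters that start a token of their own in this sketch. */
static int is_special(char c) {
    return c == '_' || c == '*' || c == '-' || c == '+';
}

/* Look at the next character once and branch directly to the matching
   token, rather than attempting every token matcher sequentially.
   `src` is a NUL-terminated string; `*pos` is advanced past the token. */
static TokenKind next_token(const char *src, size_t *pos) {
    assert(src != NULL && pos != NULL);
    switch (src[*pos]) {
    case '\0':
        return TOK_EOF;
    case '_':
        (*pos)++;
        return TOK_EMPH;
    case '*':
        (*pos)++;
        return TOK_STRONG;
    case '-':
        (*pos)++;
        return TOK_ITEM;
    case '+':
        (*pos)++;
        return TOK_ENUM;
    default:
        /* Consume plain text up to the next special character or the end. */
        while (src[*pos] != '\0' && !is_special(src[*pos])) {
            (*pos)++;
        }
        return TOK_TEXT;
    }
}
```

Calling `next_token` repeatedly on the same `src` and `pos` walks the input one token at a time. Each call inspects a single character before choosing a branch, so the cost of picking a token does not grow with the number of token kinds.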
Also, I want to migrate all tokenization to the external lexer. This would remove the need for `get_column()` calls (which I suspect are the cause of the bug), something that is impossible with the current approach.
It looks like with the new version of Tree Sitter, the bug has been fixed. Could someone try to reproduce it with a recent version of Tree Sitter? On my side, I could not.