Support custom text decoding function
I've been working on a text editor that handles raw binary data of arbitrary character encoding, and I'm trying to integrate tree-sitter into it for syntax highlighting. However, it seems that tree-sitter only supports UTF-8 and UTF-16. I did a bit of digging and discovered that ts_lexer__get_lookahead in lexer.c uses a UnicodeDecoderFunction to decode text, and that function type is actually general enough to be extended to a lot of other encodings. It would be great (and probably fairly simple) to expose this as part of the interface, so that users can use whichever encoding they want by providing custom decode functions.
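For illustration, such a decode callback only has to turn the next few bytes of a chunk into one code point. Here is a minimal sketch, assuming the callback keeps roughly the shape of the internal decode functions (read from a byte buffer, write one code point, return the number of bytes consumed); decode_latin1 is a hypothetical example, not anything that exists in the library:

```c
#include <stdint.h>

// Hypothetical custom decoder: ISO-8859-1 maps every byte directly to the
// code point with the same value, so each call consumes exactly one byte.
static uint32_t decode_latin1(
  const uint8_t *string,  // bytes remaining in the current chunk
  uint32_t length,        // number of bytes available
  int32_t *code_point     // out: the decoded code point
) {
  if (length == 0) return 0;
  *code_point = string[0];
  return 1;  // bytes consumed
}
```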
Yes, I think you’re right that this would make sense.
While trying to run the tests (I haven't modified the code yet), I found that they always seem to get stuck at
test language: "external_and_internal_anonymous_tokens"
example: "corpus - single-line statements - internal tokens"
example: "corpus - multi-line statements - internal tokens"
example: "corpus - single-line statements - external tokens"
I've tried the master branch as well as the 0.18.0 branch. I checked the test grammars and they seem fairly simple. Is this expected?
Also, I've had a question unrelated to this issue since I started using the library: tree-sitter uses both byte positions and line/column positions, which seems a bit redundant. I understand that some languages may be sensitive to line breaks, but line/column information does not seem necessary when making edits to the tree, especially since users can have different definitions of a line break and it can be expensive to maintain this information specifically for tree-sitter.
it always seems to get stuck at
Can you run the tests under a debugger (gdb or lldb) to determine where the main thread is blocked? For convenience, you can do script/test -g to run with a debugger, assuming you have one installed.
tree-sitter uses both byte positions and line/column positions, which seems a bit redundant.
Most applications that use Tree-sitter need to be able to query for nodes' positions (in terms of row and column), not just their byte range.
Most applications that use Tree-sitter need to be able to query for nodes' positions (in terms of row and column), not just their byte range.
I understand that this is really helpful. What I was trying to say is that tree-sitter can obtain all the information it needs to generate row/column information simply by parsing the document. The user has already supplied byte positions to the library while making an edit; they should not need to supply line/column information as well, since tree-sitter could deduce that from the information it already has.
Unfortunately, it can't. To do that, Tree-sitter would have to keep a copy of the source code, which in many apps, is undesirable from a memory and performance perspective.
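For context, this requirement is visible in the public edit API: every position is supplied in both forms. Paraphrasing the types from tree_sitter/api.h (check your header for the authoritative definitions):

```c
#include <stdint.h>

// A position expressed as row and column.
typedef struct {
  uint32_t row;
  uint32_t column;
} TSPoint;

// An edit passed to ts_tree_edit: byte offsets and points travel together,
// because the library keeps no copy of the source from which to recompute
// the rows and columns on its own.
typedef struct {
  uint32_t start_byte;
  uint32_t old_end_byte;
  uint32_t new_end_byte;
  TSPoint start_point;
  TSPoint old_end_point;
  TSPoint new_end_point;
} TSInputEdit;
```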
I see. It would still be helpful, though, to properly define what counts as a line break: currently I can't find any documentation on this. From the source code it seems that tree-sitter only recognizes \n, but some applications also use \r and \r\n.
Running script/test -g gives this error:
error: Found argument '-g' which wasn't expected, or isn't valid in this context
USAGE:
cargo.exe test --package <SPEC>...
For more information try --help
Yeah, it's just \n. This also works fine for the Windows line ending \r\n. Single carriage returns (\r) and other Unicode newline characters do not count as line endings. This generally matches the behavior of other standard programming tools like grep, vim, git, wc, etc.
You're right though, this would make sense to explain in the documentation. It also could be made configurable (as a stateful setting on the TSParser) pretty easily.
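To make the rule concrete, here is a minimal sketch of the position bookkeeping it implies (my illustration, not the library's actual code): only \n advances the row, so CRLF lines produce the same row numbers as LF lines, with the \r merely counted into the column of the line it ends.

```c
#include <stdint.h>

typedef struct { uint32_t row; uint32_t column; } TSPoint;

// Advance a point over one byte, treating '\n' as the only line ending.
// "\r\n" therefore bumps the row exactly once, and a lone '\r' never
// does; it just widens the current column.
static TSPoint advance_point(TSPoint point, uint8_t byte) {
  if (byte == '\n') {
    point.row += 1;
    point.column = 0;
  } else {
    point.column += 1;
  }
  return point;
}
```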
Running script/test -g gives this error
Are you on Windows?
Yes. I was surprised that these bash scripts actually run at all.
Well... maybe I wasn't running the bash scripts. Running test.cmd -g gives the same error.
Yeah, on Windows I don't have any built-in helpers for running a debugger.
Running the executable directly didn't lock up, but 73 tests failed. It produced a LOT of output - I can post it here if you're interested.
Maybe look at the AppVeyor config to see a concrete set of instructions that reliably pass on Windows?
I don’t know if I am able to debug the specific failures that you’re seeing.
Max Brunsfeld writes:
Unfortunately, it can't. To do that, Tree-sitter would have to keep a copy of the source code, which in many apps, is undesirable from a memory and performance perspective.
Hmm. If the lexer reports newlines in added text, and you count deleted newlines in deleted tree nodes, that would be sufficient. That's what I'm doing in WisiToken.
-- Stephe
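A sketch of that scheme, assuming the inserted text is available when the edit is applied and using the \n-only definition of a line break from earlier in the thread (illustration only; no such helper exists in the library):

```c
#include <stdint.h>

typedef struct { uint32_t row; uint32_t column; } TSPoint;

// Derive the end point of an insertion from its start point and the
// inserted bytes, so the caller would not have to supply it. Deleted
// newlines would symmetrically come from the rows already recorded in
// the removed tree nodes.
static TSPoint point_after_insertion(
  TSPoint start,
  const uint8_t *inserted,
  uint32_t length
) {
  TSPoint point = start;
  for (uint32_t i = 0; i < length; i++) {
    if (inserted[i] == '\n') {
      point.row += 1;
      point.column = 0;
    } else {
      point.column += 1;
    }
  }
  return point;
}
```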
Maybe look at the appveyor config to see a concrete set of instructions that reliably pass on windows?
I'm not so familiar with AppVeyor - can you elaborate on that? From what I can see in .appveyor.yml it just runs test.cmd, and I can't see any kind of list in there.
Regarding the failed tests, most of them are panics about "Leaked allocation indices" and "Once instance has previously been poisoned". Some are about "Failed to load symbol tree_sitter_html". The last one seems to be related to my setup.
The freeze seems to only happen with the environment variable RUST_TEST_THREADS set to 1. It's stuck in parser.c:394 in ts_parser__lex(), where the scan() function repeatedly calls ts_lexer__advance(). Since self->chunk is empty, it always returns immediately.
The scan() function of self->language->external_scanner comes from external_and_internal_anonymous_tokens.dll, which I can't get VSCode to load symbols for.
The scanner has a few pieces of code that look like while (lexer->lookahead != 'SOMETHING') lexer->advance(lexer, true/false); which could be where it's freezing. The compilation process does not seem to produce symbol files for it, which makes it harder to debug.
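That pattern is a plausible culprit: at the end of the input, lexer->lookahead is 0 and advance() no longer moves, so a loop that only waits for a specific character spins forever. A sketch of the hazard and the guarded form (an illustration, not the fixture's actual code):

```c
#include <tree_sitter/parser.h>

// Hangs at end of input if the character never appears, because
// lookahead stays 0 and advance() becomes a no-op:
//
//   while (lexer->lookahead != '\'') lexer->advance(lexer, false);

// Guarded version: also stop once the input is exhausted.
static void skip_to_quote(TSLexer *lexer) {
  while (lexer->lookahead != '\'' && lexer->lookahead != 0) {
    lexer->advance(lexer, false);
  }
}
```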
@maxbrunsfeld Any tips on how I can debug this problem? (e.g., how can I enable debug output?) Also, can you reproduce this on your side? I'm not sure if this is a problem with the code or if it's a problem with my setup, especially since having the tests run single-threaded should remove problems, not create new ones.
I don't think it's a problem with the code, because we have a CI system set up that runs the tests on Windows on every commit. That's what I meant by consulting the AppVeyor config file: it is a concrete set of steps that is known to work on some standard Windows VMs.
I cloned a fresh repo and followed the commands in .appveyor.yml but it still froze. Honestly it's quite a simple process and I doubt there's much that can go wrong.
I noticed that there's a cache entry in the file - could caching lead to stale tests being run?
could caching lead to stale tests being run?
I don't think so.
It's been a while since I did native debugging on Windows. I've used WinDBG in the past. Is that what you've already been using to investigate?
I've been using VSCode which is working relatively well. One issue is that there are no symbols for the scanner for that test, but the scanner is relatively simple and it's not a big issue. I'm also not quite sure where to start - where are the tests defined? How can I get it to run individual tests?
Ok - this freezes even when running only that example of that language of that test. More specifically, corpus - single-line statements - external tokens of the language external_and_internal_anonymous_tokens. Here's the log:
new_parse
process version:0, version_count:1, state:1, row:1, col:0
lex_external state:1, row:1, column:0
lex_internal state:0, row:1, column:0
skip character:13
skip character:10
lex_external state:1, row:1, column:0
lex_internal state:0, row:1, column:0
skip character:13
skip character:10
skip_unrecognized_character
consume character:'''
lex_external state:1, row:2, column:1
lex_internal state:0, row:2, column:1
consume character:'h'
consume character:'e'
consume character:'l'
consume character:'o'
lexed_lookahead sym:ERROR, size:3, character:'''
detect_error
resume version:0
process version:0, version_count:1, state:0, row:1, col:0
lex_external state:1, row:1, column:0
lex_internal state:0, row:1, column:0
skip character:13
skip character:10
skip_unrecognized_character
consume character:'''
lex_external state:1, row:2, column:1
lex_internal state:0, row:2, column:1
consume character:'h'
consume character:'e'
consume character:'l'
consume character:'o'
lexed_lookahead sym:ERROR, size:3, character:'''
skip_token symbol:ERROR
process version:0, version_count:1, state:0, row:2, col:1
lex_external state:1, row:2, column:1
lex_internal state:0, row:2, column:1
consume character:'h'
consume character:'e'
consume character:'l'
consume character:'o'
lexed_lookahead sym:variable, size:5
recover_to_previous state:1, depth:2
skip_token symbol:variable
process version:1, version_count:2, state:1, row:2, col:1
shift state:2
condense
process version:0, version_count:1, state:2, row:2, col:6
lex_external state:1, row:2, column:6
consume character:'''
consume character:' '
consume character:'''
lexed_lookahead sym:string, size:3
shift state:4
process version:0, version_count:1, state:4, row:2, col:9
lex_external state:2, row:2, column:9
lex_internal state:1, row:2, column:9
lex_external state:1, row:2, column:9
lex_internal state:0, row:2, column:9
consume character:'w'
consume character:'o'
consume character:'r'
consume character:'l'
consume character:'d'
lexed_lookahead sym:variable, size:5
detect_error
resume version:0
process version:0, version_count:1, state:0, row:2, col:9
lex_external state:1, row:2, column:9
lex_internal state:0, row:2, column:9
consume character:'w'
consume character:'o'
consume character:'r'
consume character:'l'
consume character:'d'
lexed_lookahead sym:variable, size:5
recover_to_previous state:2, depth:2
skip_token symbol:variable
process version:1, version_count:2, state:2, row:2, col:9
shift state:4
condense
process version:0, version_count:1, state:4, row:2, col:14
lex_external state:2, row:2, column:14
lex_internal state:1, row:2, column:14
lex_external state:1, row:2, column:14
consume character:'''
consume character:13
consume character:10
Apparently it's failing at the end of the first word. It may seem that it's missing an 'l', but that's just VSCode being stupid and folding duplicate output lines. Also, now that I'm only running this one test, it hangs even without RUST_TEST_THREADS=1.
Any idea why this might be happening? From what I can see the word 'hello' should be processed by the external lexer, not the internal one.
Yeah, that narrows it down a bit. I am guessing it is because the LF (\n) line ending characters in the repository have all been converted to CRLF (\r\n) by some configuration setting in your installation of Git for Windows.
It looks like one of the "dummy" external scanners in the test/fixtures/test_grammars directory does not handle CRLF line endings.
I hacked the scanner to treat \r as whitespace (roughly as sketched below) and now it's passing. Unfortunately there are still a lot of failures like The specified procedure could not be found. (os error 127) and Failed to load symbol tree_sitter_html. What could be the reason for this?