Support custom text decoding function

Open lukedan opened this issue 4 years ago • 41 comments

I've been working on a text editor that handles raw binary data of arbitrary character encoding, and I'm trying to integrate tree-sitter into it for syntax highlighting. However, it seems that tree-sitter only supports UTF-8 and UTF-16. I did a bit of digging and discovered that, in ts_lexer__get_lookahead in lexer.c, text is decoded through a UnicodeDecoderFunction, which is general enough to be extended to many other encodings. It would be great (and probably fairly simple) to expose this as part of the interface instead, so that users can use whichever encoding they want by providing custom decode functions.
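To make the request concrete, here is a minimal sketch of what a user-supplied decoder could look like. The callback shape (`DecodeFn`) and both function names are assumptions made for illustration; they are not taken from tree-sitter's headers.

```c
#include <stdint.h>

/* Hypothetical callback shape: read one character from `bytes`, store it in
   `*code_point`, and return the number of bytes consumed (0 if none). */
typedef uint32_t (*DecodeFn)(const uint8_t *bytes, uint32_t length, int32_t *code_point);

/* ISO-8859-1 (Latin-1): each byte is the code point of the same value. */
static uint32_t decode_latin1(const uint8_t *bytes, uint32_t length, int32_t *code_point) {
  if (length == 0) {
    *code_point = -1; /* nothing left to decode */
    return 0;
  }
  *code_point = (int32_t)bytes[0];
  return 1;
}

/* Counts characters in a buffer using whichever decoder the caller supplies. */
static uint32_t count_chars(const uint8_t *bytes, uint32_t length, DecodeFn decode) {
  uint32_t count = 0;
  while (length > 0) {
    int32_t code_point;
    uint32_t used = decode(bytes, length, &code_point);
    if (used == 0) break; /* decoder could not make progress */
    bytes += used;
    length -= used;
    count++;
  }
  return count;
}
```

A lexer written against a callback like this would not need to know which encoding the buffer uses; it would only need the returned byte count to advance.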

lukedan avatar Aug 29 '20 05:08 lukedan

Issue-Label Bot is automatically applying the label feature_request to this issue, with a confidence of 0.94. Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback!

issue-label-bot[bot] avatar Aug 29 '20 05:08 issue-label-bot[bot]

Yes, I think you’re right that this would make sense.

maxbrunsfeld avatar Aug 29 '20 16:08 maxbrunsfeld

While trying to run the tests (I haven't modified the code yet), it always seems to get stuck at

test language: "external_and_internal_anonymous_tokens"
  example: "corpus - single-line statements - internal tokens"
  example: "corpus - multi-line statements - internal tokens"
  example: "corpus - single-line statements - external tokens"

I've tried the master branch as well as the 0.18.0 branch. I checked the test grammars and they seem fairly simple. Is this expected?

lukedan avatar Jan 30 '21 21:01 lukedan

Also, I've had a question unrelated to the issue back when I started using the library: tree-sitter uses both byte positions and line/column positions, which seems a bit redundant. I understand that some languages may be sensitive to line breaks, but it does not seem necessary when making edits to the tree, especially since users can have different definitions for line breaks and it can be expensive to maintain this information specifically for tree-sitter.

lukedan avatar Jan 30 '21 21:01 lukedan

> it always seems to get stuck at

Can you run the tests under a debugger (gdb or lldb) to determine where the main thread is blocked? For convenience, you can do script/test -g to run with a debugger, assuming you have one installed.

maxbrunsfeld avatar Jan 30 '21 21:01 maxbrunsfeld

> tree-sitter uses both byte positions and line/column positions, which seems a bit redundant.

Most applications that use Tree-sitter need to be able to query for nodes' positions (in terms of row and column), not just their byte range.

maxbrunsfeld avatar Jan 30 '21 21:01 maxbrunsfeld

> Most applications that use Tree-sitter need to be able to query for nodes' positions (in terms of row and column), not just their byte range.

I understand that this is really helpful. What I was trying to say is that tree-sitter can obtain all the information it needs to generate row/column information simply by parsing the document. The user has already supplied byte positions to the library while making an edit, so they should not need to supply line/column information as well; tree-sitter can deduce that from the information it already has.

lukedan avatar Jan 30 '21 22:01 lukedan

Unfortunately, it can't. To do that, Tree-sitter would have to keep a copy of the source code, which, in many apps, is undesirable from a memory and performance perspective.

maxbrunsfeld avatar Jan 30 '21 22:01 maxbrunsfeld

I see. It would still be helpful, though, to properly define what counts as a line break; currently I can't find any documentation on this. From the source code it seems that tree-sitter only recognizes \n, but some applications also use \r and \r\n.

lukedan avatar Jan 30 '21 22:01 lukedan

Running script/test -g gives this error:

error: Found argument '-g' which wasn't expected, or isn't valid in this context

USAGE:
    cargo.exe test --package <SPEC>...

For more information try --help

lukedan avatar Jan 30 '21 22:01 lukedan

Yeah, it's just \n. This also works fine for the windows line ending \r\n. Single carriage returns (\r) and other unicode newline characters do not count as line endings. This generally matches the behavior of other standard programming tools like grep, vim, git, wc, etc.

You're right though, this would make sense to explain in the documentation. It also could be made configurable (as a stateful setting on the TSParser) pretty easily.
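The rule described above can be sketched as follows: only \n starts a new row, and \r is counted like any other byte in the column. The `Point` struct and `point_at_byte` are hypothetical names used for illustration, and the column is assumed to be measured in bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical row/column pair; both fields are zero-based. */
typedef struct {
  uint32_t row;
  uint32_t column;
} Point;

/* Returns the row/column of the byte at `offset`, treating only '\n' as a
   line break. */
static Point point_at_byte(const char *text, size_t length, size_t offset) {
  Point point = {0, 0};
  if (offset > length) offset = length;
  for (size_t i = 0; i < offset; i++) {
    if (text[i] == '\n') {
      point.row++;
      point.column = 0;
    } else {
      point.column++; /* a lone '\r' lands here and only widens the column */
    }
  }
  return point;
}
```

With \r\n input the \r simply becomes the last column of the line it ends, which is why Windows line endings still produce the expected row numbers.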

maxbrunsfeld avatar Jan 30 '21 22:01 maxbrunsfeld

> Running script/test -g gives this error

Are you on windows?

maxbrunsfeld avatar Jan 30 '21 22:01 maxbrunsfeld

Yes. I was surprised that these bash scripts actually run at all.

lukedan avatar Jan 30 '21 22:01 lukedan

Well... maybe I wasn't running the bash scripts. Running test.cmd -g gives the same error.

lukedan avatar Jan 30 '21 22:01 lukedan

Yeah, on windows I don't have any built-in helpers for running a debugger.

maxbrunsfeld avatar Jan 30 '21 22:01 maxbrunsfeld

Running the executable directly didn't lock up, but 73 tests failed. It produced a LOT of output - I can post them here if you're interested.

lukedan avatar Jan 30 '21 23:01 lukedan

Maybe look at the appveyor config to see a concrete set of instructions that reliably pass on windows?

I don’t know if I am able to debug the specific failures that you’re seeing.

maxbrunsfeld avatar Jan 30 '21 23:01 maxbrunsfeld

Max Brunsfeld writes:

> Unfortunately, it can't. To do that, Tree-sitter would have to keep a copy of the source code, which in many apps, is undesirable from a memory and performance perspective.

Hmm. If the lexer reports newlines in added text, and you count deleted newlines in deleted tree nodes, that would be sufficient. That's what I'm doing in WisiToken.
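The bookkeeping this suggests might look roughly like the sketch below: count the newlines in the inserted text and in the deleted text, and derive the row shift from the difference. All names here are hypothetical; this is not code from WisiToken or tree-sitter.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical row/column pair; both fields are zero-based. */
typedef struct {
  uint32_t row;
  uint32_t column;
} Point;

/* Number of '\n' bytes in a span of text. */
static uint32_t count_newlines(const char *text, size_t length) {
  uint32_t count = 0;
  for (size_t i = 0; i < length; i++) {
    if (text[i] == '\n') count++;
  }
  return count;
}

/* Position reached after moving from `start` across `text`. */
static Point advance_point(Point start, const char *text, size_t length) {
  Point point = start;
  for (size_t i = 0; i < length; i++) {
    if (text[i] == '\n') {
      point.row++;
      point.column = 0;
    } else {
      point.column++;
    }
  }
  return point;
}

/* How far every row after the edit shifts: newlines added minus removed. */
static int64_t row_delta(const char *deleted, size_t deleted_length,
                         const char *inserted, size_t inserted_length) {
  return (int64_t)count_newlines(inserted, inserted_length) -
         (int64_t)count_newlines(deleted, deleted_length);
}
```

Note that this still needs the deleted text, or a stored newline count for it, at the time of the edit, which is the information the earlier reply says Tree-sitter does not keep.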

-- Stephe

stephe-ada-guru avatar Jan 31 '21 01:01 stephe-ada-guru

> Maybe look at the appveyor config to see a concrete set of instructions that reliably pass on windows?

I'm not so familiar with appveyor - can you elaborate on that? From what I can see in .appveyor.yml it just runs test.cmd, and I can't see any kind of list in there.

Regarding the failed tests, most of them are panics about "Leaked allocation indices" and "Once instance has previously been poisoned". Some are about "Failed to load symbol tree_sitter_html". The last one seems to be related to my setup.

lukedan avatar Jan 31 '21 02:01 lukedan

The freeze seems to only happen with the environment variable RUST_TEST_THREADS set to 1. It's stuck in parser.c:394 in ts_parser__lex(), where the scan() function repeatedly calls ts_lexer__advance(). Since self->chunk is empty it always returns immediately.

lukedan avatar Jan 31 '21 03:01 lukedan

The scan() function of self->language->external_scanner comes from external_and_internal_anonymous_tokens.dll which I can't get VSCode to load symbols for.

lukedan avatar Jan 31 '21 03:01 lukedan

The scanner has a few pieces of code that look like while (lexer->lookahead != 'SOMETHING') lexer->advance(lexer, true/false); which could be where it's freezing. The compilation process does not seem to produce symbol files for it, which makes it harder to debug.
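One way a loop of that shape can spin forever is when the input ends before the awaited character shows up and advancing no longer changes the lookahead. The sketch below guards against that. `MockLexer` is a hypothetical stand-in written for this example, and it assumes a lookahead of 0 means end of input; it is not the real lexer struct passed to scanners.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the lexer handed to an external scanner. */
typedef struct MockLexer MockLexer;
struct MockLexer {
  int32_t lookahead; /* current character, 0 once the input is exhausted */
  void (*advance)(MockLexer *, bool skip);
  const char *text;
  size_t length;
  size_t position;
};

/* Moves one character forward; at end of input it leaves the position alone. */
static void mock_advance(MockLexer *lexer, bool skip) {
  (void)skip;
  if (lexer->position < lexer->length) lexer->position++;
  lexer->lookahead = lexer->position < lexer->length
                         ? (unsigned char)lexer->text[lexer->position]
                         : 0;
}

static MockLexer mock_lexer_new(const char *text, size_t length) {
  MockLexer lexer = {0, mock_advance, text, length, 0};
  lexer.lookahead = length > 0 ? (unsigned char)text[0] : 0;
  return lexer;
}

/* Advances until `target` is the lookahead. Returns false instead of looping
   forever if the input ends first. */
static bool skip_until(MockLexer *lexer, int32_t target) {
  while (lexer->lookahead != target) {
    if (lexer->lookahead == 0) return false; /* end of input: stop */
    lexer->advance(lexer, false);
  }
  return true;
}
```

Without the end-of-input check, the second case in a test like `skip_until(&lexer, '\'')` on the text "abc" would never return.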

lukedan avatar Jan 31 '21 03:01 lukedan

@maxbrunsfeld Any tips on how I can debug this problem? (e.g., how can I enable debug output?) Also, can you reproduce this on your side? I'm not sure if this is a problem with the code or if it's a problem with my setup, especially since having the tests run single-threaded should remove problems, not create new ones.

lukedan avatar Feb 01 '21 18:02 lukedan

I don't think it's a problem with the code because we have a CI system set up that runs the tests on windows on every commit. That's what I mean by consulting the Appveyor config file. It is a concrete set of steps that is known to work on some standard windows VMs.

maxbrunsfeld avatar Feb 01 '21 18:02 maxbrunsfeld

I cloned a fresh repo and followed the commands in .appveyor.yml but it still froze. Honestly it's quite a simple process and I doubt there's much that can go wrong.

I noticed that there's a cache entry in the file - could caching lead to stale tests being run?

lukedan avatar Feb 01 '21 18:02 lukedan

> could caching lead to stale tests being run?

I don't think so.

It's been a while since I did native debugging on Windows. I've used WinDBG in the past. Is that what you've already been using to investigate?

maxbrunsfeld avatar Feb 01 '21 18:02 maxbrunsfeld

I've been using VSCode which is working relatively well. One issue is that there are no symbols for the scanner for that test, but the scanner is relatively simple and it's not a big issue. I'm also not quite sure where to start - where are the tests defined? How can I get it to run individual tests?

lukedan avatar Feb 01 '21 19:02 lukedan

Ok - this freezes even when running only that example of that language of that test. More specifically, corpus - single-line statements - external tokens of language external_and_internal_anonymous_tokens. Here's the log:

new_parse
process version:0, version_count:1, state:1, row:1, col:0
lex_external state:1, row:1, column:0
lex_internal state:0, row:1, column:0
  skip character:13
  skip character:10
lex_external state:1, row:1, column:0
lex_internal state:0, row:1, column:0
  skip character:13
  skip character:10
skip_unrecognized_character
  consume character:'''
lex_external state:1, row:2, column:1
lex_internal state:0, row:2, column:1
  consume character:'h'
  consume character:'e'
  consume character:'l'
  consume character:'o'
lexed_lookahead sym:ERROR, size:3, character:'''
detect_error
resume version:0
process version:0, version_count:1, state:0, row:1, col:0
lex_external state:1, row:1, column:0
lex_internal state:0, row:1, column:0
  skip character:13
  skip character:10
skip_unrecognized_character
  consume character:'''
lex_external state:1, row:2, column:1
lex_internal state:0, row:2, column:1
  consume character:'h'
  consume character:'e'
  consume character:'l'
  consume character:'o'
lexed_lookahead sym:ERROR, size:3, character:'''
skip_token symbol:ERROR
process version:0, version_count:1, state:0, row:2, col:1
lex_external state:1, row:2, column:1
lex_internal state:0, row:2, column:1
  consume character:'h'
  consume character:'e'
  consume character:'l'
  consume character:'o'
lexed_lookahead sym:variable, size:5
recover_to_previous state:1, depth:2
skip_token symbol:variable
process version:1, version_count:2, state:1, row:2, col:1
shift state:2
condense
process version:0, version_count:1, state:2, row:2, col:6
lex_external state:1, row:2, column:6
  consume character:'''
  consume character:' '
  consume character:'''
lexed_lookahead sym:string, size:3
shift state:4
process version:0, version_count:1, state:4, row:2, col:9
lex_external state:2, row:2, column:9
lex_internal state:1, row:2, column:9
lex_external state:1, row:2, column:9
lex_internal state:0, row:2, column:9
  consume character:'w'
  consume character:'o'
  consume character:'r'
  consume character:'l'
  consume character:'d'
lexed_lookahead sym:variable, size:5
detect_error
resume version:0
process version:0, version_count:1, state:0, row:2, col:9
lex_external state:1, row:2, column:9
lex_internal state:0, row:2, column:9
  consume character:'w'
  consume character:'o'
  consume character:'r'
  consume character:'l'
  consume character:'d'
lexed_lookahead sym:variable, size:5
recover_to_previous state:2, depth:2
skip_token symbol:variable
process version:1, version_count:2, state:2, row:2, col:9
shift state:4
condense
process version:0, version_count:1, state:4, row:2, col:14
lex_external state:2, row:2, column:14
lex_internal state:1, row:2, column:14
lex_external state:1, row:2, column:14
  consume character:'''
  consume character:13
  consume character:10

Apparently it's failing at the end of the first word. It may look like an 'l' is missing from the log, but that's just VSCode folding duplicate output lines. Also, now that I'm only running this test, it hangs even without RUST_TEST_THREADS=1.

lukedan avatar Feb 01 '21 19:02 lukedan

Any idea why this might be happening? From what I can see the word 'hello' should be processed by the external lexer, not the internal one.

lukedan avatar Feb 01 '21 20:02 lukedan

Yeah, that narrows it down a bit. I am guessing it is because the LF (\n) line ending characters in the repository have all been converted to CRLF (\r\n) because of some configuration setting in your installation of Git for windows.

It looks like one of the "dummy" external scanners in the test/fixtures/test_grammars directory does not handle CRLF line endings.
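A scanner that tolerates CRLF input could skip \r together with the other whitespace characters. The sketch below shows the idea on a plain buffer; both function names are hypothetical and this is not the fixture scanner's actual code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Whitespace set that includes '\r', so CRLF line endings are skipped whole. */
static bool is_scanner_whitespace(int32_t c) {
  return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

/* Returns the index of the first non-whitespace byte at or after `position`. */
static size_t skip_whitespace(const char *text, size_t length, size_t position) {
  while (position < length &&
         is_scanner_whitespace((unsigned char)text[position])) {
    position++;
  }
  return position;
}
```

An alternative that leaves the scanners untouched is to keep Git from converting the checked-out files to CRLF in the first place.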

maxbrunsfeld avatar Feb 01 '21 21:02 maxbrunsfeld

I hacked the scanner to treat \r as whitespace and now it's passing. Unfortunately there are still a lot of failures, such as "The specified procedure could not be found. (os error 127)" and "Failed to load symbol tree_sitter_html". What could be the reason for this?

lukedan avatar Feb 01 '21 22:02 lukedan