tree-sitter-haskell
tree-sitter-haskell copied to clipboard
Native and WASM parsers behave differently
For instance, if we parse examples/postgrest/test/Main.hs
:
- using native, the top-level node is
haskell
; - using wasm, the top-level node is
ERROR
.
In this case, the problem seems to be absent module headers.
That's interesting, you can try running a -DDEBUG
build, and see what the first few tokens the scanner produces are?
Not quite sure how to run a -DDEBUG build? You should be able to run these tests yourself, though. Lemme dig up the commands.
Check out #68 and run npm run examples-wasm
. You'll need Node and Emscripten.
Parsing examples/postgrest/test/Main.hs
passes if you delete either both let
statements, or the spec $ do ...
block
I've created a list of the (smallest to largest) files that fail to parse under wasm:
{ for i in examples/**/*.hs; do out="$(2>&1 ./script/tree-sitter-parse.js $i)"; if echo "$out" | grep 'Parse error' > /dev/null; then wc -l "$i"; fi; done } | sort -n | tee list
output: here
The first file passes when you delete the ->
in the GADT, or the type signature of test
, or any two comment lines... I'll have a proper look this evening.
If you can spot anything that those files have in common, and others don't, let me know
One likely reason for the behavior difference is that tree-sitter’s get_column function currently returns a byte count, not a character count. The JavaScript bindings encode strings as UTF16, so most characters are 2 bytes instead of 1, as in UTF8.
The get_column method is still a bit experimental, and we’ve talked about changing it back to returning a character count. In the meantime, scanners should make sure to not make assumptions about the absolute values that method returns, but instead just compare the values to each other.
Not sure if that’s the problem here, but wanted to raise it, since it’s gotcha currently, and this is one of the few grammars affected.
Thanks @maxbrunsfeld, that sounds very plausible.
I wonder if I ~halve~ double the indentation on those files whether they'll break with the native parser...
edit they didn't
It's about time I get off my butt and write good tests for my codepoint-based get_column
tree-sitter fork so it can be merged upstream.
I can confirm that this is mostly fixed using the latest version, the only file that doesn't parse using the web assembly bindings is <examples/haskell-language-server/test/testdata/format/BrittanyCRLF.formatted_range.hs>.
Ok great! If anyone gets a chance, it'd be super useful to extract from that file a minimal example of how the WASM and native parsers behave differently. At this point, if they still behave differently, that seems most likely to be a bug in Tree-sitter.