tree-sitter-rust

Mismatch between WASM and native builds

Open ubolonton opened this issue 5 years ago • 2 comments

There seems to be a mismatch between WASM and native builds (on macOS).

I built the CLI from the latest tree-sitter master (4c0fa29) and tried this code:

macro_rules! impl_pred {}

// TODO
i
impl_pred!(foo, bar);
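
For context, here is a minimal sketch (not part of the original report) of reproducing the native parse programmatically; it assumes the tree-sitter and tree-sitter-rust crates roughly as published around the time of this issue:

fn main() {
    // The snippet from above, including the stray `i` that triggers error recovery.
    let source = "macro_rules! impl_pred {}\n\n// TODO\ni\nimpl_pred!(foo, bar);\n";

    let mut parser = tree_sitter::Parser::new();
    parser
        .set_language(tree_sitter_rust::language())
        .expect("failed to load the Rust grammar");

    // Default parse: the input is treated as UTF-8 bytes.
    let tree = parser.parse(source, None).expect("parse failed");
    println!("{}", tree.root_node().to_sexp());
}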

This is the syntax tree reported by the native binding (tree-sitter test passes for this commit):

(source_file
 (macro_definition name:
                   (identifier))
 (line_comment)
 (macro_invocation macro:
                   (identifier)
                   (ERROR
                    (identifier))
                   (token_tree
                    (identifier)
                    (identifier))))

This is the syntax tree reported by WASM (through tree-sitter web-ui):

(source_file
 (macro_definition name:
                   (identifier))
 (line_comment)
 (identifier)
 (MISSING ";")
 (macro_invocation macro:
                   (identifier)
                   (token_tree
                    (identifier)
                    (identifier))))

ubolonton avatar Apr 11 '20 11:04 ubolonton

I think it may be because the wasm binding uses the UTF16 encoding, due to javascript’s string semantics. Do you still see a mismatch if you transcode to UTF16 in your rust test?
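
For what it's worth, a minimal sketch of that experiment (assuming the tree-sitter and tree-sitter-rust crates rather than tree-sitter's own test harness) could parse the same source both ways and compare the printed trees:

fn compare_encodings(source: &str) {
    let mut parser = tree_sitter::Parser::new();
    parser
        .set_language(tree_sitter_rust::language())
        .expect("failed to load the Rust grammar");

    // UTF-8 parse, as in the native tests.
    let utf8_tree = parser.parse(source, None).expect("utf-8 parse failed");

    // UTF-16 parse, matching what the WASM binding effectively sees,
    // since JavaScript strings are sequences of UTF-16 code units.
    let utf16: Vec<u16> = source.encode_utf16().collect();
    let utf16_tree = parser
        .parse_utf16(&utf16, None)
        .expect("utf-16 parse failed");

    println!("utf-8:  {}", utf8_tree.root_node().to_sexp());
    println!("utf-16: {}", utf16_tree.root_node().to_sexp());
}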

The reason that it matters is that certain “error costs” are calculated using nodes’ byte length. This is something I’ve been a bit unsatisfied with for a while, but I still don’t think it’s worth the memory cost to store each node’s Unicode character count.

We could make them behave more similarly by dividing the byte count by 2 when using UTF16. 😸 I'd be curious if you have any suggestions.

maxbrunsfeld avatar Apr 11 '20 16:04 maxbrunsfeld

I think it may be because the wasm binding uses the UTF16 encoding, due to javascript’s string semantics. Do you still see a mismatch if you transcode to UTF16 in your rust test?

It seems so! There's no mismatch if I change run_tests to do this:

// let tree = parser.parse(&input, None).unwrap();
// Transcode the UTF-8 test input to UTF-16 and parse with parse_utf16 instead:
let utf16: Vec<u16> = str::from_utf8(&input).unwrap()
    .encode_utf16().collect();
let tree = parser.parse_utf16(&utf16, None).unwrap();

The reason that it matters is that certain “error costs” are calculated using nodes’ byte length. This is something I’ve been a bit unsatisfied with for a while, but I still don’t think it’s worth the memory cost to store each node’s Unicode character count.

Yeah, I don't think it's worth it, at least for programming languages: non-ASCII characters are rare there, and mostly show up in comments and strings. I'm not sure about markup languages, though. Maybe we could let grammars override the error costs in specific places?

We could make them behave more similarly by dividing the byte count by 2 when using UTF16. 😸 I'd be curious if you have any suggestions.

I think making them more similar would be good, but I'm not sure about dividing by 2 when it's UTF16. 😅 In this case specifically, the syntax tree for UTF16 is more desirable:

In Atom (also UTF16, I assume): [screenshot: test2 atom rs]

In Emacs (UTF8): [screenshot: test2 emacs rs]

ubolonton avatar Apr 12 '20 06:04 ubolonton