tree-sitter-julia Parser size testing (arrows)

I was doing some testing to bring down the parser size. Tried deleting some of the operators and noticed there is a significant difference when changing the arrow operators.

Case 1 (master, f1baa5f8e271)

  arrow: `
    <-- --> <-->
    ← → ↔ ↚ ↛ ↞ ↠ ↢ ↣ ↦ ↤ ↮ ⇎ ⇍ ⇏ ⇐ ⇒ ⇔ ⇴ ⇶ ⇷ ⇸ ⇹ ⇺ ⇻ ⇼ ⇽ ⇾ ⇿ ⟵ ⟶ ⟷ ⟹ ⟺ ⟻ ⟼ ⟽ ⟾ ⟿
    ⤀ ⤁ ⤂ ⤃ ⤄ ⤅ ⤆ ⤇ ⤌ ⤍ ⤎ ⤏ ⤐ ⤑ ⤔ ⤕ ⤖ ⤗ ⤘ ⤝ ⤞ ⤟ ⤠ ⥄ ⥅ ⥆ ⥇ ⥈ ⥊ ⥋ ⥎ ⥐ ⥒ ⥓ ⥖ ⥗ ⥚ ⥛ ⥞
    ⥟ ⥢ ⥤ ⥦ ⥧ ⥨ ⥩ ⥪ ⥫ ⥬ ⥭ ⥰ ⧴ ⬱ ⬰ ⬲ ⬳ ⬴ ⬵ ⬶ ⬷ ⬸ ⬹ ⬺ ⬻ ⬼ ⬽ ⬾ ⬿ ⭀ ⭁ ⭂ ⭃ ⥷ ⭄ ⥺ ⭇ ⭈ ⭉
    ⭊ ⭋ ⭌ ￩ ￫ ⇜ ⇝ ↜ ↝ ↩ ↪ ↫ ↬ ↼ ↽ ⇀ ⇁ ⇄ ⇆ ⇇ ⇉ ⇋ ⇌ ⇚ ⇛ ⇠ ⇢ ↷ ↶ ↺ ↻
  `,

❯ du -sh src/parser.c
49M     src/parser.c

❯ cat src/parser.c | rg "#define.*STATE"
#define STATE_COUNT 19881
#define LARGE_STATE_COUNT 9618

Case 2 (https://github.com/ChrHorn/tree-sitter-julia/commit/e66d1bf1a73e4e42e86a70830e0d02c2016cc92d)

Deleted most of the arrow operators.

  arrow: `
   <-- --> <-->
   ← → ↔
  `,

No visible change in states and parser size.

❯ du -sh src/parser.c
49M     src/parser.c

❯ cat src/parser.c | rg "#define.*STATE"
#define STATE_COUNT 19881
#define LARGE_STATE_COUNT 9618

Case 3 (https://github.com/ChrHorn/tree-sitter-julia/commit/e64b8fcfd7fcc78fdfeacd54b145ab265367799f)

Notice the only difference to Case 2 is the one deleted ↔ arrow operator.

  arrow: `
    <-- --> <-->
    ← →
  `,

Leads to a pretty significant reduction in states and parser size.

❯ du -sh src/parser.c
32M     src/parser.c

❯ cat src/parser.c | rg "#define.*STATE"
#define STATE_COUNT 12760
#define LARGE_STATE_COUNT 5869

Not really sure what's going on. I don' think it's Unicode, for example

  arrow: `
    ⥟ ⥢ ⥤ ⥦ ⥧ ⥨ ⥩ ⥪ ⥫ ⥬ ⥭ ⥰ ⧴ ⬱ ⬰ ⬲ ⬳ ⬴ ⬵ ⬶ ⬷ ⬸ ⬹ ⬺ ⬻ ⬼ ⬽ ⬾ ⬿ ⭀ ⭁ ⭂ ⭃ ⥷ ⭄ ⥺ ⭇ ⭈ ⭉
  `,

also results in a smaller parser. I also only noticed this behavior when changing the arrow operators. The change is always binary (either smaller, or current larger parser size), nothing in between.

@savq are you able to reproduce this on your end, any idea?

Jul 30 '24 08:07 ChrHorn

Yep. I was able to reproduce it.

The unicode operators affect the parser size in weird ways. That's part of the reason the ASCII operators are listed first (to comment out the unicode operators easily), but to be honest I never looked to much into it.

After some of the discussions regarding binary size, I've been thinking that maybe we could generate two parsers (like typescript/tsx in tree-sitter-typescript. A small one for editors, and a bigger one for other tools. The small one could omit most unicode operators, implicit multiplications, etc.

Aug 01 '24 23:08 savq

Did some further testing.

https://github.com/tree-sitter/tree-sitter-julia/blob/f1baa5f8e271109d01cc8ff7473c11df2d8a9aee/grammar.js#L970 Removing the \\p{Emoji} here also fixes the issue and results in a smaller parser. This would probably be a less intrusive change instead of deleting a whole bunch of operators.

Could also think about using a custom scanner for the identifiers. The Base Julia definition is already in C, so it could be a relatively simple copy-paste.

https://github.com/JuliaLang/julia/blob/996351f5f6651d1508aef3c35c7d37eb22a0fb1e/src/flisp/julia_extensions.c#L127

Aug 05 '24 08:08 ChrHorn

Could also think about using a custom scanner for the identifiers. The Base Julia definition is already in C, so it could be a relatively simple copy-paste.

The external scanner is an option. We do exactly that in JuliaPluto/lezer-julia.

To use Julia's C code directly we'd have to link JuliaStrings/utf8proc, or find another way to check unicode categories. tree-sitter used to have utf8proc as a dependency, but I've got no idea why they dropped it tho.

Aug 05 '24 21:08 savq

tree-sitter used to have utf8proc as a dependency, but I've got no idea why they dropped it tho.

Wasm parsers, which is also why pulling that in for an external scanner is a no-go. (It's still a dependency for the lib.)

Tree-sitter (and its base tooling) seems to have difficulties with certain unicode character classes, especially if you pick-and-choose from them rather than taking them as-is (Perl had a related issue). This is really something that should be solved upstream, so please open an issue about it there with the "binary change" here -- which might be helpful in tracking down a bug or possible optimization. (There have already been significant improvements in this area in recent version but obviously there's still much room for improvement.)

My recommendation here is to bisect the glyphs to identify low-hanging fruits (individually or in groups) and comment them out until the upstream issue is fixed. Parser size is also directly linked to load time, and you won't see many people use this parser if opening a Julia file freezes the editor for a second or half...

(Personally, since actually using Julia I have transitioned from "Unicode is cute!" to "Unicode is evil!" since it makes searching and replacing in editors so much harder, and over-use quickly makes code unreadable.)

Aug 09 '24 07:08 clason

tree-sitter-julia tree-sitter-julia copied to clipboard

Parser size testing (arrows)

Case 1 (master, f1baa5f8e271)

Case 2 (https://github.com/ChrHorn/tree-sitter-julia/commit/e66d1bf1a73e4e42e86a70830e0d02c2016cc92d)

Case 3 (https://github.com/ChrHorn/tree-sitter-julia/commit/e64b8fcfd7fcc78fdfeacd54b145ab265367799f)

tree-sitter-julia
tree-sitter-julia copied to clipboard