tree-sitter-sql
tree-sitter-sql copied to clipboard
Generated too large parser.c ~ 83M
With all PRs merged, tree-sitter will generate a very large parser.c file (about 83M). And it takes about 1 min to load the parser by tree-sitter python binding.
Unfortunately I don't have an idea at the moment on how to reduce its size 😭
@m-novikov Have you ever encountered a similar problem, or have any suggestions for optimization?
I think we can generate the parser.c
when building this project
Yeah, there doesn't seem to be redundant code here, nor particularly complex rules, but still a very large parser is generated.
With compared to the tree-sitter-python
parser, the size of the grammar.js
file 2x larger (1833 lines vs 967 lines).
But the generated parser.c
is almost 64x larger (83M vs 1.3M) than the one in tree-sitter-python
.
I don't have an idea on this yet 😭
Yeah, there doesn't seem to be redundant code here, nor particularly complex rules, but still a very large parser is generated.
With compared to the
tree-sitter-python
parser, the size of thegrammar.js
file 2x larger (1833 lines vs 967 lines).But the generated
parser.c
is almost 64x larger (83M vs 1.3M) than the one intree-sitter-python
.I don't have an idea on this yet 😭
Have you try the release mode
? the follow library produced by cargo build --release
seems not so large.
There is related issue in https://github.com/tree-sitter/tree-sitter/issues/1041 While removing parser from the repo is a good stop-gap solution, we would need to investigate the cause, generation times also getting progressively longer.
There is related issue in tree-sitter/tree-sitter#1041 While removing parser from the repo is a good stop-gap solution, we would need to investigate the cause, generation times also getting progressively longer.
@m-novikov Yes, I've seen this similar issue too. But there is not much useful information here.
By remove the generated code from the repo, we just ignore this issue.
On the other hand, when a user using tree-sitter-sql, for example, using it in Python code via the tree-sitter
python package. The user use it by the following steps:
- Build language (for tree-sitter-sql) by the
tree-sitter
package, it will look for theparser.c
andscanner.cc
files in thesrc
dir and compiles them; - Create a new
tree-sitter.Parser
instance and set language to the language built in step 1; - The
src/node-types.json
andsrc/grammar.json
can also be loaded and used by users;
If we remove the src
dir from the repo, the users still need to generate it by themselves. It will take 40s+
to generate this files (for PR#58). Also the size of parser.c
is large (83M+), this leads to a long time (about 1min) to build language (the step 1).
So, to solve the problem, we need to do something to reduce the size of generated code. By now, I did some investigations on this.
- Collect generated files size for each commit to see which commit increased the size too much. The following are the top commits:
hexsha | commit_message | grammar.js | grammar.json | node-types.json | parser.c |
---|---|---|---|---|---|
8d09bf6af0ba81c799e05d839263ec3b3c49d1e2 | Support CTE, CASE, etc. New supportings: - CTEs syntax; - DELETE statement; - ALTER TABLE RENAME COLUMN; - Conditional expressions; - LIMIT clause; - AT TIME ZONE expression; Improvements: - The ORDER BY clause; - The JOIN clause; - The select_subexpression; - The UPDATE statement; - Array type; - $.type: multi-words types (VARYING, PRECISION, WITH/WITHOUT TIME ZONE); - The $.field_access expression to $.json_access expression; | 0.02M, +0.0M, +2.46% | 0.14M, +0.0M, +2.21% | 0.11M, +0.01M, +10.77% | 3.95M, +1.59M, +67.12% |
f1a4b8c896cfd4041ac2283c9f4b6d814a7c0c5d | Improve create function statement & support create trigger statement | 0.03M, +0.0M, +19.18% | 0.18M, +0.04M, +30.48% | 0.12M, +0.01M, +13.91% | 11.89M, +7.94M, +200.86% |
bf3ea7293f436b1d226e6d5179a064f4344aa65f | Improve function call to support aggregate expressions | 0.03M, -0.0M, -0.48% | 0.28M, +0.0M, +1.69% | 0.15M, +0.0M, +0.82% | 21.78M, +2.82M, +14.88% |
1cdb6abb48a4013dd7256b064e6e83574e3d0cee | Support window definitions & window function calls | 0.04M, +0.0M, +4.05% | 0.29M, +0.01M, +4.06% | 0.16M, +0.01M, +4.31% | 25.25M, +3.47M, +15.91% |
d68da8ace84b5c9b66dcd9243afbb5dade795909 | Improved the FROM clause to make it more general | 0.04M, +0.0M, +3.28% | 0.3M, +0.01M, +3.5% | 0.16M, +0.01M, +5.24% | 34.1M, +8.85M, +35.07% |
84cedb178c6a94952f1feb3083bd4e7ef54c69d8 | Improve the ALTER TABLE statement | 0.05M, +0.01M, +21.2% | 0.38M, +0.06M, +19.64% | 0.22M, +0.04M, +19.67% | 38.05M, +2.45M, +6.88% |
33a6c85ad3a26ac3d6ab0a7c54a5b2fb2e9e495c | Improve the INSERT statement | 0.05M, +0.0M, +1.97% | 0.38M, +0.01M, +1.84% | 0.22M, +0.01M, +2.39% | 54.43M, +16.26M, +42.61% |
42f750d88408fee9b990c917b907a88e96ac0111 | Support more comparison predicates 1. BETWEEN...AND expression; 2. ISNULL and NOTNULL; 3. IS UNKNOWN and IS NOT UNKNOWN; | 0.05M, +0.0M, +1.14% | 0.39M, +0.0M, +0.78% | 0.24M, +0.02M, +10.1% | 61.25M, +6.33M, +11.53% |
2660004c3495889ed947e1fb7979c17522f7a23f | Support composite values & resolve syntax conflicts | 0.05M, +0.0M, +1.15% | 0.39M, +0.0M, +1.03% | 0.25M, +0.01M, +3.89% | 62.51M, +1.26M, +2.05% |
87107e91e5b09b8d235f61e5135016d08fccd129 | Support combining queries, JSON comparison operators and type cast as column default | 0.05M, +0.0M, +2.02% | 0.4M, +0.01M, +1.29% | 0.25M, +0.0M, +0.47% | 79.69M, +17.18M, +27.48% |
5574bf3ca47507e87e3524f38783a21f0d33841d | Support VACUUM statement & improve alias | 0.05M, +0.0M, +2.28% | 0.41M, +0.01M, +3.04% | 0.25M, +0.0M, +0.69% | 82.5M, +2.81M, +3.53% |
- Then I checked changes of
grammar.js
for these top commits, and tried simply comment out some rules or clauses to see the effect. Following are the modifications and result sizes I've tried:
>>>>>>>>>> Sizes in the latest version
-rw-r--r-- 1 tim staff 421K Jul 8 21:49 grammar.json
-rw-r--r-- 1 tim staff 13K Jun 13 10:36 grammar.json.rej
-rw-r--r-- 1 tim staff 264K Jul 8 21:50 node-types.json
-rw-r--r-- 1 tim staff 17K Jun 13 10:36 node-types.json.rej
-rw-r--r-- 1 tim staff 83M Jul 8 21:50 parser.c
-rw-r--r-- 1 tim staff 5.5M Jun 13 10:36 parser.c.rej
-rw-r--r-- 1 tim staff 4.2K Jul 6 01:59 scanner.cc
drwxr-xr-x 3 tim staff 96B May 31 18:20 tree_sitter
>>>>>>>>>> With the select statement modified
1. Removed select_statement
-rw-r--r-- 1 tim staff 413K Jul 8 21:39 grammar.json
-rw-r--r-- 1 tim staff 13K Jun 13 10:36 grammar.json.rej
-rw-r--r-- 1 tim staff 257K Jul 8 21:39 node-types.json
-rw-r--r-- 1 tim staff 17K Jun 13 10:36 node-types.json.rej
-rw-r--r-- 1 tim staff 29M Jul 8 21:39 parser.c
-rw-r--r-- 1 tim staff 5.5M Jun 13 10:36 parser.c.rej
-rw-r--r-- 1 tim staff 4.2K Jul 6 01:59 scanner.cc
drwxr-xr-x 3 tim staff 96B May 31 18:20 tree_sitter
2. Removed window_clause in select_statement
-rw-r--r-- 1 tim staff 420K Jul 8 21:42 grammar.json
-rw-r--r-- 1 tim staff 13K Jun 13 10:36 grammar.json.rej
-rw-r--r-- 1 tim staff 263K Jul 8 21:42 node-types.json
-rw-r--r-- 1 tim staff 17K Jun 13 10:36 node-types.json.rej
-rw-r--r-- 1 tim staff 82M Jul 8 21:42 parser.c
-rw-r--r-- 1 tim staff 5.5M Jun 13 10:36 parser.c.rej
-rw-r--r-- 1 tim staff 4.2K Jul 6 01:59 scanner.cc
drwxr-xr-x 3 tim staff 96B May 31 18:20 tree_sitter
3. Removed from_clause in select_statement
-rw-r--r-- 1 tim staff 421K Jul 8 21:44 grammar.json
-rw-r--r-- 1 tim staff 13K Jun 13 10:36 grammar.json.rej
-rw-r--r-- 1 tim staff 264K Jul 8 21:45 node-types.json
-rw-r--r-- 1 tim staff 17K Jun 13 10:36 node-types.json.rej
-rw-r--r-- 1 tim staff 62M Jul 8 21:45 parser.c
-rw-r--r-- 1 tim staff 5.5M Jun 13 10:36 parser.c.rej
-rw-r--r-- 1 tim staff 4.2K Jul 6 01:59 scanner.cc
drwxr-xr-x 3 tim staff 96B May 31 18:20 tree_sitter
4. Removed group_by_clause in select_statement
-rw-r--r-- 1 tim staff 421K Jul 8 21:51 grammar.json
-rw-r--r-- 1 tim staff 13K Jun 13 10:36 grammar.json.rej
-rw-r--r-- 1 tim staff 263K Jul 8 21:51 node-types.json
-rw-r--r-- 1 tim staff 17K Jun 13 10:36 node-types.json.rej
-rw-r--r-- 1 tim staff 75M Jul 8 21:51 parser.c
-rw-r--r-- 1 tim staff 5.5M Jun 13 10:36 parser.c.rej
-rw-r--r-- 1 tim staff 4.2K Jul 6 01:59 scanner.cc
drwxr-xr-x 3 tim staff 96B May 31 18:20 tree_sitter
5. Removed window_clause, from_clause, group_by_clause from select_statement
-rw-r--r-- 1 tim staff 420K Jul 8 21:53 grammar.json
-rw-r--r-- 1 tim staff 13K Jun 13 10:36 grammar.json.rej
-rw-r--r-- 1 tim staff 263K Jul 8 21:53 node-types.json
-rw-r--r-- 1 tim staff 17K Jun 13 10:36 node-types.json.rej
-rw-r--r-- 1 tim staff 53M Jul 8 21:53 parser.c
-rw-r--r-- 1 tim staff 5.5M Jun 13 10:36 parser.c.rej
-rw-r--r-- 1 tim staff 4.2K Jul 6 01:59 scanner.cc
drwxr-xr-x 3 tim staff 96B May 31 18:20 tree_sitter
>>>>>>>> With expressions modified
1. Removed select_subexpression from _simple_expression
-rw-r--r-- 1 tim staff 421K Jul 8 22:00 grammar.json
-rw-r--r-- 1 tim staff 13K Jun 13 10:36 grammar.json.rej
-rw-r--r-- 1 tim staff 259K Jul 8 22:00 node-types.json
-rw-r--r-- 1 tim staff 17K Jun 13 10:36 node-types.json.rej
-rw-r--r-- 1 tim staff 83M Jul 8 22:00 parser.c
-rw-r--r-- 1 tim staff 5.5M Jun 13 10:36 parser.c.rej
-rw-r--r-- 1 tim staff 4.2K Jul 6 01:59 scanner.cc
drwxr-xr-x 3 tim staff 96B May 31 18:20 tree_sitter
2. Removed select_subexpression from all rules
-rw-r--r-- 1 tim staff 420K Jul 8 22:07 grammar.json
-rw-r--r-- 1 tim staff 13K Jun 13 10:36 grammar.json.rej
-rw-r--r-- 1 tim staff 259K Jul 8 22:07 node-types.json
-rw-r--r-- 1 tim staff 17K Jun 13 10:36 node-types.json.rej
-rw-r--r-- 1 tim staff 78M Jul 8 22:07 parser.c
-rw-r--r-- 1 tim staff 5.5M Jun 13 10:36 parser.c.rej
-rw-r--r-- 1 tim staff 4.2K Jul 6 01:59 scanner.cc
drwxr-xr-x 3 tim staff 96B May 31 18:20 tree_sitter
From the above data, we can see that the SELECT
statement introduced a large amount of size.
Unfortunately , on how to optimize it, I still have no good idea. Could you guys give me some suggestion?
Yeah, there doesn't seem to be redundant code here, nor particularly complex rules, but still a very large parser is generated. With compared to the
tree-sitter-python
parser, the size of thegrammar.js
file 2x larger (1833 lines vs 967 lines). But the generatedparser.c
is almost 64x larger (83M vs 1.3M) than the one intree-sitter-python
. I don't have an idea on this yet 😭Have you try the
release mode
? the follow library produced bycargo build --release
seems not so large.
It seems that the parser.c
, scanner.cc
etc. files are not generated by this manner, but those file are required when using the tree-sitter-sql
, at least in my case (use it in Python). Please take a look at my previous comment for more details.
parser.c
is generated file, but scanner.cc
is not. Use in python implies that user run tree-sitter generate
before. You can see it in #63 binding.gyp
in this PR has a step to generate files required.