binjs-ref
binjs-ref copied to clipboard
Pre-defined grammar table with file-local remap in multipart .binjs file
Background
The multipart format for .binjs
file contains [GRAMMAR]
section, that defines the table for index for each interface (or grammar name?) used in the file, like, 0 => CallExpression.
Problem
Currently there are 106 interfaces in BinSource.webidl_
file, and enumerating all of them takes ~2kB for each file.
Solution
Pre-define the table, and store it inside the engine.
Pros
- The
.binjs
file size can be reduced by up to 2kB, before compression - The time taken by parsing
[GRAMMAR]
section can be reduced
Cons
- If the number of interfaces becomes more than 127 in the future, index for some interface will take 2bytes, and
[TREE]
may take more spaces compared to the current format - If unknown grammar is used in the file, the engine cannot show the detail about the error (like, which syntax is not supported)
Optimization for solving the Cons 1
Embed an information for remapping the index, in [GRAMMAR]
section.
Example
Consider that the predefined table looks like the following (just an example, the actual table should use more efficient order):
Index | Interface |
---|---|
0 | _Null |
1 | VariableDeclaration |
2 | CallExpression |
3 | WhileStatement |
4 | SwitchCase |
... | ... |
50 | ExpressionStatement |
... | ... |
180 | IdentifierExpression |
... | ... |
200 | LiteralNumericExpression |
... | ... |
260 | Script |
... | ... |
The [TREE]
for print(10);
will look like the following byte stream:
28 # treeByteLen = 21
F5 02 # = 260 (Script)
00 # Script.OptionalAssertedVarScope = _Null
00 # Script.ListOfDirective.items = 0
02 # Script.parseListOfStatement.items = 1
64 # = 50 (ExpressionStatement)
04 # = 2 (CallExpression)
69 02 # = 180 (IdentifierExpression)
00 # IdentifierExpression.name = "print"
02 # Arguments.items = 1
91 02 # = 200 (LiteralNumericExpression)
00 00 00 00 00 00 24 40 # LiteralNumericExpression.value = 10
There, IdentifierExpression
, LiteralNumericExpression
, and Script
are taking 2 bytes, while most interfaces with 1 bytes are not used there.
If we remap the table like the following:
Index | Interface |
---|---|
0 | _Null |
1 | IdentifierExpression |
2 | CallExpression |
3 | LiteralNumericExpression |
4 | Script |
... | ... |
50 | ExpressionStatement |
... | ... |
180 | IdentifierExpression |
... | ... |
200 | LiteralNumericExpression |
... | ... |
260 | Script |
... | ... |
The [TREE]
for print(10);
will become the following byte stream:
24 # treeByteLen = 18
08 # = 4 (Script)
00 # Script.OptionalAssertedVarScope = _Null
00 # Script.ListOfDirective.items = 0
02 # Script.parseListOfStatement.items = 1
64 # = 50 (ExpressionStatement)
04 # = 2 (CallExpression)
02 # = 1 (IdentifierExpression)
00 # IdentifierExpression.name = "print"
02 # Arguments.items = 1
06 # = 3 (LiteralNumericExpression)
00 00 00 00 00 00 24 40 # LiteralNumericExpression.value = 10
The remap can be encoded like the following byte stream:
69 02 # from: 180 (IdentifierExpression)
02 # to: 1
91 02 # from: 200 (LiteralNumericExpression)
06 # to: 3
F5 02 # from: 260 (Script)
08 # to: 4
In this small example, the actual file size increases tho, in the large code, the size taken by the remap information will be negligible, compared to the [TREE]
data reduction.
Encoder can choose whether to remap or not, depending on the file content. They can remap interfaces with 2bytes index only when the interface is used many times in the file content.
Handling grammar evolution/extensions
For the base ES grammar, the grammar table is supposed to extend over time, keeping the index for pre-existing interfaces same.
For proposals, including the ECMAScript spec proposals and BinAST-specific optimization etc, there can be multiple proposal at once, so the table can temporarily branch.
This can be solved by the namespace (thanks to @RReverser), and merging table.
Namespace
Each proposal should have their own namespace (maybe URL-like thing), and namespace-local table (index => interface).
By keeping the table's pre-existing entry's index same, we don't have to add version number for the namespace.
Merging table
Supposing the pre-existing index keeps same over time, the merging can be done just by putting the table entries into file-local index.
Example
Suppose there are 3 tables:
- the base ES grammar, which has 106 entries at the time of generating the
.binjs
file - a proposal for adding Class fields grammar, which has 2 entries, and its namespace is "tc39/proposal-class-fields"
- a proposal for adding optimized LiteralNumericExpression, which has 4 entries, and its namespace is "binjs/optimized-numeric"
The file can merge the table by specifying the following information:
- the namespace: "tc39/proposal-class-fields" or "binjs/optimized-numeric"
- the start index for the namespace: 106 and 108
So that the table is merged like the following:
Index | Interface |
---|---|
0 | _Null |
1 | ArrayAssignmentTarget |
2 | ArrayBinding |
3 | ArrayExpression |
... | ... |
104 | YieldExpression |
105 | YieldStarExpression |
106 | ClassPublicField (from tc39/proposal-class-fields") |
107 | ClassPrivateField (from tc39/proposal-class-fields") |
108 | LiteralInt8Expression (from binjs/optimized-numeric) |
109 | LiteralInt16Expression (from binjs/optimized-numeric) |
110 | LiteralInt32Expression (from binjs/optimized-numeric) |
111 | LiteralFloatExpression (from binjs/optimized-numeric) |
The base ES grammar and both proposals may evolve over time, after the file is generated. But as long as the index for pre-existing entries keep same, and the merge is done from the smaller index, the merged table keeps same for the range used by the file.
The file-local remap can be done after the merge, in order to use smaller indices.
Solution for Cons 2
By providing the definition for each namespace in the external resource, possibly via the URL, the engine can retrieve the definition and can give users the detail about the error.
The engine can include the cache for base ES grammar, and as long as the index used by the file doesn't exceed the maximum index in the known table, it can use the cached table. Once the index used in the file goes beyond the known maximum index, or the file contains unknown namespace, the engine can retrieve the definition if necessary. (maybe only when showing the error)
Quick note: for current frameworks, the header is currently around 950b.
For the moment, I'm concentrating on what I hope will be larger gains: #126.