Background

The multipart format for .binjs file contains [GRAMMAR] section, that defines the table for index for each interface (or grammar name?) used in the file, like, 0 => CallExpression.

Problem

Currently there are 106 interfaces in BinSource.webidl_ file, and enumerating all of them takes ~2kB for each file.

Solution

Pre-define the table, and store it inside the engine.

Pros

The .binjs file size can be reduced by up to 2kB, before compression
The time taken by parsing [GRAMMAR] section can be reduced

Cons

If the number of interfaces becomes more than 127 in the future, index for some interface will take 2bytes, and [TREE] may take more spaces compared to the current format
If unknown grammar is used in the file, the engine cannot show the detail about the error (like, which syntax is not supported)

Optimization for solving the Cons 1

Embed an information for remapping the index, in [GRAMMAR] section.

Example

Consider that the predefined table looks like the following (just an example, the actual table should use more efficient order):

Index	Interface
0	_Null
1	VariableDeclaration
2	CallExpression
3	WhileStatement
4	SwitchCase
...	...
50	ExpressionStatement
...	...
180	IdentifierExpression
...	...
200	LiteralNumericExpression
...	...
260	Script
...	...

The [TREE] for print(10); will look like the following byte stream:

28                       # treeByteLen = 21
F5 02                    # = 260 (Script)
00                       # Script.OptionalAssertedVarScope = _Null
00                       # Script.ListOfDirective.items = 0
02                       # Script.parseListOfStatement.items = 1
64                       # = 50 (ExpressionStatement)
04                       # = 2 (CallExpression)
69 02                    # = 180 (IdentifierExpression)
00                       # IdentifierExpression.name = "print"
02                       # Arguments.items = 1
91 02                    # = 200 (LiteralNumericExpression)
00 00 00 00 00 00 24 40  # LiteralNumericExpression.value = 10

There, IdentifierExpression, LiteralNumericExpression, and Script are taking 2 bytes, while most interfaces with 1 bytes are not used there.

If we remap the table like the following:

Index	Interface
0	_Null
1	IdentifierExpression
2	CallExpression
3	LiteralNumericExpression
4	Script
...	...
50	ExpressionStatement
...	...
180	IdentifierExpression
...	...
200	LiteralNumericExpression
...	...
260	Script
...	...

The [TREE] for print(10); will become the following byte stream:

24                       # treeByteLen = 18
08                       # = 4 (Script)
00                       # Script.OptionalAssertedVarScope = _Null
00                       # Script.ListOfDirective.items = 0
02                       # Script.parseListOfStatement.items = 1
64                       # = 50 (ExpressionStatement)
04                       # = 2 (CallExpression)
02                       # = 1 (IdentifierExpression)
00                       # IdentifierExpression.name = "print"
02                       # Arguments.items = 1
06                       # = 3 (LiteralNumericExpression)
00 00 00 00 00 00 24 40  # LiteralNumericExpression.value = 10

The remap can be encoded like the following byte stream:

69 02  # from: 180 (IdentifierExpression)
02     # to:   1
91 02  # from: 200 (LiteralNumericExpression)
06     # to:   3
F5 02  # from: 260 (Script)
08     # to:   4

In this small example, the actual file size increases tho, in the large code, the size taken by the remap information will be negligible, compared to the [TREE] data reduction.

Encoder can choose whether to remap or not, depending on the file content. They can remap interfaces with 2bytes index only when the interface is used many times in the file content.

Handling grammar evolution/extensions

For the base ES grammar, the grammar table is supposed to extend over time, keeping the index for pre-existing interfaces same.

For proposals, including the ECMAScript spec proposals and BinAST-specific optimization etc, there can be multiple proposal at once, so the table can temporarily branch.

This can be solved by the namespace (thanks to @RReverser), and merging table.

Namespace

Each proposal should have their own namespace (maybe URL-like thing), and namespace-local table (index => interface).

By keeping the table's pre-existing entry's index same, we don't have to add version number for the namespace.

Merging table

Supposing the pre-existing index keeps same over time, the merging can be done just by putting the table entries into file-local index.

Example

Suppose there are 3 tables:

the base ES grammar, which has 106 entries at the time of generating the .binjs file
a proposal for adding Class fields grammar, which has 2 entries, and its namespace is "tc39/proposal-class-fields"
a proposal for adding optimized LiteralNumericExpression, which has 4 entries, and its namespace is "binjs/optimized-numeric"

The file can merge the table by specifying the following information:

the namespace: "tc39/proposal-class-fields" or "binjs/optimized-numeric"
the start index for the namespace: 106 and 108

So that the table is merged like the following:

Index	Interface
0	_Null
1	ArrayAssignmentTarget
2	ArrayBinding
3	ArrayExpression
...	...
104	YieldExpression
105	YieldStarExpression
106	ClassPublicField (from tc39/proposal-class-fields")
107	ClassPrivateField (from tc39/proposal-class-fields")
108	LiteralInt8Expression (from binjs/optimized-numeric)
109	LiteralInt16Expression (from binjs/optimized-numeric)
110	LiteralInt32Expression (from binjs/optimized-numeric)
111	LiteralFloatExpression (from binjs/optimized-numeric)

The base ES grammar and both proposals may evolve over time, after the file is generated. But as long as the index for pre-existing entries keep same, and the merge is done from the smaller index, the merged table keeps same for the range used by the file.

The file-local remap can be done after the merge, in order to use smaller indices.

Solution for Cons 2

By providing the definition for each namespace in the external resource, possibly via the URL, the engine can retrieve the definition and can give users the detail about the error.

The engine can include the cache for base ES grammar, and as long as the index used by the file doesn't exceed the maximum index in the known table, it can use the cached table. Once the index used in the file goes beyond the known maximum index, or the file contains unknown namespace, the engine can retrieve the definition if necessary. (maybe only when showing the error)

May 01 '18 01:05 arai-a

Quick note: for current frameworks, the header is currently around 950b.

May 07 '18 15:05 Yoric

For the moment, I'm concentrating on what I hope will be larger gains: #126.

May 11 '18 06:05 Yoric

binjs-ref
binjs-ref copied to clipboard

Pre-defined grammar table with file-local remap in multipart .binjs file

Background

Problem

Solution

Pros

Cons

Optimization for solving the Cons 1

Example

Handling grammar evolution/extensions

Namespace

Merging table

Example

Solution for Cons 2

binjs-ref binjs-ref copied to clipboard

Pre-defined grammar table with file-local remap in multipart .binjs file

Background

Problem

Solution

Pros

Cons

Optimization for solving the Cons 1

Example

Handling grammar evolution/extensions

Namespace

Merging table

Example

Solution for Cons 2

binjs-ref
binjs-ref copied to clipboard