AntlrVSIX
AntlrVSIX copied to clipboard
Replace in-process Trash commands with out-of-process commands
In order to be less monolithic, Trash needs to move away from in-process commands. All commands should be implemented as separate programs, just like everything in Cygwin or Ming. However, unlike Bash, trees are passed. Perhaps this can be implemented using a memory-mapped file? I hate to use stdin/out and a serialized tree.
Although it is noted in https://github.com/antlr/antlr4/issues/1006 and https://github.com/antlr/antlr4/issues/233 that there's no serialization of parse trees, it's easily remedied with a custom JSON serializer. So, this is the first step to know whether passing large JSON will work.
I implemented a verbose json serializer and deserializer. Although this is just anecdotal evidence, writing out a serialization is very quick, but reconstructing the parse tree is not, based on testing the parse of PlSqlParser.g4: < 1s serialize, ~6s deserialize. (The implementation uses MS's System.Text.Json and System.Text.Json.Serialization libraries, with custom serializer code. This implementation serializes both the tree and the token stream. The token stream accounts for 10MB, while the parse tree accounts for 20MB.)
I've been thinking about serialization. The argument goes that there is no support in Antlr for serialization because the text can be re-parsed quickly. In other words, the text is the serialization of the tree. While it is a serialization, there are other concerns:
- Not all parses are fast.
- If the grammar is unoptimized, or if the parse is coupled with semantic analysis, then reproducing the tree will be slow.
- If you want to decouple the tree from the implementation then you have no option. Antlr is written for specific targets, but not all languages are supported. Adding in change requests to Antlr is extremely slow.
- You want to use the serialization for unit testing.
On the implementation, I'll start first with the implementation I am now working on.
- The serialization is a JSON file. The token stream is represented as an array of string and an array of integers, where the concatenation of the strings is the file, and the array of integers contains all the token type. For PowerBuilderLexer.g4 (my copy, 156KB of source code), the stream is represented in 480KB, which is a 3x expansion factor. I could create a string table, and lower that somewhat. This JSON representation contains no indentation or spaces.
The best representation I can devise is a "canonical representation"(COCKAYNE, EJ. "Linear Algorithms on Recursive Representations of Trees." JOURNAL OF COMPUTER AND SYSTEM SCIENCES 18 (1979): 76-85.) Basically, it is a list of parents for each node, with the array sorted in a preorder traversal. The size of the edge list will be equal to the number of nodes in the tree. Certainly, you don't want to serialize the tree and represent it as adjacency lists.
I now have a FAST!!!! implementation for serialization/deserialization.
@kaby76 Hi,kaby76,i have the same problem. my input is 889Byte and after parse,the ParseTree is 63KB, expansion factor nearly 6x. i tried java's native serialization/deserialization. as you say it's slow then reparse the input use antlr again.
i saw you say "I now have a FAST!!!! implementation for serialization/deserialization." !do you have any new idea about this problem? wish your response!
@kaby76 Hi,kaby76,i have the same problem. my input is 889Byte and after parse,the ParseTree is 63KB, expansion factor nearly 6x. i tried java's native serialization/deserialization. as you say it's slow then reparse the input use antlr again.
i saw you say "I now have a FAST!!!! implementation for serialization/deserialization." !do you have any new idea about this problem? wish your response!
Antlrvsix was moved over to Trash, and the parse tree completely rewritten. The new representation is just a nested JSON object, with children in an array. In addition, the "token stream" concept was removed completely because it was a terrible representation for tree editing. So, a parse tree, along with some other information about the grammar, is now in a JSON file. Here is an example:
ooo.txt. This file is generated by Trash's trparse with the --fmt
option. The parse tree itself starts at the root of the parse tree, on line 1416. A parse node is serialized as:
- int32, the type of node.
- int32, the parser rule number.
- string, the name of the parser rule.
- children
It can probably be improved, since the parser rule name can be derived from the parser rule number.
In any case, the representation is more than fast enough for most purposes, which are usually just queries using XPath expressions, but also speedy for tree editing.