antlr4 icon indicating copy to clipboard operation
antlr4 copied to clipboard

Json trees

Open parrt opened this issue 2 years ago • 15 comments

See https://github.com/antlr/antlr4/discussions/3772

parrt avatar Jul 04 '22 19:07 parrt

Are there any tree walkers or visitors that can utilize the JSON parse trees?

HSorensen avatar Jul 05 '22 14:07 HSorensen

I think it's up for runtime.

KvanTTT avatar Jul 05 '22 14:07 KvanTTT

Are there any tree walkers or visitors that can utilize the JSON parse trees?

Any Target language that knows how to read json, should be able to pull these in and walk the trees recursively. I will have to build one in JavaScript as I'm trying to build a server / client webpage that communicates using this format.

parrt avatar Jul 05 '22 17:07 parrt

Any Target language that knows how to read json, As you of course already know using either the tree walker or visitor patterns is just so much more efficient.

HSorensen avatar Jul 05 '22 18:07 HSorensen

@HSorensen yep, what I meant was somebody will have to deserialize the json into a proper parse tree and then the usual visitor in listener patterns will work great. This is only for sending stuff across a wire. If it's in memory this is all unnecessary.

parrt avatar Jul 05 '22 18:07 parrt

@KvanTTT looking better, right?

parrt avatar Jul 05 '22 19:07 parrt

Yes, separated class looks better.

KvanTTT avatar Jul 05 '22 20:07 KvanTTT

Added sample output and python parsing of json here: https://github.com/antlr/antlr4/discussions/3772

parrt avatar Jul 05 '22 20:07 parrt

This is really good stuff. Any work on the deserialization side? Do you think that's a bigger task?

JamesRTaylor avatar Nov 16 '22 02:11 JamesRTaylor

Any work on the deserialization side? Do you think that's a bigger task?

Hi. Haven't done any work on deserialization. sorry.

parrt avatar Dec 10 '22 18:12 parrt

There's a much better implementation I have for serialization in the antlr4-lab: https://github.com/antlr/antlr4-lab/blob/master/src/org/antlr/v4/server/JsonSerializer.java I hope to eventually fold this back into Antlr.

parrt avatar Dec 10 '22 18:12 parrt

BTW, I've spent probably two or three years going through different implementations for the parse tree representation and serialization. After working on tree rewriting problems, I've come to the conclusion that the Antlr tree/tokenstream/chastream/interval implementation is definitely not the best representation for tree rewriting, especially if there are hundreds of edits to do: keeping it all consistent is very time consuming, and very tedious. I've settled on a tree decorated with text and attribute nodes for tokens and skip and off-channel text. Plus it is more easily adapted to XPath and XSLT engines.

kaby76 avatar Dec 10 '22 21:12 kaby76

I stopped doing tree rewriting for transformation purposes, and now use either token stream rewriting, or simply creating an internal model, and then generating code from there

On Sat, Dec 10, 2022 at 1:05 PM Ken Domino @.***> wrote:

BTW, I've spent probably two or three years going through different implementations for the parse tree representation and serialization. After working on tree rewriting problems, I've come to the conclusion that the Antlr tree/tokenstream/chastream/interval implementation is definitely not the best representation for tree rewriting, especially if there are hundreds of edits to do: keeping it all consistent is very time consuming, and very tedious. I've settled on a tree decorated with text and attribute nodes for tokens and skip and off-channel text. Plus it is more easily adapted to XPath and XSLT engines.

— Reply to this email directly, view it on GitHub https://github.com/antlr/antlr4/pull/3773#issuecomment-1345384397, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABLUWKBAAUQO6XVSLBZW33WMTWDJANCNFSM52UCGZZA . You are receiving this because you authored the thread.Message ID: @.***>

-- Dictation in use. Please excuse homophones, malapropisms, and nonsense.

parrt avatar Dec 11 '22 01:12 parrt

My use case was fast deserialization of a parse tree with as compact as possible encoded data. This is using Go. The end goal was to make deserialization significantly faster than re-parsing the original string. I was able to reduce the encoded data to about 80% compared to the string and reduce the decode time to about 35% of the parse time. In the end, it wasn't significantly faster than re-parsing (kudos to the parser!) to justify the extra code and limitations imposed on grammar writing (see below). It did show some promise, though.

The approach I took was to:

  1. Serialize the parse tree (inspired by the serialization code) as a combination of token and rule indexes (not exactly a rule index as I needed to handle grammar # tags too).
  2. Generate code to enable deserialization by introspecting the visitor. The input was the original string (since in our use case this was always going to be persisted and available) and the output was the result of walking the visitor.
  • Re-tokenize the original string
  • Deserialize the parse tree (calling generated code using rule index)
  • Re-walk the visitor to produce the domain objects

The one limitation I had was that grammar variables were problematic in that I had no good way to re-establish their state. I could have serialized and deserialized them, but that would have bloated the encoded data pretty significantly (though of course that depends on how your grammar was written). I chose to just not use grammar variables in my tests.

JamesRTaylor avatar Dec 13 '22 19:12 JamesRTaylor

Thanks for the info.

The problem I'm working on, at the moment, is the scrape and conversion of the grammar for Python3, in Pegen syntax, to Antlr4. The parse of the Python3 grammar in Pegen syntax takes ~2.6s on a speedy machine--so long because the rules in a Pegen grammar do not have a rule terminator (e.g., the ';' at the end of a rule in Antlr4 grammars). Serialization of the parse tree, parser, and lexer tables takes ~0.03s, deserialization ~0.05s. The parse tree itself was changed to not use tokenstream/charstream/indices, but instead docorate the parse tree with text and attribute nodes for default channel and off-channel tokens and character strings. This representation allows for much faster tree node edits. In fact, the parse takes more time than deserialization, serialization, deleting and inserting hundreds of nodes involved in converting the grammar to Antlr4 syntax.

kaby76 avatar Dec 13 '22 21:12 kaby76