
Incremental Parsing support

Open dberlin opened this issue 6 years ago • 18 comments

Hey folks, this pull request adds incremental parsing support to antlr4ts. I'm explicitly not seeking to get this merged in its exact form; I'm more curious whether y'all think it is worth the time/energy to try to get it merged into ANTLR (either the reference or optimized runtime).

I wrote the code for the TypeScript runtime first because I'm using it with vscode. I am happy to backport it to Java.

As you can see, it is deliberately structured to be simple and small, with no extra dependencies beyond the existing runtime, and it does not require modifying any existing part of the runtime.

I've tested it on fairly complex grammars (the included tests exercise the JavaLR grammar, among others).

I added a doc (IncrementalParser.md) that explains how it works at a high level, as well as the outstanding issues. The tests currently cover basic add/remove/etc. edits on a simple grammar as well as the JavaLR grammar, and verify that the affected parts of the parse tree change (and the unaffected parts are reused) as expected.

My use case is a little unusual: I am parsing GCode files, which can get quite large (hundreds of megabytes). While ANTLR is quite fast, the parse time on a 6.5 MB GCode file is already 3-6 seconds (depending on whether parse trees are built). With the incremental parser, adding a new line takes on the order of 10 ms.

We don't do incremental lexing, but for a lot of languages it's not difficult to do by hand (particularly if you only care about the token text being right), as sketched below.
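
For a concrete sense of what doing it by hand looks like, here is a minimal line-based sketch (toy tokenizer and made-up names, not code from this PR). It assumes tokens never span lines, which holds for line-oriented languages like GCode:

```typescript
// Toy, line-based incremental lexing sketch. All names are illustrative; a
// real implementation would run your generated ANTLR lexer over just the
// changed region instead of splitting on whitespace.

interface Tok {
  text: string;
  line: number;
}

function lexLine(text: string, line: number): Tok[] {
  return text.split(/\s+/).filter(t => t.length > 0).map(t => ({ text: t, line }));
}

// Re-lex only new lines [firstChanged, lastChanged] and splice the result into
// the old token list, shifting the line numbers of everything after the edit.
// lineDelta = (number of new lines) - (number of old lines) in the edited range.
function incrementalLex(
  oldTokens: Tok[],
  newLines: string[],
  firstChanged: number,
  lastChanged: number,
  lineDelta: number,
): Tok[] {
  const before = oldTokens.filter(t => t.line < firstChanged);
  const after = oldTokens
    .filter(t => t.line > lastChanged - lineDelta)
    .map(t => ({ ...t, line: t.line + lineDelta }));

  const relexed: Tok[] = [];
  for (let line = firstChanged; line <= lastChanged; line++) {
    relexed.push(...lexLine(newLines[line], line));
  }
  return [...before, ...relexed, ...after];
}

// Example: insert "G1 X5" as a new line 1 of a three-line file.
const oldLines = ["G0 X0 Y0", "G1 X10 Y10", "M2"];
const newLines = ["G0 X0 Y0", "G1 X5", "G1 X10 Y10", "M2"];
const oldTokens = oldLines.flatMap((l, i) => lexLine(l, i));
console.log(incrementalLex(oldTokens, newLines, 1, 1, 1));
```

A real implementation that allows multi-line tokens would also need to keep re-lexing past the edit until the new token stream re-synchronizes with the old one.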

As mentioned, I'm happy to do the work for TS/Java and happy to push on it; I'm just trying to understand whether I'm the only one in the world who cares :)

Thanks for any thoughts/feedback!

dberlin avatar Apr 05 '19 22:04 dberlin

I have actually backported this to ANTLR4 Java and submitted it to the main repo. I will keep this up to date with that (and if it gets turned down, I'll close this).

dberlin avatar Apr 21 '19 22:04 dberlin

@dberlin this sounds interesting, I'll have a look.

BurtHarris avatar Jul 02 '19 03:07 BurtHarris

SGTM. Incremental parsing is in good shape.

I got slammed at work and then had another child, so I haven't had time to finish the incremental lexing.

dberlin avatar Jul 02 '19 03:07 dberlin

(The incremental lexing info can be found here: https://github.com/antlr/antlr4/issues/2534. The TL;DR is that it works, but I did not finish changed-token-list generation. So incremental parsing and lexing work fine individually; they just don't automatically integrate.)

dberlin avatar Jul 02 '19 03:07 dberlin

A bit late to the party, but this is super cool. Are there any plans on finishing this up and merging it?

AlexGustafsson avatar Apr 08 '20 11:04 AlexGustafsson

@dberlin, can you help me understand whether incremental parsing can help with a fundamental mismatch between the Java stream model (which uses blocking I/O) and the JavaScript stream model, where no blocking for I/O is permitted?

In JavaScript, rather than pulling data from a stream, data arrives in chunks, which are delivered by a callback (continuation-passing style) or, more recently, via Promises. Promises have led to language extensions such as async/await, where the code can look very much as if it supported blocking I/O, but it's an illusion.
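
For concreteness, here is roughly what the push model looks like in practice (plain Node streams, nothing antlr4ts-specific; the GCode snippet is just filler). The consumer never pulls, so to feed a pull-based lexer today you end up buffering the whole input first:

```typescript
import { Readable } from "stream";

// Data is pushed to us in chunks and we must not block. The simplest way to
// feed a pull-based lexer is to buffer everything first; a genuinely
// streaming lexer would have to be restructured around async/await.
async function readAll(stream: Readable): Promise<string> {
  let text = "";
  for await (const chunk of stream) {
    text += chunk; // chunks arrive when the runtime delivers them; we never pull
  }
  return text;
}

async function main(): Promise<void> {
  const input = await readAll(Readable.from(["G1 X10", " Y20\n"]));
  console.log(JSON.stringify(input)); // "G1 X10 Y20\n"
}

main().catch(err => console.error(err));
```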

BurtHarris avatar Apr 10 '20 18:04 BurtHarris

@dberlin can you help me understand the parts of this feature which currently require changes to the core library? I'm hoping to find a way that it can be used without needing to change the core code generator or runtime.

sharwell avatar Apr 10 '20 19:04 sharwell

Let me do my best to go in order.

@AlexGustafsson I won't have a chance to finish this up anymore (had another kid since then, and moving across the country) @sharwell It is not possible to do it without adding support to the core runtime in various places (IE adding Incremental* classes), or at least, it doesn't occur to me a way to do it. I can express what it's doing, and did my best to do that already in IncrementalParsing.md, happy to help understand any specifics.

I'm not as familiar with the ANTLR core runtime as y'all are, I expect.

It may be possible to do this without modifying the code generator, but I'm not sure how. You would have to move the guard check out of the code generator and into the core runtime, which would require further modifying IncrementalParser to make that work. I tried it at the start, and it was non-obvious how to figure out all the changes and places, so I gave up in favor of small, obvious changes to the code generator.

Beyond that, a lot of the complexity is due to the intermediate use of a token stream. You could simplify all of this a lot more if you required incremental lexing and drove the incremental lexing from the incremental parser as it walked. You could then, for example, get rid of IncrementalTokenStream and put most of that logic into the IncrementalParser. It would also be a lot faster. What happens right now is that ANTLR parsers expect token streams to be complete, and they seek/skip by calling nextToken, which also blocks.

We waste a bunch of time and energy returning/tracking and processing unchanged tokens.

If instead the incremental parser required an incremental lexer, it could be made to only ever request tokens for changed areas.

This would also clean up the whole interface you see here. It would also let the incremental parser tell the lexer what next tokens would be acceptable, so that it knows whether it needs to re-lex further or not.

(All of this unfortunately requires random access to the underlying thing being lexed, but, on the plus side, can be made non-blocking as a result)
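
To make the shape of that concrete, something like the following is what I have in mind (a hypothetical interface sketch only, not code from this PR):

```typescript
import { Token } from "antlr4ts";

// Hypothetical sketch of "the incremental parser drives the incremental
// lexer". None of this exists in the PR; it is just the interface shape
// described above.
interface ChangedRegion {
  start: number; // character offset of the first changed character
  stop: number;  // character offset just past the last changed character
}

interface DrivenIncrementalLexer {
  // Called by the parser only when it walks into a changed region; tokens in
  // unchanged regions come straight out of the previous token list.
  tokensFor(region: ChangedRegion): Token[];

  // Lets the parser tell the lexer which token types it could accept next, so
  // the lexer can decide whether it must keep re-lexing past the edit or can
  // stop once its output re-synchronizes with the old token stream.
  relexNeeded(nextAcceptableTokenTypes: Set<number>): boolean;
}

// Random access to the underlying text is required for this to work, but
// nothing here has to block: tokensFor could just as easily return a Promise.
```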

The incremental lexer changes i posted have an IncrementalLexer.md file that goes into this a bit.

If you want to see another implementation that uses a similar strategy to this, to see if you can figure out a way, take a look at tree-sitter. I find it somewhat impenetrable, even knowing exactly how it works, but ...

It's also for LR, but both this and that are based on the same paper. The incremental lexing is driven by the parser in tree-sitter, but it is otherwise the same algorithm as the incremental lexer I posted.

dberlin avatar Apr 10 '20 21:04 dberlin

@dberlin, Is this designed to deal with incremental parsing as in a portion of the text might have changed (like in an IDE doing syntax checking), or as in the input stream continues to deliver characters beyond those previously available?

BurtHarris avatar Apr 10 '20 23:04 BurtHarris

The former.

dberlin avatar Apr 10 '20 23:04 dberlin

@sharwell, is there a pre-existing method in the code generator to alter what base class the generated class(es) are derived from?

BurtHarris avatar Apr 11 '20 00:04 BurtHarris

Let me differentiate capability from speed. It is capable of handling your case now.

Your case is equivalent to the case where all the text is added at the end, but it would be slow for the reasons the .md files cover (ANTLR's current way of doing this forces us to deal with unchanged tokens).

It could be made to deal with this case very well if you took the "incremental parser drives incremental lexer" approach described in those files.

In fact, it would be optimal and should be no slower than doing it all at once.

dberlin avatar Apr 11 '20 00:04 dberlin

(And, for example, tree-sitter, with the same algorithms, is used to parse input a character at a time.)

dberlin avatar Apr 11 '20 00:04 dberlin

Yes there is.

I started with that approach, and if I remember correctly, the context creation in each rule has side effects (or something along those lines) that made it very hard to move the guard rule check from where it is. I probably still have the code to do it that way somewhere, if you want to see if you can make it work.

dberlin avatar Apr 11 '20 00:04 dberlin

@dberlin: Returning to a higher-level discussion, language support in IDEs like vscode often has two different lexers for the same language:

  1. An incremental syntax-highlighting lexer which operates on a line-by-line basis, and whose only job is to classify tokens for color display purposes. These are often tightly integrated with the editor's buffer functionality.

  2. A full lexer and parser with semantic checking, which runs in a separate process (at least in vscode) to analyze for "problems" and generate the red squiggles. This process is a language service for the language.

Sub-second responsiveness in syntax highlighting is important, but highlighting only calls for an incremental lexer, no parsing, at least as I've seen it (in Visual Studio and Visual Studio Code). In fact, the current trend seems to be to use TextMate grammars for highlighting purposes, as vscode supports.

Thus it's use case two (where the lexer and parser are integrated) where this sort of incremental parsing begins to become interesting. If the typical parse time you get for a multi-megabyte GCode file is measured in seconds, that seems pretty acceptable for use case two. So I think that while this sort of incremental lexing and parsing may be interesting, there doesn't seem to be a pressing need for it.

I see this as different from the streaming lexing and parsing that I thought could be related to incremental parsing. The goal in streaming is managing back-pressure so that the memory requirements for a simple command-line tool don't expand excessively.

BurtHarris avatar Apr 11 '20 02:04 BurtHarris

P.S. Because of the clever way ANTLR's ALL(*) works, the recognizer may perform much better after it's been warmed up on similar input. Did the 3-6 seconds for the 6.5 MB GCode file include the warm-up overhead?
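
(For reference, the difference is easy to measure by parsing the same text twice. This is just a sketch: GCodeLexer/GCodeParser are hypothetical generated classes and the "program" start rule is a guess; the antlr4ts entry points are the usual ones.)

```typescript
import { ANTLRInputStream, CommonTokenStream } from "antlr4ts";
// Hypothetical generated classes; substitute the ones from your grammar.
import { GCodeLexer } from "./GCodeLexer";
import { GCodeParser } from "./GCodeParser";

// The first parse pays for building the shared DFA/prediction caches; a
// second parse of similar input reuses them and is typically much faster.
function timeParse(text: string): number {
  const start = Date.now();
  const parser = new GCodeParser(
    new CommonTokenStream(new GCodeLexer(new ANTLRInputStream(text))),
  );
  parser.program(); // "program" is a guess at the start rule name
  return Date.now() - start;
}

const text = "G1 X10 Y20\n"; // load the real 6.5 MB file here instead
console.log("cold parse:", timeParse(text), "ms");
console.log("warm parse:", timeParse(text), "ms");
```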

BurtHarris avatar Apr 11 '20 02:04 BurtHarris

Hey Burt,

(The ANTLR timings are for the second and subsequent parses. The first parse is a lot slower due to the need to build out the state cache, etc., as you say.)

I'm fairly familiar with this style; I've worked on a lot of IDEs in the past 25 years. My day job is actually owning IDEs, programming languages, and a few other things for Google :)

A few things:

  1. Lexing is actually not good enough anymore for a lot of languages, and in fact there is a large and growing chorus of complaints asking vscode to move to tree-sitter as a result (which does sub-second responsiveness with parsing). See the long discussion starting here: https://github.com/microsoft/vscode/issues/50140, for example (there are other issues as well).

See https://marketplace.visualstudio.com/items?itemName=georgewfraser.vscode-tree-sitter for an example comparing syntax highlighting. Note that the tree-sitter version is not just better, it's actually faster than the TextMate grammar engine in vscode by far (even though vscode is doing delayed updates and tree-sitter is reparsing on every keystroke).

Truthfully, at least in the IDEs I worked on, lexing never really was good enough; it's just that outside of the Pan/Ensemble/Harmonia research projects, nobody bothered to make parsing fast enough (I actually worked on some incremental parsing IDEs in the past).

Other IDEs than vscode, such as Atom, already use tree-sitter as an incremental parsing system. See https://github.blog/2018-10-31-atoms-new-parsing-system/

  2. Atom and other IDEs that use tree-sitter (again, the same incremental algorithms as here) do reparsing on a per-keystroke basis and have no issue even with multi-megabyte files. I have a tree-sitter grammar for the same file, and the parsing/lexing time on changes is effectively zero.

So overall I'd say: I believe what you say has been true in the past; I definitely would not agree it is true anymore. In fact, I'd put a significant amount of money on tree-sitter taking over the IDE space (an easy bet to make, because it already has outside of a few mainstays :P).

But, in the end, what y'all choose to do is up to you. Feel free to close this if you aren't interested. I wrote it mostly because I've been using ANTLR since the PCCTS days, and it was fun. I already have tree-sitter grammars for my work that I can use in IDEs (including vscode), etc.

dberlin avatar Apr 11 '20 05:04 dberlin

Thanks @dberlin, you make a good case. I certainly am a bit out of date; I retired from MSFT a number of years back, and IDEs were never my focus.

I think introducing incremental parsing to ANTLR through the Java PR antlr/antlr4#2527 is the way to go. It looks like that PR needs some minor rebasing and a response to a threading concern.

@sharwell (and of course @parrt) are really the ANTLR4 / ALL(*) experts. I just tackled antlr4ts to get some hands-on experience with TypeScript, and because I didn't care for the existing JavaScript target the last time I tried it. Anyway, take my input with a grain of salt; I am no expert.

I am a little surprised Google is interested in a GCode IDE, or is that a side project? Is there anything you can point me at about that?

BurtHarris avatar Apr 14 '20 23:04 BurtHarris