congo-parser-generator In fault tolerant mode token-level recovery erroneously extends beyond the bounds of an enclosing lexical state change

In an expansion like "[" LEXICAL_STATE OTHER (Foo) "]" if OTHER does not contain a valid token for ] the consumeToken method will encounter the ], consider it to be unexpected, and attempt to recover by skipping it. After that chaos can ensue, depending on what follows the ]. I intend to fix this, as I am working on some tangental things, but I wanted to record it here so we don't forget. If you, Jon, want to look at in the meantime, that's great. For now I'm just working around this.

I'm thinking something like if the outer follow set for the expansion within the state change contains token(s) not in the lexical state they should not be included, and an INVALID token should replace them. Or, maybe, the follow set be just {INVALID} in this circumstance.

Mar 17 '25 15:03 adMartem

I'm not sure my description of the problem is correct, but there is definitely a problem that is fixed by including the out-of-scope ] token in the in-scope token set.

Mar 17 '25 16:03 adMartem

I'm pretty sure I understand the bug. But I'm not sure why it doesn't work. Let's see... it optionally consumes a Foo, but it needs to switch into the OTHER lexical state to see whether it can go into Foo. If it can't, then it should go back to the original lexical state and consume a "]". I mean, it just seems that the problem is that it's not reverting to the original lexical state when it should. And that should be a fairly straightforward bug to fix, I reckon. But, by all means, have a look and see if you can nail it. If not, I'll have a look at some point in the next few days, I guess.

Oh, but hold on, just to be clear. One question: is it failing to do what it should when there is a Foo there or when there is no Foo there? Or is it in either case...

Mar 17 '25 17:03 revusky

First, skip to my last message, then come back here.

So, this grammar actually works:

PARSER_PACKAGE=org.parsers.test;
TAB_SIZE=4;
ENSURE_FINAL_EOL;
TERMINATING_STRING="";
#define FT
#if FT
FAULT_TOLERANT;
FAULT_TOLERANT_DEFAULT=true;
#endif

<DEFAULT,OTHER> UNPARSED : 
    < #LINE_TERMINATOR: ( [ "\n", "\r" ] | "\r\n" ) > 
    |
    < #INLINE_WHITESPACE: ( [ " ", "\t", "\f" ] )+ > 
    |
    < WHITESPACE: <INLINE_WHITESPACE> | <LINE_TERMINATOR> >
; 

<DEFAULT> UNPARSED :
    < COMMA: [ ",", ";" ] >
;
<OTHER> TOKEN : 
    <FOO: "foo">
    |
    <BAR: "bar">
;

FooBar : 
    (<FOO>! | <BAR>!)+ 
;

FooBars : 
    (
        "[" LEXICAL_STATE OTHER (FooBar #Other) "]"
    )+
;

#Root : 
    FooBars
    <EOF> {return CURRENT_NODE;}
;

With this input:

[foo]
[foo bar]
[bar foo]
[foo foo]
[foo foo foo foo]
[bar bar bar bar]
[bar bar]
[foo bar]
[bar foo]
[bar]

Mar 17 '25 21:03 adMartem

Now that I have this much simpler test case, I'll probably be able to figure out what is happening, but you probably will know the right way to fix it. Right now I'm trying to clean up and DOCUMENT(!) a new feature I will push. [Teaser] It is called Repetition Cardinality Assertions.

Mar 17 '25 21:03 adMartem

Ok, my simple test case doesn't really reproduce the problem I am seeing. I forgot to change the whitespace to include the OTHER state. When I do, it (accidentally, I think) parses the input and produces the correct tree. But now that I'm tracing almost everything, I see that the real problem (unless I convince myself it is my problem) has to do with the interaction of nextToken (which causes tokens to become unparsed in some cases) and the consumeToken which invokes nextToken when probing future token(s). More later...

Mar 19 '25 20:03 adMartem

Jon, do we really intend for nextToken to change dirty tokens that follow an unparsed token to "unparsed" in lookahead?

Mar 19 '25 21:03 adMartem

Jon, do we really intend for nextToken to change dirty tokens that follow an unparsed token to "unparsed" in lookahead?

To be honest, I really don't know. Probably I have to refamiliarize with the code and try to remember what I was thinking at the time. I'm a bit absorbed in some other things right now, but I'll try to have a look soon. Could you link the exact point in the code where this is happening?

Mar 19 '25 22:03 revusky

https://github.com/congo-cc/congo-parser-generator/blob/ca93147bbcb74ac2e5ca134d8bb5fd80872d4483/src/templates/java/Parser.java.ftl#L152-L162

Mar 20 '25 15:03 adMartem

That, I think, is causing the calls to nextToken during lookahead to turn invalid tokens (beyond the lexical scope in effect) into unparsed tokens, which are then passed over in parsing. I'm not positive, but this is currently my most likely scenario. The failure-mode case, so far, has been intractable by debugging but I will continue to try and isolate it to the smallest possible case. I could easily be misinterpreting it, though. This fault tolerant behavior is very similar to the kind of error recovery encountered in data communication protocols, which can be very hard to analyze and fix (as I alluded to in the previous note I deleted). Anyway, no hurry, it is not holding up anything here.

Mar 20 '25 16:03 adMartem

congo-parser-generator/src/templates/java/Parser.java.ftl

Lines 152 to 162 in ca93147

while (result.isUnparsed()) { #list grammar.parserTokenHooks as methodName result = ${methodName}(result); #endlist result = token_source.getNextToken(result); #if settings.faultTolerant if (result.isInvalid()) { if (isParserTolerant()) { result.setUnparsed(true); } }

Here I am finally looking at this over two months later and I have to admit that I'm confused. I can't remember exactly what I was thinking. I must have been thinkinng that if we're in a fault-tolerant mode, we just treat an invalid token as something ignorable, like whitespace typically is, i.e. unparsed. But I have to admit that I'm confused about the lookahead issue.

Now, we can condition things based on whether we are in lookahead or not pretty easily. We could have something like:

#if settings.faultTolerant 
 if (result.isInvalid()) { 
    if (currentLookaheadToken == null && isParserTolerant()) { 
      result.setUnparsed(true); 
    } 
  }

So, in the above, we only set the token to "unparsed" if we are not in lookahead, i.e. currentLookaheadToken==null. But I don't know if that would resolve your problem! But, of course, if you want to experiment with that. That said, my gut feeling is that this is not really the solution. We have a problem that it is tokenizing ahead without switching the lexical state back, no?

But... this is only a problem in fault-tolerant, right? The code generated for non-fault-tolerant is okay, no?

Well, sorry for the wooly-headed comments. I am just trying to get back into this and need a bit more time to get my head straight. What happens if you turn off that thing completely, like instead of #if settings.faultTolerant, just replace it for now with #if false. That could be something else to try.

Jun 04 '25 18:06 revusky

Hi Jon, it has been awhile, so I'm now as wooly-headed as you were. As I recall, it was really hard for me to get my head around what was going on, and I resorted to extensive tracing to be sure what was happening. I think(?) it only happens in FT mode, and I think my opinion at the time was that it is silly to do this in lookahead, as you are trying to validate a path for the parser to follow, and yet you are allowing it to fabricate things that might not be needed if you just failed the lookahead and followed another path that might actually work without fault tolerant changes. In other words, why are we attempting to "correct" a lookahead without knowing that if this choice is failed, might not a subsequent one succeed?

But I'll try to refresh my memory and review the code to see if the above has any merit. For some reason, this particular area of the code is very difficult to comprehend. Something about it's modality maybe, or just that looking at the template just confuses me.

On Wed, Jun 4, 2025 at 1:22 PM Jonathan Revusky @.***> wrote:

revusky left a comment (congo-cc/congo-parser-generator#203) https://github.com/congo-cc/congo-parser-generator/issues/203#issuecomment-2940976175

congo-parser-generator/src/templates/java/Parser.java.ftl https://github.com/congo-cc/congo-parser-generator/blob/ca93147bbcb74ac2e5ca134d8bb5fd80872d4483/src/templates/java/Parser.java.ftl#L152-L162

Lines 152 to 162 in ca93147 http:///congo-cc/congo-parser-generator/commit/ca93147bbcb74ac2e5ca134d8bb5fd80872d4483

while (result.isUnparsed()) { #list grammar.parserTokenHooks as methodName result = ${methodName}(result); #endlist result = token_source.getNextToken(result); #if settings.faultTolerant if (result.isInvalid()) { if (isParserTolerant()) { result.setUnparsed(true); } }

Here I am finally looking at this over two months later and I have to admit that I'm confused. I can't remember exactly what I was thinking. I must have been thinkinng that if we're in a fault-tolerant mode, we just treat an invalid token as something ignorable, like whitespace typically is, i.e. unparsed. But I have to admit that I'm confused about the lookahead issue.

Now, we can condition things based on whether we are in lookahead or not pretty easily. We could have something like:

#if settings.faultTolerant if (result.isInvalid()) { if (currentLookaheadToken == null && isParserTolerant()) { result.setUnparsed(true); } }

So, in the above, we only set the token to "unparsed" if we are not in lookahead, i.e. currentLookaheadToken==null. But I don't know if that would resolve your problem! But, of course, if you want to experiment with that. That said, my gut feeling is that this is not really the solution. We have a problem that it is tokenizing ahead without switching the lexical state back, no?

But... this is only a problem in fault-tolerant, right? The code generated for non-fault-tolerant is okay, no?

Well, sorry for the wooly-headed comments. I am just trying to get back into this and need a bit more time to get my head straight. What happens if you turn off that thing completely, like instead of #if settings.faultTolerant, just replace it for now with #if false. That could be something else to try.

— Reply to this email directly, view it on GitHub https://github.com/congo-cc/congo-parser-generator/issues/203#issuecomment-2940976175, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZIWIMOVKLFT7NTSHQRDFGD3B42OBAVCNFSM6AAAAABZF2RZKOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDSNBQHE3TMMJXGU . You are receiving this because you were assigned.Message ID: @.***>

Jun 04 '25 21:06 adMartem

After reviewing some of the tests and recalling some things, I'm sure the real question I ended up with was whether or not fault tolerance in lookahead made any sense at all, since deleting or adding anything could result in a false choice being made for which there was a perfectly fine subsequent alternative that was missed. Like looking ahead with distorting glasses.

Jun 12 '25 00:06 adMartem

After reviewing some of the tests and recalling some things, I'm sure the real question I ended up with was whether or not fault tolerance in lookahead made any sense at all, since deleting or adding anything could result in a false choice being made for which there was a perfectly fine subsequent alternative that was missed. Like looking ahead with distorting glasses.

You're probably right. Everything to do with fault-tolerance, if it actually works, it's a minor miracle surely. By the way, did you ever notice this? I saw that some months ago and meant to drop a note to those people, but did not get round to it. According to their account, this fault tolerant stuff in CongoCC did work for them, but I don't really know quite how! Well, if it somehow stumbled through and built an AST from whatever broken code, maybe that was good enough for them. I don't really know.

Regarding that patch I made recently to the ATTEMPT/RECOVER logic, yeah, that is just totally FUBAR! At one point, I wrote token_source.activateTokenTypes when it should have been token_source.activeTokenTypes. I guess the reason it didn't come to my attention is that, since the code was never used, it was getting reaped by the reaper before the compilation step! Otherwise, it wouldn't have even been compilable. So, I obviously cannot object to your fixing that!

Jun 12 '25 18:06 revusky

congo-parser-generator congo-parser-generator copied to clipboard

In fault tolerant mode token-level recovery erroneously extends beyond the bounds of an enclosing lexical state change

congo-parser-generator
congo-parser-generator copied to clipboard