antlr4 Addition of unrelated, unused parser rule to grammar causes different parsing results

Addition of unrelated, unused parser rule to grammar causes different parsing results

Open kaby76 opened this issue 1 year ago • 11 comments

This is related to a SO question https://stackoverflow.com/q/73522066/4779853

The grammar is poorly written as an EOF-start rule, but that is irrelevant because parsing results should always be consistent when a completely unrelated, unused parser rule is added or deleted. Therefore there is something wrong in the runtime.

grammar my; 
st: 'H' hd | EOF ;
hd: 'D' d | 'C' c | st ;
d: hd ;
c: 'D' c | hd ;
s1: 'D' s1 | c ;
// p: hd ;
SKP: [ \t\r\n]+ -> skip;

Input H C D C C D

Commented rule p causes parser to work one way, uncomment rule p works another way. It is illogical for a rule that is unused to influence a parse because it's never used in a derivation. It must be investigated.

Aug 31 '22 00:08 kaby76

It looks like the similar bug was already reported: https://github.com/antlr/antlr4/issues/2179 But it's about weird error instead of different parsing result.

Aug 31 '22 14:08 KvanTTT

A random unused rule can cause FOLLOW sets to be different if I recall. Yeah, this grammar is pretty messed up. The issue relates to whether or not lookahead "falls off the edge" ever. Because of the tangle of rules (ref to hd), there is no rule w/o a follow set. adding p means adding a place where hd can be followed by nothing. W/o swapping in more knowledge from cold storage I can't explain any better.

Aug 31 '22 23:08 parrt

Ah. i think it's the LL(1) analyzer. See comments on _LOOK(). EOF can be added to FOLLOW set depending on hitting end of a rule. Not saying code is correct exactly but I can't remember enough to resolve.

Sep 01 '22 00:09 parrt

This is legal in Antlr. What does it mean?

grammar T3;
s : 'a' EOF 'c' | 'b' ;

I'm not sure whether it's a good idea to allow rules with EOF that

are in one alt and missing in another;
or allow symbols to follow an EOF. It's not clear what these mean, and Antlr has a problem in grammar my where commented and uncommented rule p yields two different trees.

If it is allowed, it must be crystal clear what they mean. (I have an xpath that should flag such rules, but trxgrep has a bug that I now have to fix.)

Sep 01 '22 12:09 kaby76

s : 'a' EOF 'c' | 'b' ;

I think such code should be disallowed. EOF might be placed only at the end position of the rule. EOF_CLOSURE error should be extended for covering such cases.

Sep 01 '22 14:09 KvanTTT

True, but then you'd also want to disable combination of rules that have an EOF not at the end. Tricky imho...

Sep 01 '22 14:09 ericvergnaud

are in one alt and missing in another;

~~Why do you think it should be disallowed? It's quite real code.~~

Yes, it looks like it also doesn't make sense. If it's encountered, the grammar should be rewritten to prevent potential weird behaviour and errors.

or allow symbols to follow an EOF.

Agree with that. It's nonsense since the part after EOF is unreachable.

True, but then you'd also want to disable combination of rules that have an EOF not at the end. Tricky imho...

At least not ending EOF can be disallowed within one rule, it's easy. Detecting not ending EOF within the entire grammar is also possible. To implement it, containsEOF flag should be considered for every rule. By the way, I've implemented the similar warnings but for beginning position: PREDICATE_AT_THE_BEGINNING_OF_LEXER_RULE_DEGRADES_PERFORMANCE.

I suggest introducing the new errors together with warnings from https://github.com/antlr/antlr4/pull/3626, they are processed in the same way.

Sep 01 '22 14:09 KvanTTT

https://github.com/antlr/antlr4/discussions/3763 -- we could just take it out of the hands of users. :)

Sep 01 '22 15:09 kaby76

Hi guys...any change to the secret sauce parsing algorithm would require me to fully understand it again. Not sure I am willing to dig deep enough. Changing the grammar semantics for EOF is less risky but would likely break backward compatibility. Also I've just never had a problem putting EOF on end of a start rule or leaving it off so I can parse partial phrases.

Sep 01 '22 19:09 parrt

I'd rather we focus on larger issues than grammar warnings that I might find too risky to merge. Maybe help with antlr4-lab?

Sep 01 '22 19:09 parrt

Some good news.

There are no rules with EOF followed by another element in grammars-v4.
There are a few grammars where an EOF occurs on one alt but not in another. I'll take a look at these.

./cool/COOL.g4 has an EOF in one alt, but not in another.
./csharp/Antlr4cs/CSharpPreprocessorParser.g4 has an EOF in one alt, but not in another.
./csharp/CSharp/CSharpPreprocessorParser.g4 has an EOF in one alt, but not in another.
./csharp/CSharpPreprocessorParser.g4 has an EOF in one alt, but not in another.
./golang/Go/GoParser.g4 has an EOF in one alt, but not in another.
./golang/GoParser.g4 has an EOF in one alt, but not in another.
./hypertalk/HyperTalk.g4 has an EOF in one alt, but not in another.
./java/java/JavaParser.g4 has an EOF in one alt, but not in another.
./javadoc/JavadocParser.g4 has an EOF in one alt, but not in another.
./javascript/ecmascript/CSharp/ECMAScript.g4 has an EOF in one alt, but not in another.
./javascript/ecmascript/CSharpSharwell/ECMAScript.g4 has an EOF in one alt, but not in another.
./javascript/ecmascript/ECMAScript.g4 has an EOF in one alt, but not in another.
./javascript/ecmascript/Go/ECMAScript.g4 has an EOF in one alt, but not in another.
./javascript/ecmascript/JavaScript/ECMAScript.g4 has an EOF in one alt, but not in another.
./javascript/ecmascript/Python/ECMAScript.g4 has an EOF in one alt, but not in another.
./javascript/ecmascript/TypeScript/ECMAScript.g4 has an EOF in one alt, but not in another.
./javascript/javascript/JavaScriptParser.g4 has an EOF in one alt, but not in another.
./javascript/jsx/JavaScriptParser.g4 has an EOF in one alt, but not in another.
./javascript/typescript/TypeScriptParser.g4 has an EOF in one alt, but not in another.
./kirikiri-tjs/TJSParser.g4 has an EOF in one alt, but not in another.
./kotlin/kotlin-formal/KotlinParser.g4 has an EOF in one alt, but not in another.
./objc/two-step-processing/ObjectiveCPreprocessorParser.g4 has an EOF in one alt, but not in another.
./save.generated/CSharpPreprocessorParser.g4 has an EOF in one alt, but not in another.
./save.generated/Test/CSharpPreprocessorParser.g4 has an EOF in one alt, but not in another.
./smtlibv2/SMTLIBv2.g4 has an EOF in one alt, but not in another.
./sql/tsql/TSqlParser.g4 has an EOF in one alt, but not in another.
./v/V.g4 has an EOF in one alt, but not in another.
./wat/WatParser.g4 has an EOF in one alt, but not in another.

Both of these conditions are really easy to find with Trash. Here's a Bash script:

#!/usr/bin/bash
for i in `find . -name '*.g4' | grep -v Generated | grep -v examples | grep -v Lexer`
do
 count=`trparse $i -t antlr4 2> /dev/null \
  | trxgrep ' //parserRuleSpec//alternative/element[.//TOKEN_REF/text()="EOF"]/following-sibling::element' \
  | trtext -c`
 if [ "$count" != "0" ]
 then
  echo $i has an EOF usage followed by another element.
 fi
 count=`trparse $i -t antlr4 2> /dev/null \
  | trxgrep ' //labeledAlt[.//TOKEN_REF/text()="EOF" and count(../labeledAlt) > 1]' \
  | trtext -c`
 if [ "$count" != "0" ]
 then
  echo $i has an EOF in one alt, but not in another.
 fi
done

Sep 01 '22 20:09 kaby76

antlr4 antlr4 copied to clipboard

Addition of unrelated, unused parser rule to grammar causes different parsing results

antlr4
antlr4 copied to clipboard