antlr4
antlr4 copied to clipboard
Addition of unrelated, unused parser rule to grammar causes different parsing results
This is related to a SO question https://stackoverflow.com/q/73522066/4779853
The grammar is poorly written as an EOF-start rule, but that is irrelevant because parsing results should always be consistent when a completely unrelated, unused parser rule is added or deleted. Therefore there is something wrong in the runtime.
grammar my;
st: 'H' hd | EOF ;
hd: 'D' d | 'C' c | st ;
d: hd ;
c: 'D' c | hd ;
s1: 'D' s1 | c ;
// p: hd ;
SKP: [ \t\r\n]+ -> skip;
Input
H C D C C D
Commented rule p
causes parser to work one way, uncomment rule p
works another way. It is illogical for a rule that is unused to influence a parse because it's never used in a derivation. It must be investigated.
It looks like the similar bug was already reported: https://github.com/antlr/antlr4/issues/2179 But it's about weird error instead of different parsing result.
A random unused rule can cause FOLLOW sets to be different if I recall. Yeah, this grammar is pretty messed up. The issue relates to whether or not lookahead "falls off the edge" ever. Because of the tangle of rules (ref to hd
), there is no rule w/o a follow set. adding p
means adding a place where hd
can be followed by nothing. W/o swapping in more knowledge from cold storage I can't explain any better.
Ah. i think it's the LL(1) analyzer. See comments on _LOOK()
. EOF can be added to FOLLOW set depending on hitting end of a rule. Not saying code is correct exactly but I can't remember enough to resolve.
This is legal in Antlr. What does it mean?
grammar T3;
s : 'a' EOF 'c' | 'b' ;
I'm not sure whether it's a good idea to allow rules with EOF that
- are in one alt and missing in another;
- or allow symbols to follow an EOF.
It's not clear what these mean, and Antlr has a problem in grammar
my
where commented and uncommented rulep
yields two different trees.
If it is allowed, it must be crystal clear what they mean. (I have an xpath that should flag such rules, but trxgrep has a bug that I now have to fix.)
s : 'a' EOF 'c' | 'b' ;
I think such code should be disallowed. EOF
might be placed only at the end position of the rule. EOF_CLOSURE
error should be extended for covering such cases.
True, but then you'd also want to disable combination of rules that have an EOF not at the end. Tricky imho...
are in one alt and missing in another;
~~Why do you think it should be disallowed? It's quite real code.~~
Yes, it looks like it also doesn't make sense. If it's encountered, the grammar should be rewritten to prevent potential weird behaviour and errors.
or allow symbols to follow an EOF.
Agree with that. It's nonsense since the part after EOF
is unreachable.
True, but then you'd also want to disable combination of rules that have an EOF not at the end. Tricky imho...
At least not ending EOF
can be disallowed within one rule, it's easy. Detecting not ending EOF
within the entire grammar is also possible. To implement it, containsEOF
flag should be considered for every rule. By the way, I've implemented the similar warnings but for beginning position: PREDICATE_AT_THE_BEGINNING_OF_LEXER_RULE_DEGRADES_PERFORMANCE
.
I suggest introducing the new errors together with warnings from https://github.com/antlr/antlr4/pull/3626, they are processed in the same way.
https://github.com/antlr/antlr4/discussions/3763 -- we could just take it out of the hands of users. :)
Hi guys...any change to the secret sauce parsing algorithm would require me to fully understand it again. Not sure I am willing to dig deep enough. Changing the grammar semantics for EOF is less risky but would likely break backward compatibility. Also I've just never had a problem putting EOF on end of a start rule or leaving it off so I can parse partial phrases.
I'd rather we focus on larger issues than grammar warnings that I might find too risky to merge. Maybe help with antlr4-lab?
Some good news.
- There are no rules with EOF followed by another element in grammars-v4.
- There are a few grammars where an EOF occurs on one alt but not in another. I'll take a look at these.
./cool/COOL.g4 has an EOF in one alt, but not in another.
./csharp/Antlr4cs/CSharpPreprocessorParser.g4 has an EOF in one alt, but not in another.
./csharp/CSharp/CSharpPreprocessorParser.g4 has an EOF in one alt, but not in another.
./csharp/CSharpPreprocessorParser.g4 has an EOF in one alt, but not in another.
./golang/Go/GoParser.g4 has an EOF in one alt, but not in another.
./golang/GoParser.g4 has an EOF in one alt, but not in another.
./hypertalk/HyperTalk.g4 has an EOF in one alt, but not in another.
./java/java/JavaParser.g4 has an EOF in one alt, but not in another.
./javadoc/JavadocParser.g4 has an EOF in one alt, but not in another.
./javascript/ecmascript/CSharp/ECMAScript.g4 has an EOF in one alt, but not in another.
./javascript/ecmascript/CSharpSharwell/ECMAScript.g4 has an EOF in one alt, but not in another.
./javascript/ecmascript/ECMAScript.g4 has an EOF in one alt, but not in another.
./javascript/ecmascript/Go/ECMAScript.g4 has an EOF in one alt, but not in another.
./javascript/ecmascript/JavaScript/ECMAScript.g4 has an EOF in one alt, but not in another.
./javascript/ecmascript/Python/ECMAScript.g4 has an EOF in one alt, but not in another.
./javascript/ecmascript/TypeScript/ECMAScript.g4 has an EOF in one alt, but not in another.
./javascript/javascript/JavaScriptParser.g4 has an EOF in one alt, but not in another.
./javascript/jsx/JavaScriptParser.g4 has an EOF in one alt, but not in another.
./javascript/typescript/TypeScriptParser.g4 has an EOF in one alt, but not in another.
./kirikiri-tjs/TJSParser.g4 has an EOF in one alt, but not in another.
./kotlin/kotlin-formal/KotlinParser.g4 has an EOF in one alt, but not in another.
./objc/two-step-processing/ObjectiveCPreprocessorParser.g4 has an EOF in one alt, but not in another.
./save.generated/CSharpPreprocessorParser.g4 has an EOF in one alt, but not in another.
./save.generated/Test/CSharpPreprocessorParser.g4 has an EOF in one alt, but not in another.
./smtlibv2/SMTLIBv2.g4 has an EOF in one alt, but not in another.
./sql/tsql/TSqlParser.g4 has an EOF in one alt, but not in another.
./v/V.g4 has an EOF in one alt, but not in another.
./wat/WatParser.g4 has an EOF in one alt, but not in another.
- Both of these conditions are really easy to find with Trash. Here's a Bash script:
#!/usr/bin/bash
for i in `find . -name '*.g4' | grep -v Generated | grep -v examples | grep -v Lexer`
do
count=`trparse $i -t antlr4 2> /dev/null \
| trxgrep ' //parserRuleSpec//alternative/element[.//TOKEN_REF/text()="EOF"]/following-sibling::element' \
| trtext -c`
if [ "$count" != "0" ]
then
echo $i has an EOF usage followed by another element.
fi
count=`trparse $i -t antlr4 2> /dev/null \
| trxgrep ' //labeledAlt[.//TOKEN_REF/text()="EOF" and count(../labeledAlt) > 1]' \
| trtext -c`
if [ "$count" != "0" ]
then
echo $i has an EOF in one alt, but not in another.
fi
done