Syntax errors should all generate exceptions, by default
Both times that I've gone to use ANTLR for a project, I've been unsettled that some parse failures throw exceptions (e.g. NoViableAltException) while others don't (and just call syntaxError in the error listeners). Here's what I end up having to do: https://github.com/cavern-social/socialmark/blob/5058649c0cda/src/main/kotlin/org/timmc/socialmark/Parse.kt#L31 -- it's extra code that feels like it should be the default for "hey, gimme a parse tree of this input".
My guess as to why this behavior exists is that some users of ANTLR want error recovery, and wouldn't want this behavior. Maybe even most do? But for the simplest use case, this seems like a major correctness risk, as it would accept invalid inputs in a way that is surprising to the developer. If there are multiple conflicting use cases, I think it makes sense to default to the safe one.
It's easy to find examples of people being surprised by this behavior and having to work around it in the same way. I know this would be a breaking change for those who haven't cleared error listeners and set their own, but I'd appreciate it if you'd consider it for 5.x, or at least include a ThrowingErrorListener.
I can't remember the details off the top of my head, but we try to recover from syntax errors when we can I when there is literally no viable alternative there is nothing we can do but throw an exception. Note that you just have to build an air handler and you can make it do what you want. There is a built in one called Bail... Something or other.
There's a BailErrorStrategy, but it's a strategy rather than a listener, and it isn't invoked for certain kinds of syntax errors.
I've made a tiny runnable example:
grammar Repro;
escape : '\\u' HEX+ ';' EOF;
HEX : [0-9A-F];
Call:
val lexer = ReproLexer(CharStreams.fromString(input))
val parser = ReproParser(CommonTokenStream(lexer))
parser.errorHandler = BailErrorStrategy()
return parser.escape().HEX().joinToString("") { it.text }
@Test fun normal() {
assertEquals(Parse.parseEscape("\\u2022;"), "2022")
}
@Test fun earlyEOF() {
val e = assertFails {
Parse.parseEscape("\\u")
}
assertTrue(e is ParseCancellationException)
}
@Test fun spuriousCharacter() {
val e = assertFails {
Parse.parseEscape("\\u01x34;")
}
}
Results: The tests normal and earlyEOF pass, but spuriousCharacter fails with:
[stderr] line 1:4 token recognition error at: 'x'
java.lang.AssertionError: Expected an exception to be thrown, but was completed successfully.
So to make that test pass and catch that sort of failure, I'd have to make an error listener and set it on the lexer (and maybe parser for good measure? not sure). Ideally, BailErrorStrategy would be the default as well—I would expect that if I create a parser, it throws on bad input, right out of the box.
I really love that ANTLR not only allows you to create parsers really easily, but also allows you to do some pretty advanced stuff. But... I think the default behavior, for someone new to the toolkit, should be the most basic "it parses or it doesn't" without any additional configuration.
Hmm...most people tell me that it should continue parsing by default and I think I agree with that; too late to change now anyway haha.
Just checked and BailErrorStrategy should force bailing in all cases. Maybe an issue with the target language ANTLR lib you're using?
Well, if that's what people want, I can only argue so much for the opposite. :-)
But that's interesting that this is unexpected -- I'd be happy to try some experiments with another target language (maybe Python) to see if it's just a bug in the generated Java parser, somehow.
Java definitely bails for me. Hmm... OH! that's a lexer not parser error. Yeah, lexer doesn't throw exceptions. it just calls
public void recover(LexerNoViableAltException e) {
can't remember. maybe just override that?
Ah, yeah, it happens in the lexer. (Heh, I'm just now seeing that ANTLRErrorStrategy has a comment in its docstring: <p>TODO: what to do about lexers</p>) So when the runtime's Lexer.java encounters that error it first notifies listeners (which is what I'm overriding/implementing in the ThrowingErrorListener I'm using) and then calls recover -- but Lexer.recover doesn't have strategies, it's just hardcoded to skip ahead.
Were you suggesting that this could be changed to use a strategy object, or something else?
You could just subclass the lexer I guess and override recover(). I must have an example somewhere...maybe in book?
I mean, I'm all set as far as my own project goes -- I've got the lexer's error listener list set to just a ThrowingErrorListener, which is working fine in my project, and that's a good deal easier than subclassing the lexer:
class ThrowingErrorListener : ANTLRErrorListener {
override fun syntaxError(...) {
throw Exception(...)
}
...
}
My request was to have that be default behavior. Failing that, it would be good to have some way to unify lexer and parser error handling.
(If that's just not on the table, I get it—but I did want to raise the issue since it seemed like a safety issue.)
I should expand on the safety issue -- basically, if I'm parsing a protocol, I really want to know if there's corruption due to some attack! Or if I'm validating an input in a security-sensitive context, I want to know that anyone else with the same grammar will get the same result (to prevent parser mismatch vulnerabilities). If I've set BailErrorStrategy on the parser then that sure looks like I've done the right thing, but there's this sneaky other error that comes from the lexer and that gets mostly ignored by default. I can run test inputs by my parser and check that it fails some invalid inputs, but if I don't get exactly the right ones, I may never notice that there's a gap in the "don't pass invalid inputs" guard. That's the kind of scenario I want to prevent.
Well the parser and lexer are different objects and different processes potentially. So notifying both objects it's not totally unexpected but we don't have a similar mechanism on the lexer so that is definitely inoptimal.
I'd like to +1 on Tim's report: having lexical errors ignored and continuing is a very bad default strategy. I was just bitten by it when my small converter from one file format to another was "working" in the sense of generating output from input, while input actually contained invalid characters.