chumsky icon indicating copy to clipboard operation
chumsky copied to clipboard

controlling choice/or errors

Open faassen opened this issue 10 months ago • 5 comments

I just spent time upgrading from alpha 6 to alpha 8. After I solved the new type puzzles, this made the massive test suite in my project to run 20k tests from 20 seconds to 12 seconds so something definitely got a lot faster!

But then 3 tests failed where they did not before. I think I got hit by this (from the documentation of the .or() combinator):

If both parsers produce errors, the combinator will attempt to select from or combine the errors to produce an error that is most likely to be useful to a human attempting to understand the problem. The exact algorithm used is left unspecified, and is not part of the crate’s semver guarantees, although regressions in error quality should be reported in the issue tracker of the main repository.

I switched to choice in the hope that this would properly prioritize the error to the behavior in alpha 6, but no such luck.

The tests that are failing are basically testing whether the right error message is returned. In the past, the resolution of or would produce the right error, but now it doesn't anymore.

I can't just give up and accept the new error, as this error behavior is defined by an external conformance test suite. (though I'm at present unable to find anything in the specification it's based on that this is the only correct parser behavior)

I tried something like the following:

priority_failure.or(choice(others))

but that didn't make any difference. Is there some way to control which error is going to be picked? I think just picking the first error in the choice/or chain might be the behavior I want.

faassen avatar Mar 17 '25 15:03 faassen

It might be useful if you show the error you want/had before along with the error you have right now. The logic involved in error prioritisation is more nuanced than simply picking one of the two errors.

zesterer avatar Mar 17 '25 15:03 zesterer

It's a real world parser of XPath so I haven't done the work to isolate it. Radically simplified, there's a parser of:

item() (an item token, ( and ) tokens) parsing a name (fails if not in lookup table)

When presented with item( this fails both parsers (and a lot of other ones too). Previously it would result in an expected/found, but now it results in a custom failed lookup error

faassen avatar Mar 17 '25 18:03 faassen

Now that we've open sourced this I can provide more information, though this not yet isolated into a nice test case so this report is just a step towards that.

After going from a6 to a8, a different error that is wrong according to the spec started to appear in two different places in the parser. One is rather deep in the weeds, the other in a parser for just the xpath type which is still sizeable but it's already partially isolated:

https://github.com/Paligo/xee/blob/main/xee-xpath-ast/src/parser/xpath_type.rs#L151

It also happened with a chained or; changing it to choice made no difference.

(in fact I'm not sure it happens in the isolated case by itself - the error is embedded in the larger grammar as part of the 'treat as' expression...)

On an expression "3 treat as item(" which doesn't terminate correctly (there should be an additional ")" at the end), previously the correct error code was returned with a6, but with a8 it starts giving the error code associated with another sub-parser (that also fails).

The error code that's specified is XPST0003, which basically an basic "expected found" from Chumsky. But now it started failing with XPST0051: unknown type. The item_type_atomic_or_union parser return an ParserError::UnknownType, I think because item does not exist as a type. And this now shadows the correct parser error.

There are also some map expressions that also start getting the wrong error elsewhere in the parser but this one I've analyzed the most.

We're tracking this here as a Xee issue: https://github.com/Paligo/xee/issues/35

faassen avatar Mar 28 '25 11:03 faassen

Hmmm. I've generally avoided making specific promises about exactly how chumsky generates errors to avoid boxing the library into a design that limits improvements to error generation in the future. I'd like to avoid starting down that slippery slope if possible, and I would recommend trying to avoid relying on the behaviour of error generation (other than the binary existence of an error) for correctness.

You said it worked previously: are you in a position to binary-search through recent commits until you find the one that introduced the problem?

zesterer avatar Mar 28 '25 13:03 zesterer

It was the update from chumsky a6 to a8 that introduced the issue. Since various APIs broke I had to fix I don't think there is a way for me to binary search through the combination.

faassen avatar Mar 29 '25 09:03 faassen