logos
Allow lexers to provide more user-friendly errors
At the moment errors are a single, opaque token. Is there a way to have more user-friendly, structured error values, with error recovery, instead of just returning `Token::Error`? It would be really nice to have an example of this too!
What sort of interface would you like to see?
You can use `range` (maybe we should rename it to `span`?) to get the position of the error, and `slice` to get the exact subslice that produced the error, as with every other token. Errors are guaranteed to be at least 1 byte long so that the lexer never produces an infinite loop (`advance` always, well, advances).
Lexers, or parsers, especially for PLs, tend to be exempt from the "fail fast" philosophy since you usually want to accumulate errors and report them in batches to the user. Having error be just another token has the nice benefit in that you can treat errors just as any other unexpected token when writing a parser to AST, which makes it much easier to write, and it tends to have a nice performance profile since you don't have to do special case branching for error handling.
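As a rough illustration of that point (plain Rust, with a hypothetical `Token` enum standing in for a Logos-generated one), an AST-building loop can treat the error token exactly like any other unexpected token and batch up diagnostics:

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Number(i64),
    Plus,
    Error,
}

// Collect diagnostics instead of failing fast: every token the parser
// does not expect, including `Token::Error`, lands in the same batch.
fn parse_sum(tokens: &[Token]) -> (i64, Vec<String>) {
    let mut sum = 0;
    let mut diagnostics = Vec::new();
    for (i, token) in tokens.iter().enumerate() {
        match token {
            Token::Number(n) => sum += n,
            Token::Plus => {}
            // No special branch for lexer errors: same path as any
            // other unexpected token.
            other => diagnostics.push(format!("unexpected {:?} at token {}", other, i)),
        }
    }
    (sum, diagnostics)
}

fn main() {
    let tokens = [Token::Number(1), Token::Plus, Token::Error, Token::Number(2)];
    let (sum, diagnostics) = parse_sum(&tokens);
    println!("{} {}", sum, diagnostics.len());
}
```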
> Lexers, or parsers, especially for PLs, tend to be exempt from the "fail fast" philosophy since you usually want to accumulate errors and report them in batches to the user.
Yeah, this is why I mentioned 'with error recovery' - I didn't see this mentioned in the docs. If Logos supports this then that's great!
What I really want is better information as to why an error occurred, so that I can produce nice, user-friendly errors using my library, codespan-reporting.
At the moment Pikelet has a bunch of lexer errors on the master branch: https://github.com/pikelet-lang/pikelet/blob/782a8853cbaf0c50ec668ef55df798e30cea0ef6/crates/pikelet-concrete/src/parse/lexer.rs#L43-L69
In contrast, at the moment this is what I am forced to output on the next branch (which uses Logos): https://github.com/pikelet-lang/pikelet/blob/0568c6ada6cc1937774cac15372f3acde40793c9/pikelet/src/surface/lexer.rs#L174
Ah, gotcha, so what you're missing is which token Logos was trying to produce when it failed. I'll have to think about how best to do that, especially now that stateful enums have landed.
Speaking of, with 0.11 you can implement `Logos` directly on your `Token<'a>`:
https://github.com/maciejhirsz/logos/blob/4005d707e4b79dbf73b29bf572008bd81551a6dd/tests/tests/callbacks.rs#L8-L14
And there is also a spanned iterator now, which should also play nicer with what you are doing there: https://github.com/maciejhirsz/logos/blob/4005d707e4b79dbf73b29bf572008bd81551a6dd/tests/tests/simple.rs#L228-L235
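For context, here is a sketch of how a spanned stream plugs into diagnostic reporting (plain Rust; the `(Token, Range<usize>)` pairs stand in for what a spanned lexer iterator would yield, and the formatting is just a stand-in for a real reporter like codespan-reporting):

```rust
use std::ops::Range;

#[derive(Debug, PartialEq)]
enum Token {
    Ident,
    Error,
}

// Turn each error token's span back into the offending source slice,
// which is the information a diagnostic reporter needs.
fn report_errors(source: &str, tokens: &[(Token, Range<usize>)]) -> Vec<String> {
    tokens
        .iter()
        .filter(|(token, _)| *token == Token::Error)
        .map(|(_, span)| {
            format!("unexpected `{}` at {}..{}", &source[span.clone()], span.start, span.end)
        })
        .collect()
}

fn main() {
    let source = "abc § def";
    // Hypothetical lexer output: `§` (2 bytes in UTF-8) produced an error.
    let tokens = vec![(Token::Ident, 0..3), (Token::Error, 4..6), (Token::Ident, 7..10)];
    for line in report_errors(source, &tokens) {
        println!("{}", line);
    }
}
```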
Oh super nice, thanks! 😍
I'd also love to see this. Just letting the error variant be able to store something like an enum would be great.
We could use the `Err` variant of a `Result` from a parsing function and just store that in the error.
Example:
```rust
#[derive(Logos)]
enum Parts<'a> {
    #[regex(".+", handle_text)]
    Text(&'a str),

    #[error]
    Error(ParsingError<'a>),
}

enum ParsingError<'a> {
    InvalidCharacter(&'a str),
}

fn handle_text<'a>(lex: &mut Lexer<'a, Parts<'a>>) -> Result<&'a str, ParsingError<'a>> {
    let text = lex.slice();
    if text.contains('§') {
        return Err(ParsingError::InvalidCharacter(
            "The character § can't be in the text!",
        ));
    }
    Ok(text)
}
```
The lifetimes may be messed up a bit, but I think you can understand what it is supposed to do.
@nnt0 yeah, I think I can make that work. There needs to be some way to create `ParsingError` in this example for lexing errors (as opposed to callback errors).
Two options just off the top of my head:
- Require trait bound `From<&'source str>`, so the default error can be built from the slice of the invalid token.
- Require trait bound `FromLexer<'source, str>` (might need a different signature) that takes `&mut Lexer<'source, Parts<'source>>` so you can do whatever.

Option 1 would be easier to handle, while 2 would be more flexible.
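A sketch of what option 1 could look like from the user's side, in plain Rust (the `ParsingError` shape loosely follows the example above, but with an owned `String` so it stands alone; the bound itself is a proposal, not an existing Logos API):

```rust
#[derive(Debug, PartialEq)]
enum ParsingError {
    InvalidCharacter(String),
}

// Option 1: the default lexing error is built from the slice of the
// invalid token via a plain `From` impl the user provides.
impl From<&str> for ParsingError {
    fn from(slice: &str) -> Self {
        ParsingError::InvalidCharacter(format!("unexpected input: {:?}", slice))
    }
}

fn main() {
    // What the generated lexer would do internally on an error,
    // under this scheme.
    let err: ParsingError = ParsingError::from("§");
    println!("{:?}", err);
}
```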
@maciejhirsz I think I'd prefer option 2 here because you'd have access to `Span`. That way it's possible to report exactly where the error occurred. With this you could just jump to the index instead of searching through the text. Of course there are more benefits, but that's just what I thought about first.
I also thought about making this opt-in for the folks that just want `Error` and not `Error(ParsingError)`. That would also prevent people from having to change their code when updating. But I'm not sure how to implement that.
Oh yeah, this will only do anything if you actually specify a field in the error variant.
I reckon having `#[error] Error(&'a str)` might be a common enough case, but that can be done by providing an `impl<'source> From<&'source str> for &'source str { ... }` implementation (plus `String`, `Box<str>`, `Cow<'source, str>`, etc.).
Callbacks that return a `Result<_, E>` would have to satisfy a bound `E: Into<T>`, where `T` is the field of the error variant (again, only if one was provided).
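In plain Rust, the proposed `E: Into<T>` bound would behave like this (all names here are hypothetical stand-ins; `run_callback` mimics what the generated lexer would do with a callback's `Result`):

```rust
#[derive(Debug, PartialEq)]
struct LexError(String);

#[derive(Debug, PartialEq)]
enum CallbackError {
    BadDigit,
}

// The callback's error type converts into the error variant's field
// type `T` (here `LexError`) through an ordinary From/Into bound.
impl From<CallbackError> for LexError {
    fn from(err: CallbackError) -> Self {
        LexError(format!("callback failed: {:?}", err))
    }
}

// Stand-in for what the generated lexer would do with a callback's
// `Result<_, E>` where `E: Into<T>`.
fn run_callback<T, E: Into<T>>(result: Result<u32, E>) -> Result<u32, T> {
    result.map_err(Into::into)
}

fn main() {
    let lexed: Result<u32, LexError> = run_callback(Err(CallbackError::BadDigit));
    println!("{:?}", lexed);
}
```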
One question this brings up is whether there is any thought to expanding `#[error]` to accept a callback like `#[error(|lex| ...)]`, and whether that callback could be accessed from `lex` within `handle_text` above?
That would at least seem somewhat consistent with how `regex` tokens manage construction with parameters.
@ratmice yeah, that would actually be a good way to supply the default constructor for the error type; might be better than using a trait.
Following up on the discussion from #135, and expanding on the suggestion above, it might be useful to have an error callback for regexes that returns the error type:

```rust
#[derive(Logos)]
enum Token {
    #[regex("some_complex_regex", error = |_| MyErrorType::InvalidComplexMatch)]
    ComplexMatch,

    #[error(|_| MyErrorType::default())]
    Error(MyErrorType),
}
```
I suggest being able to add some special token to the regexes that triggers an error branch. Something like `#[regex(r"foobar(bat|\!)")]`. This is entirely off the top of my head, but the point is that there are cases where you can say "if you've reached this point, I know you were trying to write a number, but you did it wrong".
> I suggest being able to add some special token to the regexes that triggers an error branch.
You only need that in places where the lexer can backtrack, which can only happen inside a `(...)?` or `(...)*` group. I just did some more reading on regex, and there is actually already syntax we can use for this: atomic groups:

```rust
#[regex(r"[0-9]+(?>\.[0-9])?")]
```
This would make it so that for the input `123.foo`, once `123.` is read, not matching a digit on the next byte (`f`) would produce an error instead of backtracking and returning a match on `123`.
I believe `regex-syntax` already supports the atomic group flag, I just ignore it due to, well, ignorance :)
An important thing to note, though, is that if you just turn this into the `#[error]` variant, I would have to parse it anew myself to figure out which (of all my known error cases) actually happened. I.e., you're throwing away information I need.
This is assuming a stateful error variant, so you can use callbacks to put the required state inside it.
That said, it might also make sense to get rid of the `#[error]` variant altogether, similarly to how `#[end]` got removed once we changed things into an iterator.
With a C-style `enum Token` and a 0-sized `LexerError` default, `Option<Result<Token, LexerError>>` is still only one byte. Potential syntax for custom errors could then look as follows:
```rust
#[derive(Logos)]
#[logos(error = MyError)]
enum Token {
    #[regex(r"[0-9]+(?>\.[0-9]+)?", error = |_| MyError::JunkAfterPeriod)]
    Number,
}
```
This would be way more idiomatic, and would allow collecting the iterator into a `Result` like so:

```rust
let tokens: Vec<Token> = Token::lexer("3.14").collect::<Result<_, _>>()?;
```
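That pattern works because `Result` implements `FromIterator`: collecting an iterator of `Result`s into a `Result<Vec<_>, _>` short-circuits on the first `Err`. A minimal demonstration in plain Rust, without Logos:

```rust
fn main() {
    // The first Err short-circuits the collection, mirroring how
    // collecting a fallible token stream into a Result would behave.
    let ok: Result<Vec<i32>, String> = vec![Ok(1), Ok(2)].into_iter().collect();
    let err: Result<Vec<i32>, String> =
        vec![Ok(1), Err("bad".to_string())].into_iter().collect();
    println!("{:?} {:?}", ok, err);
}
```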
How would you deal with two errors in the same case? For instance, the number case I was converting dealt with junk after the period, and also after `e` (in `1e!`, for instance, the `!` is the error). Just split them up into multiple `regex`-es on the same case, but with different error statements?
> Just split them up into multiple `regex`-es on the same case, but with different error statements?
That, or read the `lex.slice()` in the error callback and emit a different error accordingly. Might have to do some manual parsing there, but that should be fine in the error path.
I don't really agree on the "but that should be fine in the error path", because logos has already determined that we arrived at this error path. You're essentially asking users of the library to redo the work the library just did. You should be able to get some information out of the `Lexer` about why you arrived in this error callback. Maybe some index of which atomic group failed, for instance (and at what index in the source)?
> I don't really agree on the "but that should be fine in error path", because logos has already determined that we arrived at this error path.
Let me rephrase that: it's fine to sacrifice error path performance for happy path performance, because error path is by definition rare.
> You should be able to get some information out of the Lexer of why you arrived in this error callback.

Except `Lexer` has, by design, absolutely no context for why it's doing anything it's doing. All it knows is at what index it started lexing a token, and at what index it currently is. Everything else is just munching through CPU instructions compiled from a state machine graph compiled from token definitions.
The options are:
- Track extra information at runtime, which means sacrificing happy path performance for error path performance, which is almost certainly the wrong approach.
- Inject some extra context into the callback statically at compile time (index of the atomic group would be possible), which is better, but it means that every callback has to accept two parameters now (LLVM will optimize it away if it's not used, but it might be awkward on the API).
- One match expression:

```rust
match lex.slice().bytes().next_back() {
    Some(b'.') => Error::JunkAfterPeriod,
    Some(b'e' | b'E') => Error::JunkAfterExponent,
    _ => Error::default(),
}
```
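That match, pulled out into a standalone function over the matched slice so it can run on its own (error names follow the discussion; `Error::default()` is replaced with an explicit `Other` variant so the sketch is self-contained):

```rust
#[derive(Debug, PartialEq)]
enum Error {
    JunkAfterPeriod,
    JunkAfterExponent,
    Other,
}

// Recover which part of the number failed by inspecting the last byte
// of the slice the lexer had matched before erroring.
fn classify(slice: &str) -> Error {
    match slice.bytes().next_back() {
        Some(b'.') => Error::JunkAfterPeriod,
        Some(b'e' | b'E') => Error::JunkAfterExponent,
        _ => Error::Other,
    }
}

fn main() {
    println!("{:?}", classify("123."));
    println!("{:?}", classify("1e"));
}
```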