
Allow lexers to provide more user-friendly errors

Open brendanzab opened this issue 4 years ago • 19 comments

At the moment errors are a single, opaque token. Is there a way to have more user-friendly, structured error values, with error recovery, instead of just returning Token::Error? It would be really nice to have an example of this too!

brendanzab avatar Apr 04 '20 05:04 brendanzab

What sort of interface would you like to see?

You can use range (maybe we should rename it to span?) to get the position of the error, and slice to get the exact subslice that produced the error, as with every other token. Errors are guaranteed to be at least 1 byte long so that the lexer never produces an infinite loop (advance always, well, advances).

Lexers and parsers, especially for PLs, tend to be exempt from the "fail fast" philosophy, since you usually want to accumulate errors and report them in batches to the user. Having the error be just another token has the nice benefit that you can treat errors like any other unexpected token when writing a parser to AST, which makes the parser much easier to write, and it tends to have a nice performance profile since you don't need special-case branching for error handling.
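As a minimal sketch of that pattern (the token type and parser shape here are hypothetical, not Logos API):

enum Token {
    Number,
    Plus,
    Error,
}

// A recovering parser: diagnostics are accumulated and parsing continues,
// rather than bailing on the first problem.
fn parse_number(tokens: &mut impl Iterator<Item = Token>, errors: &mut Vec<String>) {
    match tokens.next() {
        Some(Token::Number) => { /* build an AST node */ }
        // Token::Error needs no dedicated branch: it falls into the same
        // "unexpected token" arm as any other mismatch.
        Some(_) => errors.push("expected a number".into()),
        None => errors.push("unexpected end of input".into()),
    }
}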

maciejhirsz avatar Apr 04 '20 06:04 maciejhirsz

Lexers and parsers, especially for PLs, tend to be exempt from the "fail fast" philosophy, since you usually want to accumulate errors and report them in batches to the user.

Yeah, this is why I mentioned 'with error recovery' - I didn't see this mentioned in the docs. If Logos supports this then that's great!

I more want better information as to why an error occurred, so that I can produce nice, user-friendly errors using my library, codespan-reporting.

At the moment Pikelet has a bunch of lexer errors on the master branch: https://github.com/pikelet-lang/pikelet/blob/782a8853cbaf0c50ec668ef55df798e30cea0ef6/crates/pikelet-concrete/src/parse/lexer.rs#L43-L69

In contrast, at the moment this is what I am forced to output on the next branch (which uses Logos): https://github.com/pikelet-lang/pikelet/blob/0568c6ada6cc1937774cac15372f3acde40793c9/pikelet/src/surface/lexer.rs#L174

brendanzab avatar Apr 07 '20 00:04 brendanzab

Ah, gotcha, so what you're missing is which token Logos was expecting to produce but failed to.

I'll have to think how to do that best, especially now that stateful enums have landed.

Speaking of, with 0.11 you can implement Logos directly on your Token<'a>: https://github.com/maciejhirsz/logos/blob/4005d707e4b79dbf73b29bf572008bd81551a6dd/tests/tests/callbacks.rs#L8-L14

And there is also a spanned iterator now, which should also play nicer with what you are doing there: https://github.com/maciejhirsz/logos/blob/4005d707e4b79dbf73b29bf572008bd81551a6dd/tests/tests/simple.rs#L228-L235
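For reference, a minimal sketch of the spanned iterator, assuming 0.11-era attribute syntax (the token definitions are made up):

use logos::Logos;

#[derive(Logos, Debug)]
enum Token {
    #[token("let")]
    Let,

    #[error]
    Error,
}

fn main() {
    // spanned() yields (Token, Range<usize>), so even Error tokens come
    // with the byte range a diagnostics library needs.
    for (token, span) in Token::lexer("let §").spanned() {
        println!("{:?} at {:?}", token, span);
    }
}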

maciejhirsz avatar Apr 07 '20 06:04 maciejhirsz

Oh super nice, thanks! 😍

brendanzab avatar Apr 07 '20 08:04 brendanzab

I'd also love to see this. Just letting the error variant store something like an enum would be great.

We could use the Err variant of a Result from a parsing function and just store that in the error.

Example:

use logos::{Logos, Lexer};

#[derive(Logos)]
enum Parts<'a> {
    #[regex(".+", handle_text)]
    Text(&'a str),

    // Proposed extension: an #[error] variant that carries data.
    #[error]
    Error(ParsingError<'a>),
}

enum ParsingError<'a> {
    InvalidCharacter(&'a str),
}

fn handle_text<'a>(lex: &mut Lexer<Parts<'a>>) -> Result<&'a str, ParsingError<'a>> {
    let text = lex.slice();
    if text.contains("§") {
        return Err(ParsingError::InvalidCharacter("The character § can't be in the text!"));
    }

    Ok(text)
}

The lifetimes may be messed up a bit, but I think you can understand what it is supposed to do.

nnt0 avatar Apr 20 '20 10:04 nnt0

@nnt0 yeah, I think I can make that work. There needs to be some way to create ParsingError in this example for lexing errors (as opposed to callback errors).

Two options just off the top of my head:

  1. Require a trait bound From<&'source str>, so the default error can be built from the slice of the invalid token.
  2. Require a trait bound FromLexer<'source, str> (might need a different signature) that takes &mut Lexer<'source, Parts<'source>>, so you can do whatever you need.

Option 1 would be easier to handle, while option 2 would be more flexible.
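To make option 1 concrete, a rough sketch of the impl a user might write (ParsingError here is a made-up stand-in, not anything Logos ships):

struct ParsingError<'source> {
    // The exact slice that failed to lex.
    invalid: &'source str,
}

// Option 1's bound: the error type knows how to build itself from the
// offending slice.
impl<'source> From<&'source str> for ParsingError<'source> {
    fn from(slice: &'source str) -> Self {
        ParsingError { invalid: slice }
    }
}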

maciejhirsz avatar Apr 20 '20 10:04 maciejhirsz

@maciejhirsz I think I'd prefer option 2 here because you'd have access to the Span. That way it's possible to report exactly where the error occurred. With this you could just jump to the index instead of searching through the text.

Of course there are more benefits, but that's just what I thought of first.

I also thought about making this opt-in for the folks that just want Error and not Error(ParsingError). That would also prevent people from having to change their code when updating. But I'm not sure how to implement that.

nnt0 avatar Apr 20 '20 11:04 nnt0

Oh yeah, this will only do anything if you actually specify a field in the error variant.

I reckon having #[error] Error(&'a str) might be a common enough case, but that can be done by providing an impl<'source> From<&'source str> for &'source str { ... } implementation (plus String, Box<str>, Cow<'source, str>, etc.).

Callbacks that return a Result<_, E> would have to satisfy a bound E: Into<T> where T is the field of the error variant (again, only if one was provided).
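Sketched with invented types, that bound would look like this:

// T: the field type of the #[error] variant.
enum LexError {
    BadEscape,
}

// E: a callback's own error type.
enum EscapeError {
    Unknown,
}

// E: Into<T> is satisfied via From, so the derive could convert and store
// a callback's error inside the error variant.
impl From<EscapeError> for LexError {
    fn from(_: EscapeError) -> Self {
        LexError::BadEscape
    }
}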

maciejhirsz avatar Apr 20 '20 12:04 maciejhirsz

One question this brings up: is there any thought of expanding #[error] to accept a callback, like #[error(|lex| ...)], and could that callback be accessed from lex within handle_text above?

That would at least seem somewhat consistent with how regex Tokens manage construction with parameters.

ratmice avatar Apr 21 '20 12:04 ratmice

@ratmice yeah, that would actually be a good way to supply the default constructor for the error type; it might be better than using a trait.

maciejhirsz avatar Apr 21 '20 14:04 maciejhirsz

Following up on the discussion from #135, and expanding on the suggestion above, it might be useful to have an error callback for regexes that returns the error type:

#[derive(Logos)]
enum Token {
    #[regex("some_complex_regex", error = |_| MyErrorType::InvalidComplexMatch)]
    ComplexMatch,

    #[error(|_| MyErrorType::default())]
    Error(MyErrorType),
}

maciejhirsz avatar Apr 26 '20 10:04 maciejhirsz

I suggest being able to add some special token to the regexes that triggers an error branch. Something like #[regex(r"foobar(bat|\!)")]. This is entirely off the top of my head, but the point here is that there are cases where you can say "if you've reached this point, I know you were trying to write a number, but you did it wrong".

Alxandr avatar Apr 26 '20 11:04 Alxandr

I suggest being able to add some special token to the regexes that triggers an error branch.

You only need that in places where the lexer can backtrack, which can only happen inside a (...)? or (...)* group. I just did some more reading on regex, and there is actually already syntax we can use for this, atomic groups:

#[regex(r"[0-9]+(?>\.[0-9])?")]

This would make it so that, for the input 123.foo, once 123. is read, not matching a digit on the next byte (f) would produce an error instead of backtracking and returning a match on 123.

I believe regex-syntax already supports the atomic group flag, I just ignore it due to, well, ignorance :)

maciejhirsz avatar Apr 26 '20 12:04 maciejhirsz

An important thing to note, though, is that if you just turn this into the #[error] variant, I would have to parse it anew myself to figure out which (of all my known error cases) actually happened. I.e., you're throwing away information I need.

Alxandr avatar Apr 26 '20 12:04 Alxandr

This is assuming a stateful error variant, so you can use callbacks to put the required state inside it.

That said, it might also make sense to get rid of the #[error] variant altogether, similarly to how #[end] got removed once we changed things into an iterator.

With a C-style enum Token, and a 0-size LexerError default, Option<Result<Token, LexerError>> is still only one byte.
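A quick check of that size claim, with stand-in types rather than anything Logos generates:

#[derive(Debug)]
enum Token {
    Number,
    Plus,
}

#[derive(Debug, Default)]
struct LexerError;

fn main() {
    // Niche optimization packs both the Result and the Option
    // discriminants into Token's spare byte values.
    assert_eq!(std::mem::size_of::<Option<Result<Token, LexerError>>>(), 1);
}

Potential syntax for custom errors could then look as follows: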

#[derive(Logos)]
#[logos(error = MyError)]
enum Token {
    #[regex("[0-9]+(?>\.[0-9]+)?", error = |_| MyError::JunkAfterPeriod)]
    Number,
}

This would be way more idiomatic, and would allow collecting the iterator into a Result, like so:

let tokens: Vec<Token> = Token::lexer("3.14").collect()?;
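For that ? to work, the collect would need an enclosing function returning a compatible Result; a hypothetical sketch under this proposal:

fn lex_all(source: &str) -> Result<Vec<Token>, MyError> {
    // The iterator would yield Result<Token, MyError>, and Result's
    // FromIterator impl keeps all tokens or stops at the first error.
    Token::lexer(source).collect()
}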

maciejhirsz avatar Apr 26 '20 13:04 maciejhirsz

How would you deal with 2 errors in the same case? For instance, the number case I was converting dealt with junk both after the period and after e (in 1e!, for instance, the ! is the error here). Just split them up into multiple regexes on the same case, but with different error statements?

Alxandr avatar Apr 26 '20 14:04 Alxandr

Just split them up into multiple regexes on the same case, but with different error statements?

That, or read lex.slice() in the error callback and emit a different error accordingly. Might have to do some manual parsing there, but that should be fine in the error path.

maciejhirsz avatar Apr 26 '20 15:04 maciejhirsz

I don't really agree with the "but that should be fine in the error path" part, because Logos has already determined that we arrived at this error path. You're essentially asking users of the library to redo the work that the library just did. You should be able to get some information out of the Lexer about why you arrived in this error callback. Maybe some index of which atomic group failed, for instance (and at what index in the source)?

Alxandr avatar Apr 26 '20 15:04 Alxandr

I don't really agree with the "but that should be fine in the error path" part, because Logos has already determined that we arrived at this error path.

Let me rephrase that: it's fine to sacrifice error path performance for happy path performance, because the error path is by definition rare.

You should be able to get some information out of the Lexer about why you arrived in this error callback.

Except the Lexer has, by design, absolutely no context for why it's doing anything it's doing. All it knows is at what index it started lexing a token, and at what index it currently is. Everything else is just munching through CPU instructions compiled from a state machine graph built out of the token definitions.

The options are:

  • Track extra information at runtime, which means sacrificing happy path performance for error path performance, which is almost certainly the wrong approach.
  • Inject some extra context into the callback statically at compile time (index of the atomic group would be possible), which is better, but it means that every callback has to accept two parameters now (LLVM will optimize it away if it's not used, but it might be awkward on the API).
  • One match expression:
    match lex.slice().bytes().next_back() {
        Some(b'.') => Error::JunkAfterPeriod,
        Some(b'e' | b'E') => Error::JunkAfterExponent,
        _ => Error::default(),
    }
    

maciejhirsz avatar Apr 26 '20 16:04 maciejhirsz