chumsky icon indicating copy to clipboard operation
chumsky copied to clipboard

Lack of documentation on integrating and composing chumsky errors

Open pehdfms opened this issue 8 months ago • 3 comments

First of all I want to give some appreciation for the work in the recent 0.10.0 release! It happened to align with a new project I've started working on and wow it's been a treat to get to use it in practice so soon. Unfortunately this alignment does mean that I got caught in the "not quite there yet" phase of documentation, especially when related to errors.

Let's look at a simple lexer variable:

let string = none_of('"') // TODO: Handle escapes.
    .repeated()
    .to_slice()
    .map(ToString::to_string)
    .delimited_by(
        just('"'),
        just('"').map_err(|err: Rich<char>| -> Rich<char> {
            Rich::custom(err.span().clone(), "Unterminated string literal")
        }),
    )
    .boxed();

This works fine, I suppose. But err.span().clone() feels just a bit dirty (I know, I know, it's a cheap type, let's just explore a little), so I'm thinking it'd be cool to just "edit" the error label that already exists, so let's do that:

// Zoom in to map_err
|err: Rich<char>| -> Rich<char> {
    err.label_with("Unterminated string literal") // labelled() returns in place, so this feels appropriate, right?
}

Ah, but that doesn't quite work. err.label_with returns an empty tuple (as rust helpfully points out!), so it's meant to be modifying the err in place. Let's try just modifying it and returning immediately then.

|mut err: Rich<char>| -> Rich<char> {
    err.label_with("Unterminated string literal"); // Type annotation needed
    err
}

And that explodes. Let's try to annotate it naively. We're returning a Rich error which parses over char, as explained in the function definition, right? So let's try to just type it as what Rust suggests:

|mut err: Rich<char>| -> Rich<char> {
    <Rich<char> as LabelError<&str, &str>>::label_with(&mut err, "Unterminated string literal")
    err.label_with("Unterminated string literal."); // Type annotation needed
    err
}

There we go. A little bit more of a mouthful than I expected, but I suppose that's fine. I think it'd be nice to add expected / found info since we know we want a " to finish this string, so let's try the Rich::expected_found() constructor.

|mut err: Rich<char>| -> Rich<char> {
    let new_err = <Rich<char> as LabelError<&str, char>>::expected_found(
        vec!['"'],
        err.found().map(|c| Maybe::Val(*c)),
        err.span().clone(),
    );
    new_err
}

Now let's check what that error looks like on our little suite of tests:

source: "unterminated
---
ParseResult {
    output: None,
    errs: [
        found end of input at 13..13 expected something else, or ''"'',
    ],
}

Expected something else? I thought we were expecting "? Hm. And that's where I get stuck, honestly. Not sure where to proceed from here. One thing I'd like to note though is trying to come to this from the entirely opposite path:

|mut err: Rich<char>| -> Rich<char> {
    Rich::expected_found(
        vec!['"'],
        err.found().map(|c| Maybe::Val(*c)),
        err.span().clone(),
    ) // Type annotation needed.
}

Now, when I first spotted this I decided to type each individual argument to expected_found, then Rich, then expected_found itself, and nothing fixed the issue 😭 With the benefit of hindsight, the solution is "trivial", just use that type information we found earlier!

So, getting to the point: I think having docs to get people off in the right direction to start implementing errors would definitely be helpful. I'm still unsure what each recovery method does, exactly, and how to compose more complicated errors with the above API (or do I just reimplement the error trait?), but I'll try getting a PR started to put "some" information in the docs.

If anyone has any tips for how they see the "order" of topics to tackle in error recovery, that would be helpful to structure the PR. Meanwhile I'm just banging together different things I've discovered from reading the code (which I'm not familiar with, unfortunately) and experimenting with things myself.

pehdfms avatar Apr 23 '25 15:04 pehdfms

You're right that there are some holes in the API here. That said, I think your original problem can broadly be solved more simply:

let string = none_of('"') // TODO: Handle escapes.
    .repeated()
    .to_slice()
    .map(ToString::to_string)
    .delimited_by(
        just('"'),
        just('"').labelled("string literal terminator"),
    );

Rich::custom is intended for non-syntactic errors (i.e: out-of-range integer literals and similar such things). Labelling is probably more effective as a way to communicate what you're looking for.

Regarding API holes, I can think of something that would be worth adding:

// To allow inline mapping of the error in `map_err`
err.with_reason(RichReason::Custom(...)) -> Rich

The more I think about it, the more I increasingly feel that map_err doesn't really have a place in 'modern' chumsky: the way error information is handled internally means that its exact behaviour is somewhat non-trivial to understand (it's not always clear exactly what error is being mapped, since chumsky also handles recovered errors), and most uses of it are probably better served with .labelled or the addition of extra error-manipulating combinators.

zesterer avatar Apr 24 '25 07:04 zesterer

Oh, awesome! This made me explore labelling things on a more granular basis than I was doing before and it definitely improved the error reporting "for free"

I assume this would be the best practice for end-of-line comments as well?

let comment = any().repeated().delimited_by(just(';'), newline());

Anyways - since I already have this thread up, I've been having a little bit of trouble with handling empty input, here's a reproduction case:

let token = just('+').to(Token::Plus);
token.padded().repeated().collect().then_ignore(end())

This compiles, but if I try to test it on " \n " I get the following output:

ParseResult {
    output: None,
    errs: [
        found end of input at 3..3 expected '+',
    ],
}

I can't really tell how this may be different from some of the examples. Honestly, I'd think token.padded().repeated().collect() would have been enough? But that also has the same issue. I'm definitely missing something dumb here so apologies 😄

pehdfms avatar Apr 24 '25 16:04 pehdfms

Ah, figured it out by just spamming padding around until it "worked" :)

I currently have this little definition for a Lisp (feel free to jump down to the last two lines for the issue):

/// Intermediate representation between the lexer and the parser for language-specific tokens.
#[derive(Clone, Debug, PartialEq)]
pub enum Token {
    // Single-character tokens.
    /// Left parenthesis.
    LeftParen,
    /// Right parenthesis.
    RightParen,
    /// Marks the beginning of a quoted expression.
    Quote,

    // Literals, ordered by precedence.
    /// Floating-point number.
    Number(f64),
    /// Single / Multiline string.
    String(String),
    /// Mnemonic (#vau, #pi, #theta)
    Mnemonic(String),
    /// Symbol (identifier)
    Symbol(String),
}

pub fn lexer<'src>() -> impl Parser<'src, &'src str, Vec<Token>, extra::Err<Rich<'src, char>>> {
    // Atoms.
    let digits = text::int(10).labelled("digits").boxed();
    let frac = just(".")
        .then(digits.clone())
        .labelled("fractional part")
        .boxed();
    let unreserved_character = none_of("()'\"#;")
        .filter(|c: &char| !c.is_whitespace())
        .labelled("unreserved character")
        .boxed();

    // Literals.
    let number = digits
        .then(frac.or_not())
        .to_slice()
        .from_str()
        .unwrapped()
        .map(Token::Number)
        .labelled("number")
        .boxed();

    let string = none_of('"')
        .repeated()
        .to_slice()
        .map(ToString::to_string)
        .map(Token::String)
        .delimited_by(just('"'), just('"').labelled("string terminator"))
        .labelled("string")
        .boxed();

    let mnemonic = just("#")
        .ignore_then(
            unreserved_character
                .clone()
                .labelled("reserved character")
                .repeated()
                .at_least(1)
                .collect(),
        )
        .map(Token::Mnemonic)
        .labelled("mnemonic")
        .boxed();

    let symbol = unreserved_character
        .repeated()
        .at_least(1)
        .collect()
        .map(Token::Symbol)
        .labelled("symbol")
        .boxed();

    let token = choice((
        just("(").to(Token::LeftParen),
        just(")").to(Token::RightParen),
        just("'").to(Token::Quote),
        number,
        string,
        mnemonic,
        symbol,
    ));

    let comment = just(";")
        .then(none_of("\n").repeated())
        .ignored()
        .then_ignore(end().or_not())
        .labelled("comment")
        .boxed();

    token
        .padded()
        .padded_by(comment.or_not()) // found eof on "; this is a comment"
        .repeated()
        .collect()
        .padded()
}

However, it fails on a source composed by a single comment: ; this is a comment Haven't been able to figure out how to debug that. I've read around the issues and it seems like the mechanism intended for debugging is still a WIP. Is there anything I can help with on that end?

pehdfms avatar Apr 24 '25 22:04 pehdfms