chumsky icon indicating copy to clipboard operation
chumsky copied to clipboard

Empty repeated() produces a backwards span

Open buildwithzephyr opened this issue 4 months ago • 4 comments

When repeated() repeats 0 times, the reported span is backwards - it starts at the start of the next span, and ends at the end of the previous span.

Reproduction:

use chumsky::{extra::Err, input::BorrowInput, prelude::*};

#[derive(Debug)]
struct Node<T> {
    data: T,
    span: SimpleSpan,
}

#[derive(Debug, PartialEq, Clone, Copy)]
enum Token {
    LeftCurlyBrace,
    RightCurlyBrace,
    A,
}

#[derive(Debug)]
struct Braces<T> {
    left_brace: Node<Token>,
    contents: Node<T>,
    right_brace: Node<Token>,
}

macro_rules! node {
    ($parser:expr) => {
        $parser.map_with(|x, extra| Node {
            data: x,
            span: extra.span(),
        })
    };
}

type MyErr<'src> = Err<Rich<'src, Token, SimpleSpan>>;

fn braces<'src, I, O>(
    parser: impl Parser<'src, I, O, MyErr<'src>>,
) -> impl Parser<'src, I, Braces<O>, MyErr<'src>>
where
    I: BorrowInput<'src, Token = Token, Span = SimpleSpan>,
{
    node!(just(Token::LeftCurlyBrace))
        .then(node!(parser))
        .then(node!(just(Token::RightCurlyBrace)))
        .map(|((left_brace, contents), right_brace)| Braces {
            left_brace,
            contents,
            right_brace,
        })
}

#[derive(Debug)]
struct A;

fn root<'src, I>() -> impl Parser<'src, I, Braces<Vec<Node<A>>>, MyErr<'src>>
where
    I: BorrowInput<'src, Token = Token, Span = SimpleSpan>,
{
    let a = just(Token::A).map(|_| A);
    let many_as = node!(a).repeated().collect();
    braces(many_as)
}

fn tuple_identity((t, s): &(Token, SimpleSpan)) -> (&Token, &SimpleSpan) {
    (t, s)
}

fn main() {
    let tokenstream = vec![
        (Token::LeftCurlyBrace, SimpleSpan::from(0..1)),
        (Token::RightCurlyBrace, SimpleSpan::from(3..4)),
    ];
    let result = root()
        .parse(tokenstream.map((4..4).into(), tuple_identity))
        .into_result();
    println!("{:#?}", result);
}

When run, this produces this output:

Ok(
    Braces {
        left_brace: Node {
            data: LeftCurlyBrace,
            span: 0..1,
        },
        contents: Node {
            data: [],
            span: 3..1,
        },
        right_brace: Node {
            data: RightCurlyBrace,
            span: 3..4,
        },
    },
)

Notice how the span of the middle node is 3..1.

If I change the token stream to make the contents of the braces non-empty, the output looks fine.

Ok(
    Braces {
        left_brace: Node {
            data: LeftCurlyBrace,
            span: 0..1,
        },
        contents: Node {
            data: [
                Node {
                    data: A,
                    span: 1..2,
                },
                Node {
                    data: A,
                    span: 2..3,
                },
            ],
            span: 1..3,
        },
        right_brace: Node {
            data: RightCurlyBrace,
            span: 3..4,
        },
    },
)

This may be something that only happens on a MappedInput. If spans are consecutive, then the empty node has span n..n which is fine since that's a zero-length span. But when the lexer produces non-consecutive spans (due to skipping whitespace), we get this bug.

I guess it's sort of a matter of debate as to what the correct span should be. Is it accurate to say that the span was 1..3 when nothing was matched? Should it be 1..1 to reflect the previous span? 3..3 to reflect the next span? I'm not sure, but 3..1 seems wrong as ends before it begins - and can create downstream bugs.

buildwithzephyr avatar Sep 01 '25 00:09 buildwithzephyr

This is very strange indeed! I'll try to find some time to look at this in more detail soon.

zesterer avatar Sep 12 '25 22:09 zesterer

Run into this today. Looking at the code, the issue seems to be in MappedInput's implementation of Input::span():

https://github.com/zesterer/chumsky/blob/e350fc66d99c9695b3156becca4e5fc34b8b40e4/src/input.rs#L637-L649

Assuming there are still more tokens to read, this takes the span of the element at the position of the start cursor of the inner input and maps it to extract its span. It then creates a range from the start of that span to the end of the span of the element immediately before the end cursor (which is stored in the cursor itself, assuming the cursor doesn't point to the first element). Assuming sensible monotonically increasing valid spans without overlaps in the input:

  • If the start cursor points to an element strictly before the one pointed by the end cursor, the result is always valid. In this case, the previous element from the end cursor must be either the one pointed to by the start cursor or appear after that. In both cases, the end offset of its span must be grater than the start offset of the span of the start element
  • If the start and end cursors point to the same element however, the new span starts at the start of the element those cursors point to, but ends at the end of their previous element. This is only valid if the last element ends at the same position as the current element starts, which is not guaranteed and depends on the user input to parse

When you try to get the span of a parser with .map_with() and MapExtra::span(), this is what ends up running:

https://github.com/zesterer/chumsky/blob/e350fc66d99c9695b3156becca4e5fc34b8b40e4/src/combinator.rs#L390-L396 https://github.com/zesterer/chumsky/blob/e350fc66d99c9695b3156becca4e5fc34b8b40e4/src/input.rs#L1848-L1869

If the sub-parser of MapWith consumes no input, MapWith ends up creating a MapExtra with the same cursor as before and after, so when it ends up calling I::span() it will pass a cursor range made up of 2 equal cursors, and thus be in the second case above. Same thing happens with TryMapWith or InputRef::span_since() within custom()

ALVAROPING1 avatar Nov 15 '25 00:11 ALVAROPING1

Can confirm, I also just hit this case.

Jezza avatar Nov 22 '25 18:11 Jezza

Same problem here.

uros-5 avatar Nov 25 '25 09:11 uros-5