
lexer consumes more of string than expected

Open Angrymanvvv opened this issue 11 months ago • 4 comments

I'm trying to write a lexer for a programming language.

Initially, there are two token kinds defined:

#[token("-")]
Sub,

#[regex(r"[0-9]+", |output| output.slice().parse::<i64>().ok())]
IntLiteral(i64),

The input String "-1" correctly gets lexed as Token::Sub and Token::IntLiteral(1).

If I now add a third token kind

#[regex(r"([+-]?(([0-9]+[eE][+-]?[0-9]+)|([0-9]*\.[0-9]+[eE][+-]?[0-9]+|[0-9]*\.[0-9]+)))", |output| {
    output.slice().parse::<f64>().unwrap()
})]
FloatLiteral(f64),

that matches neither a lone - nor a lone 1, the input "-1" is now lexed as Token::Sub followed by None. After the first call to lexer.next(), lexer.slice() returns "-1", indicating that the Sub token somehow consumed the 1 of the input string as well.
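To make the expected behaviour concrete, here is a minimal sketch of longest-match lexing for these three rules, written by hand with only the standard library. It is not how Logos works internally; the helpers `digits`, `exponent`, `float_len`, and `lex` are hypothetical names introduced for illustration.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Sub,
    IntLiteral(i64),
    FloatLiteral(f64),
}

/// Number of leading ASCII digits in `b`.
fn digits(b: &[u8]) -> usize {
    b.iter().take_while(|c| c.is_ascii_digit()).count()
}

/// Length of a leading `[eE][+-]?[0-9]+`, or 0 if there is none.
fn exponent(b: &[u8]) -> usize {
    if !matches!(b.first().copied(), Some(b'e' | b'E')) {
        return 0;
    }
    let sign = usize::from(matches!(b.get(1).copied(), Some(b'+' | b'-')));
    let n = digits(&b[1 + sign..]);
    if n == 0 { 0 } else { 1 + sign + n }
}

/// Length of the longest prefix of `s` matching the float pattern
/// from the issue, or 0 if no prefix matches.
fn float_len(s: &str) -> usize {
    let b = s.as_bytes();
    let sign = usize::from(matches!(b.first().copied(), Some(b'+' | b'-')));
    let int_digits = digits(&b[sign..]);
    let mut i = sign + int_digits;
    if b.get(i).copied() == Some(b'.') {
        // `[0-9]*\.[0-9]+` with an optional exponent.
        let frac = digits(&b[i + 1..]);
        if frac == 0 {
            return 0;
        }
        i += 1 + frac;
        return i + exponent(&b[i..]);
    }
    // No fractional part: `[0-9]+[eE][+-]?[0-9]+` needs digits and an exponent.
    let exp = exponent(&b[i..]);
    if int_digits == 0 || exp == 0 { 0 } else { i + exp }
}

/// Lex `input` by taking the longest match at each position.
/// On failure, returns the byte offset where no rule matched
/// (or where a matched literal could not be parsed).
fn lex(input: &str) -> Result<Vec<Token>, usize> {
    let mut tokens = Vec::new();
    let mut pos = 0;
    while pos < input.len() {
        let rest = &input[pos..];
        let sub = usize::from(rest.starts_with('-'));
        let int = digits(rest.as_bytes());
        let float = float_len(rest);
        let len = sub.max(int).max(float);
        if len == 0 {
            return Err(pos);
        }
        let text = &rest[..len];
        let token = if len == float {
            Token::FloatLiteral(text.parse().map_err(|_| pos)?)
        } else if len == int {
            Token::IntLiteral(text.parse().map_err(|_| pos)?)
        } else {
            Token::Sub
        };
        tokens.push(token);
        pos += len;
    }
    Ok(tokens)
}
```

With this sketch, `lex("-1")` yields `Sub` followed by `IntLiteral(1)`, because the float rule fails on "-1" and the lexer falls back to the one-character `-` match. `lex("-1.5")` yields a single `FloatLiteral(-1.5)`.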

Edit: See the comment below for a repo containing a small file that replicates the behaviour.

Angrymanvvv avatar May 16 '25 17:05 Angrymanvvv

Yes, I realize that parsing -1.0 as a single Float is probably a bad idea and has a lot of flaws. Nevertheless, this seems like sketchy behaviour.
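One such flaw: under longest-match rules, an input like "1-1.5" would lex as an integer followed by the float -1.5, with no Sub token in between. A common alternative is to keep float literals unsigned in the lexer and apply the minus sign in the parser. The sketch below assumes that approach; `parse_unary` is a hypothetical helper and the token type is trimmed down for illustration.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Sub,
    FloatLiteral(f64),
}

/// Parse zero or more leading `Sub` tokens followed by an unsigned float
/// literal, treating each `Sub` as unary negation. Returns the value and
/// the remaining tokens, or `None` if the tokens run out first.
fn parse_unary(tokens: &[Token]) -> Option<(f64, &[Token])> {
    match tokens.split_first()? {
        (Token::Sub, rest) => {
            let (value, rest) = parse_unary(rest)?;
            Some((-value, rest))
        }
        (Token::FloatLiteral(value), rest) => Some((*value, rest)),
    }
}
```

Here the tokens `Sub, FloatLiteral(1.5)` parse to the value -1.5, and the lexer never has to decide whether a `-` is a sign or an operator.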

Angrymanvvv avatar May 17 '25 18:05 Angrymanvvv

Hi @Angrymanvvv, thanks for reporting your bug!

Could you please format the code using triple backticks? See guide here.

Also, don't include Zip files with code, just put it here, along with the output. Thanks!

jeertmans avatar May 19 '25 08:05 jeertmans

Thanks for the reply! I've created a minimal code file that replicates the behaviour, as well as an explanation and output in the corresponding readme file. See https://github.com/Angrymanvvv/Logos-Bug-MVE

Angrymanvvv avatar May 19 '25 09:05 Angrymanvvv

Fixed by #491

The following passes

mod issue_478 {
    use logos::Logos;

    #[derive(Logos, Debug, Clone, PartialEq)]
    pub enum Token {
        #[token("-")]
        Sub,

        #[regex(r"[0-9]+", |output| {
            output.slice().parse::<i64>().ok()
        })]
        IntLiteral(i64),

        #[regex(r"([+-]?(([0-9]+[eE][+-]?[0-9]+)|([0-9]*\.[0-9]+[eE][+-]?[0-9]+|[0-9]*\.[0-9]+)))", |output| {
                output.slice().parse::<f64>().ok()
        })]
        FloatLiteral(f64),
    }

    #[test]
    fn neg_int_two_tokens() {
        let mut lexer = Token::lexer("-1");
        assert_eq!(lexer.next(), Some(Ok(Token::Sub)));
        assert_eq!(lexer.next(), Some(Ok(Token::IntLiteral(1))));
        assert_eq!(lexer.next(), None);
    }

    #[test]
    fn float_literals() {
        for (input, output) in [
            ("1.0", Token::FloatLiteral(1.0)),
            (".01", Token::FloatLiteral(0.01)),
            ("3.1e-12", Token::FloatLiteral(3.1e-12)),
            ("2E3", Token::FloatLiteral(2E3)),
            ("1.5", Token::FloatLiteral(1.5)),
            ("-1.5", Token::FloatLiteral(-1.5)),
        ] {
            let token = Token::lexer(input).next();
            assert_eq!(token, Some(Ok(output)));
        }
    }
}

robot-rover avatar Nov 22 '25 00:11 robot-rover