logos icon indicating copy to clipboard operation
logos copied to clipboard

Regex and token prefix causes error match

Open ndmitchell opened this issue 4 years ago • 7 comments

I'm trying to lex the Python language, which has the keyword not and the separate keyword not in (which may have arbitrary spaces between the not and in). My attempt at that is:

use logos::Logos;

#[derive(Logos, Debug, PartialEq, Eq)]
enum PythonToken {
    #[token("not")]
    Not,
    #[regex("not[ ]+in")]
    NotIn,
    #[regex(" +", logos::skip)]
    #[error]
    Error,
}

#[test]
fn python_token_test() {
    let mut lexer = PythonToken::lexer("not not");
    let mut result = Vec::new();
    while let Some(x) = lexer.next() {
        result.push(x);
    }
    assert_eq!(result.as_slice(), &[PythonToken::Not, PythonToken::Not])
}

However, this gives the result [Error, Not], when the first not could be validly parsed as a single Not. Is there any way to work around this issue? Or is the example I've provided buggy in some way?

ndmitchell avatar Feb 01 '21 20:02 ndmitchell

Try not([ ]+in)?

maciejhirsz avatar Feb 01 '21 20:02 maciejhirsz

Any way to do that and have them end up with different tokens? Or is Not{has_in: bool} the best approach?

ndmitchell avatar Feb 01 '21 20:02 ndmitchell

Ah yes, since they can have conflicting matches you can force the short one to be correct with #[token("not", priority = 100)]

maciejhirsz avatar Feb 01 '21 20:02 maciejhirsz

The priority doesn't appear to make any difference. Moving to not([ ]+in)? and using a single NotOrIsNot(bool) also doesn't work - if I replace the two Not and NotIn cases with:

    #[regex("not([ ]+in)?", |lex| lex.slice().len() > 3)]
    NotOrNotIn(bool),

I get [Error, NotOrNotIn(false)] for the input not not.

ndmitchell avatar Feb 01 '21 20:02 ndmitchell

Ok, that's a bug, looks similar to #181 too.

maciejhirsz avatar Feb 01 '21 20:02 maciejhirsz

Thanks for the info, that does look related. To fix the problem I've switched to treating them as two separate keywords, and then merging them in the parser I used afterwards.

ndmitchell avatar Feb 01 '21 21:02 ndmitchell

I'm hitting a similar issue. I'm using a wrapper around Logos to handle modal lexing (to support string interpolation for example). I could reduce the error to the following example:

use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
enum Token {
    #[token("[^ab]+")]
    Literal,

    #[regex("a")]
    SingleA,

    #[regex("a(b+)c")]
    Special,

    #[regex("b+")]
    Bs,

    #[error]
    Error,
}

fn main() {
    let mut lex = Token::lexer("ab");

    assert_eq!(lex.next(), Some(Token::SingleA));
    assert_eq!(lex.next(), Some(Token::Bs));
}
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `Some(Error)`,
 right: `Some(SingleA)`', src/main.rs:26:5

One confirmation that it has to do with backtracking is that if we replace SingleA with a token that matches more characters, it is still ambiguous, but it doesn't fail. I imagine it's because when it fails in the second case, it just has to select the AThenBs token, while in the first case, it needs to go 1 character back in the stream.

use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
enum Token {
    #[token("[^ab]+")]
    Literal,

    #[regex("ab+")]
    AThenBs,

    #[regex("a(b+)c")]
    Special,

    #[regex("b+")]
    Bs,

    #[error]
    #[regex(r"[ \t\n\f]+", logos::skip)]
    Error,
}

fn main() {
    let mut lex = Token::lexer("ab");

    assert_eq!(lex.next(), Some(Token::AThenBs));
}

yannham avatar Jul 02 '21 15:07 yannham