Regex and token prefix causes error match
I'm trying to lex the Python language, which has the keyword not and the separate keyword not in (which may have arbitrary spaces between the not and the in). My attempt at that is:
use logos::Logos;

#[derive(Logos, Debug, PartialEq, Eq)]
enum PythonToken {
    #[token("not")]
    Not,
    #[regex("not[ ]+in")]
    NotIn,
    #[regex(" +", logos::skip)]
    #[error]
    Error,
}
#[test]
fn python_token_test() {
    let mut lexer = PythonToken::lexer("not not");
    let mut result = Vec::new();
    while let Some(x) = lexer.next() {
        result.push(x);
    }
    assert_eq!(result.as_slice(), &[PythonToken::Not, PythonToken::Not]);
}
However, this gives the result [Error, Not], even though the first not could validly be parsed as a single Not. Is there any way to work around this issue? Or is the example I've provided buggy in some way?
Try not([ ]+in)?
Any way to do that and have them end up with different tokens? Or is Not { has_in: bool } the best approach?
Ah yes, since they can have conflicting matches, you can force the short one to be correct with #[token("not", priority = 100)].
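For context, here is a minimal sketch of where that attribute would go, assuming the same pre-0.13 logos API (with the #[error] variant) used in the example above; as the next reply notes, it did not change the result here:

use logos::Logos;

#[derive(Logos, Debug, PartialEq, Eq)]
enum PythonToken {
    // The higher priority is meant to make the plain keyword win
    // over the regex when both could match the same prefix.
    #[token("not", priority = 100)]
    Not,
    #[regex("not[ ]+in")]
    NotIn,
    #[regex(" +", logos::skip)]
    #[error]
    Error,
}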
The priority doesn't appear to make any difference. Moving to not([ ]+in)? and using a single NotOrNotIn(bool) also doesn't work: if I replace the two Not and NotIn cases with:
#[regex("not([ ]+in)?", |lex| lex.slice().len() > 3)]
NotOrNotIn(bool),
I get [Error, NotOrNotIn(false)] for the input not not.
Ok, that's a bug, looks similar to #181 too.
Thanks for the info, that does look related. To fix the problem I've switched to treating them as two separate keywords, and then merging them in the parser I used afterwards.
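For anyone landing here, a minimal sketch of that workaround, assuming the same pre-0.13 logos API as the examples above; the RawToken/MergedToken names and the merge helper are illustrative, not part of logos:

use logos::Logos;

#[derive(Logos, Debug, PartialEq, Eq, Clone, Copy)]
enum RawToken {
    #[token("not")]
    Not,
    #[token("in")]
    In,
    #[regex(" +", logos::skip)]
    #[error]
    Error,
}

#[derive(Debug, PartialEq, Eq)]
enum MergedToken {
    Not,
    NotIn,
    In,
    Error,
}

// Collapse a Not immediately followed by an In into a single NotIn.
// Whitespace is already skipped by the lexer, so adjacency in the token
// stream is enough here.
fn merge(tokens: &[RawToken]) -> Vec<MergedToken> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < tokens.len() {
        match (tokens[i], tokens.get(i + 1)) {
            (RawToken::Not, Some(&RawToken::In)) => {
                out.push(MergedToken::NotIn);
                i += 2;
            }
            (RawToken::Not, _) => {
                out.push(MergedToken::Not);
                i += 1;
            }
            (RawToken::In, _) => {
                out.push(MergedToken::In);
                i += 1;
            }
            (RawToken::Error, _) => {
                out.push(MergedToken::Error);
                i += 1;
            }
        }
    }
    out
}

fn main() {
    let raw: Vec<RawToken> = RawToken::lexer("not not in").collect();
    assert_eq!(merge(&raw), vec![MergedToken::Not, MergedToken::NotIn]);
}

Since "not" and "in" no longer share a prefix, the lexer never has to backtrack, and the ambiguity is resolved in the follow-up pass instead.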
I'm hitting a similar issue. I'm using a wrapper around Logos to handle modal lexing (to support string interpolation, for example). I was able to reduce the error to the following example:
use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
enum Token {
    #[regex("[^ab]+")]
    Literal,
    #[regex("a")]
    SingleA,
    #[regex("a(b+)c")]
    Special,
    #[regex("b+")]
    Bs,
    #[error]
    Error,
}
fn main() {
    let mut lex = Token::lexer("ab");
    assert_eq!(lex.next(), Some(Token::SingleA));
    assert_eq!(lex.next(), Some(Token::Bs));
}
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `Some(Error)`,
 right: `Some(SingleA)`', src/main.rs:26:5
One confirmation that it has to do with backtracking is that if we replace SingleA with a token that matches more characters, it is still ambiguous, but it doesn't fail. I imagine that's because when matching fails in the second case, the lexer only has to select the AThenBs token, while in the first case it needs to go back one character in the stream.
use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
enum Token {
    #[regex("[^ab]+")]
    Literal,
    #[regex("ab+")]
    AThenBs,
    #[regex("a(b+)c")]
    Special,
    #[regex("b+")]
    Bs,
    #[error]
    #[regex(r"[ \t\n\f]+", logos::skip)]
    Error,
}
fn main() {
    let mut lex = Token::lexer("ab");
    assert_eq!(lex.next(), Some(Token::AThenBs));
}