Error with overlapping token definitions
I'm getting a strange error, apparently when one regex can match the prefix of another regex. Maybe; I'm honestly not sure what the problem is. Here's a simplified case:
#[derive(Logos, Debug, PartialEq)]
#[logos(skip r".|[\r\n]")] // skip everything not recognized
pub enum LogosToken {
    // any letter except capital Z
    #[regex(r"[a-zA-Y]+", priority = 3)]
    WordExceptZ,
    // any number
    #[regex(r"[0-9]+", priority = 3)]
    Number,
    /*
    This expression is:
        (letter or number)* [Z] (letter or number)*
    In other words, a token with any number of letters or numbers,
    including at least one capital Z.
    */
    #[regex(r"[a-zA-Z0-9]*[Z][a-zA-Z0-9]*", priority = 3)]
    TermWithZ,
}

#[pg_extern]
fn test_logos() {
    let mut lex = LogosToken::lexer("hello 42world fooZfoo");
    while let Some(result) = lex.next() {
        let slice = lex.slice();
        println!("{:?} {:?}", slice, result);
    }
}
This generates:
"hello" Ok(WordExceptZ)
"42world" Err(())
"fooZfoo" Ok(TermWithZ)
If I replace the regex over TermWithZ with #[regex(r"Z", priority = 3)], I get:
"hello" Ok(WordExceptZ)
"42" Ok(Number)
"world" Ok(WordExceptZ)
"foo" Ok(WordExceptZ)
"Z" Ok(TermWithZ)
"foo" Ok(WordExceptZ)
The "42world" is getting recognized correctly as a number and word.
What I don't understand is why the first TermWithZ regex messes up the recognition of "42world". It doesn't contain a Z, so TermWithZ should ignore it completely and let the first two variants do their job.
After looking at this a bit more, I'm guessing I'm running into the "no backtracking" limitation.
Any workaround suggestions are welcome.
Hello! Can you please run in debug mode and print out the corresponding graph?
See https://logos.maciej.codes/debugging.html.
@jeertmans
{
1: ::<skip> (<skip>),
2: [80-BF] ⇒ 1,
3: [A0-BF][80-BF] ⇒ 1,
4: [80-BF][80-BF] ⇒ 1,
5: [80-9F][80-BF] ⇒ 1,
6: [90-BF][80-BF][80-BF] ⇒ 1,
7: [80-BF][80-BF][80-BF] ⇒ 1,
8: [80-8F][80-BF][80-BF] ⇒ 1,
10: ::WordExceptZ,
13: ::Number,
16: ::TermWithZ,
17: {
[0-9] ⇒ 17,
[A-Z] ⇒ 17,
[a-z] ⇒ 17,
_ ⇒ 16,
},
18: Z ⇒ 17,
19: {
[0-9] ⇒ 19,
[A-Z] ⇒ 19,
[a-z] ⇒ 19,
_ ⇒ 18,
},
22: {
[0-9] ⇒ 24,
[A-Y] ⇒ 19,
Z ⇒ 25,
[a-z] ⇒ 19,
_ ⇒ 13,
},
24: {
[0-9] ⇒ 24,
[A-Y] ⇒ 19,
Z ⇒ 25,
[a-z] ⇒ 19,
_ ⇒ 13,
},
25: {
[0-9] ⇒ 25,
[A-Z] ⇒ 25,
[a-z] ⇒ 25,
_ ⇒ 16,
},
27: {
[0-9] ⇒ 19,
[A-Y] ⇒ 29,
Z ⇒ 25,
[a-z] ⇒ 29,
_ ⇒ 10,
},
29: {
[0-9] ⇒ 19,
[A-Y] ⇒ 29,
Z ⇒ 25,
[a-z] ⇒ 29,
_ ⇒ 10,
},
32: {
[0-9] ⇒ 25,
[A-Z] ⇒ 25,
[a-z] ⇒ 25,
_ ⇒ 16,
},
33: {
[00-/] ⇒ 1,
[0-9] ⇒ 22,
[:-@] ⇒ 1,
[A-Y] ⇒ 27,
Z ⇒ 32,
[[-`] ⇒ 1,
[a-z] ⇒ 27,
[{-7F] ⇒ 1,
[C2-DF] ⇒ 2,
[E0] ⇒ 3,
[E1-EC] ⇒ 4,
[ED] ⇒ 5,
[EE-EF] ⇒ 4,
[F0] ⇒ 6,
[F1-F3] ⇒ 7,
[F4] ⇒ 8,
},
}
Hmm, so I think you are right that the error might come from the no-backtracking issue, but a perfect Logos implementation shouldn't have that problem.
First question: are you using Logos >=0.14.0? As it may have fixed some issues.
Second: did you try not setting any priority? Here you set the priority to 3 on all tokens, which doesn't make much sense: priority is only used when two or more patterns match the same slice, and they are then differentiated by their priority. If the number is the same for all of them, it doesn't help. So please try without any priority, and then adjust one priority at a time.
Last, patterns embedded in others, like TermWithZ containing both WordExceptZ and Number, are often a source of backtracking issues.
Yes, using 0.14.1.
I had to set the priorities > 3 because of the skip expression,
#[logos(skip r".|[\r\n]")]
The "." can match anything, so there was a conflict. Other than that, the priorities can all be the same.
I'm not sure I understand the third point. How else would you do it?
> Yes, using 0.14.1.

Ok, perfect.

> I had to set the priorities > 3 because of the skip expression, #[logos(skip r".|[\r\n]")]. The "." can match anything, so there was a conflict. Other than that, the priorities can all be the same.

Ok, seems legit; I wasn't aware of that.

> I'm not sure I understand the third point. How else would you do it?
Usually, you can break your logic down into unique, non-overlapping tokens, and then use callbacks and extras to handle more complex logic. Unfortunately, I don't have enough time to dig into this problem and really understand the root cause of why it doesn't work :-/
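A rough sketch of that suggestion, assuming the original example's token set (this is my own illustration, not Logos API): lex only non-overlapping tokens (words, numbers, a bare Z), keep their byte spans, and merge any unbroken run that contains a Z into a single TermWithZ in a post-processing pass:

```rust
use std::ops::Range;

// Hypothetical token type for the post-processing pass; TermWithZ is
// only ever produced by the merge step, never by the lexer itself.
#[derive(Debug, PartialEq, Clone)]
enum Tok {
    Word(String),
    Number(String),
    Z,
    TermWithZ(String),
}

/// Merge every unbroken run of Word/Number/Z tokens containing at
/// least one Z into a single TermWithZ. A run is unbroken when each
/// token starts at the byte offset where the previous one ended.
fn merge_z_terms(toks: Vec<(Tok, Range<usize>)>) -> Vec<Tok> {
    fn flush(run: &mut Vec<(Tok, Range<usize>)>, out: &mut Vec<Tok>) {
        if run.iter().any(|(t, _)| *t == Tok::Z) {
            // The run contains a Z: concatenate it into one token.
            let text: String = run
                .iter()
                .map(|(t, _)| match t {
                    Tok::Word(s) | Tok::Number(s) | Tok::TermWithZ(s) => s.as_str(),
                    Tok::Z => "Z",
                })
                .collect();
            out.push(Tok::TermWithZ(text));
        } else {
            // No Z anywhere in the run: emit the tokens unchanged.
            out.extend(run.iter().map(|(t, _)| t.clone()));
        }
        run.clear();
    }

    let mut out = Vec::new();
    let mut run: Vec<(Tok, Range<usize>)> = Vec::new();
    for (tok, span) in toks {
        let adjacent = run.last().map_or(true, |(_, prev)| prev.end == span.start);
        if !adjacent {
            flush(&mut run, &mut out);
        }
        run.push((tok, span));
    }
    flush(&mut run, &mut out);
    out
}
```

Feeding this pass the spanned tokens a simple lexer would produce for "hello 42world fooZfoo" yields Word("hello"), Number("42"), Word("world"), TermWithZ("fooZfoo"), which is the segmentation the original example is after.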
It's good to see a bit of activity on the no-backtracking issue. I'm not fully convinced that that's what the problem is here, though.
Can I bump this issue? It has come up again in my project.
@0x2a-42 @jeertmans
> Can I bump this issue? It has come up again in my project.

What do you mean by "bump"?
https://www.google.com/search?client=firefox-b-1-d&q=bump+a+post+definition
To "bump" a post means to bump it up on the priority list, that is, to bring it to your attention again.
Here is a simpler version of the problem:
#[derive(Logos, Debug, PartialEq)]
#[logos(skip r" ")]
pub enum LogosTester {
    #[regex(r"[a-z]+")]
    Word,
    #[regex(r"[0-9]+")]
    Number,
    // identifier followed by X
    #[regex(r"[a-z]([a-z0-9])*X")]
    Fieldname,
}

pub fn logos_test() {
    let mut lex = LogosTester::lexer("a1");
    while let Some(result) = lex.next() {
        println!("{:?}", result);
    }
}
Running this returns Err(()).
Here is the debug dump:
{
1: ::<skip> (<skip>),
3: ::Word,
6: ::Number,
7: {
[0-9] ⇒ 7,
_ ⇒ 6,
},
9: ::Fieldname,
10: X ⇒ 9,
11: {
[0-9] ⇒ 11,
[a-z] ⇒ 11,
_ ⇒ 10,
},
13: {
[0-9] ⇒ 11,
X ⇒ 9,
[a-z] ⇒ 13,
_ ⇒ 3,
},
15: {
⇒ 1,
[0-9] ⇒ 7,
[a-z] ⇒ 13,
},
}
@ccleve: Your example works with Herring, where the following minimized DFA is produced, which looks close enough to the Logos debug output.
flowchart LR
style start fill:#FFFFFF00, stroke:#FFFFFF00
start-->0;
0@{shape: circ}
0 -- "{0x20}" --> 1
0 -- "{'0'-'9'}" --> 2
0 -- "{'a'-'z'}" --> 3
1@{shape: dbl-circ}
2@{shape: dbl-circ}
2 -- "{'0'-'9'}" --> 2
3@{shape: dbl-circ}
3 -- "{'0'-'9'}" --> 4
3 -- "{'X'}" --> 5
3 -- "{'a'-'z'}" --> 3
4@{shape: circ}
4 -- "{'0'-'9', 'a'-'z'}" --> 4
4 -- "{'X'}" --> 5
5@{shape: dbl-circ}
skipped_regex_0[skipped regex]@{shape: rect}
1 .-> skipped_regex_0
Number_0[Number]@{shape: rect}
2 .-> Number_0
Word_0[Word]@{shape: rect}
3 .-> Word_0
Fieldname_0[Fieldname]@{shape: rect}
5 .-> Fieldname_0
Feel free to use that crate if it solves your problem. It probably will not be merged into Logos anytime soon, as it currently requires an additional LLVM optimization to reach performance similar to Logos.
The problem with the code generated by Logos seems to be that in its state 11, where the end of input is reached, the state machine still transitions to state 10, where it will eventually return an error instead of the longest match.
fn goto11_ctx10_x<'s>(lex: &mut Lexer<'s>) {
    while let Some(arr) = lex.read::<&[u8; 16]>() {
        // unrolled loop ...
    }
    while lex.test(pattern0) {
        lex.bump_unchecked(1);
    }
    goto10_ctx10_x(lex);
}
To fix this, Logos would have to keep track of the last accepting state and the offset at which it was visited.
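That bookkeeping can be sketched with a small hand-rolled state machine (my own illustration of maximal munch with rollback for the simplified Word/Number/Fieldname grammar above; this is not Logos-generated code):

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Token {
    Word,
    Number,
    Fieldname,
}

/// Lex one token from input[start..], returning the token and the
/// offset just past it, or None if no token matches at all.
fn next_token(input: &[u8], start: usize) -> Option<(Token, usize)> {
    let mut pos = start;
    // Longest accepting prefix seen so far: the key extra state.
    let mut last_accept: Option<(Token, usize)> = None;
    let mut state = 0u8;

    loop {
        let b = input.get(pos).copied();
        state = match (state, b) {
            // State 0: start.
            (0, Some(b'0'..=b'9')) => 1,
            (0, Some(b'a'..=b'z')) => 2,
            // State 1: inside a Number.
            (1, Some(b'0'..=b'9')) => 1,
            // State 2: a Word that is also a possible Fieldname prefix.
            (2, Some(b'a'..=b'z')) => 2,
            (2, Some(b'0'..=b'9')) => 3,
            (2, Some(b'X')) => 4,
            // State 3: Fieldname body after a digit (no longer a Word).
            (3, Some(b'a'..=b'z')) | (3, Some(b'0'..=b'9')) => 3,
            (3, Some(b'X')) => 4,
            // No transition (including end of input): instead of
            // failing, roll back to the last accepting state.
            _ => return last_accept,
        };
        pos += 1;
        match state {
            1 => last_accept = Some((Token::Number, pos)),
            2 => last_accept = Some((Token::Word, pos)),
            4 => return Some((Token::Fieldname, pos)),
            _ => {}
        }
    }
}
```

On the input "a1", the machine dies in the Fieldname-prefix state at end of input, but rolls back to the remembered (Word, offset 1) instead of returning an error; the caller can then resume at offset 1 and lex the Number. That rollback is exactly what the Logos code above is missing.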
The equivalent state 4 in the code generated by Herring looks like this.
State::S4 => {
    match lexer.next_byte() {
        Some(b) if LUT0[b as usize] & 1u8 > 0 => {
            state = State::S4;
            continue;
        }
        Some(88u8) => {
            state = State::S5;
            continue;
        }
        None => {
            lexer.offset -= 1;
            break;
        }
        _ => break,
    }
}
Here the None branch breaks out of the state-machine loop, and afterwards a Word token will be returned, since state 3 was the last accepting state that was visited.
State::S3 => {
    last_accept = LastAccept::Token(LogosTester::Word, lexer.offset);
    // ...

match last_accept {
    LastAccept::None => {
        use herring::Source;
        while !lexer.source.is_boundary(lexer.offset) {
            lexer.offset += 1;
        }
        return Some(Err(Default::default()));
    }
    LastAccept::Token(token, offset) => {
        lexer.offset = offset;
        return Some(Ok(token));
    }
    LastAccept::TokenCallback(callback, offset) => {
        lexer.offset = offset;
        return Some(callback(lexer));
    }
    LastAccept::Skip(offset) => {
        lexer.offset = offset;
    }
    LastAccept::SkipCallback(callback, offset) => {
        lexer.offset = offset;
        callback(lexer);
    }
}
> To "bump" a post means to bump it up on the priority list, that is, to bring it to your attention again. …
I got this, but there is no such thing on GitHub, as we don't have tasks or projects. I will pin the issue, so it gives more visibility, but I barely put enough time into this project to answer questions, and I cannot really work on fixing bugs (especially complex ones).
I understand, and thank you for all the work you've put in so far. It's a great project, and we're grateful for it. @jeertmans
@0x2a-42 Herring sounds terrific, and I'll plug it in and see how it does today.
Question: is it possible to get the actual code that Herring generates? I think it's better to generate actual text files at build time and check them into the project, the way LALRPOP does it. I've found it super useful to be able to examine the generated code, and in one small corner case I had to modify the generated LALRPOP code to make something work.
> I understand, and thank you for all the work you've put in so far. It's a great project, and we're grateful for it.
All credits are to @maciejhirsz, I did nothing but maintain the tool and write some docs, but thanks ;)
@ccleve: Thanks.
> Question: is it possible to get the actual code that Herring generates? …
Well, procedural macros are not really supposed to write files, so I would probably not add such a feature to the derive macro. The most sensible way to achieve this would be another input format (e.g., JSON) and an executable that generates code from it.
If you don't need an automated process, you can just use cargo expand to inspect the generated code. It requires a tiny bit of manual cleanup if you use other derive macros, such as Debug on the token enum, since cargo expand expands all macros.
@ccleve: Adding something like the logos-cli or a library that can be called from the build.rs file would also be a possibility.
@0x2a-42
Adding something like the logos-cli or a library that can be called from the build.rs file would also be a possibility.
Thanks. I was unaware of logos-cli, but I think the build.rs approach would be best. This is all I have to do with LALRPOP:
lalrpop::Configuration::new()
    .generate_in_source_tree()
    .process()
    .unwrap();
Fixed by #491
The following passes
mod issue_420 {
    use logos::Logos;

    #[derive(Logos, Debug, PartialEq)]
    #[logos(skip r".|[\r\n]")] // skip everything not recognized
    pub enum LogosToken<'s> {
        // any letter except capital Z
        #[regex(r"[a-zA-Y]+", priority = 3)]
        WordExceptZ(&'s str),
        // any number
        #[regex(r"[0-9]+", priority = 3)]
        Number(&'s str),
        /*
        This expression is:
            (letter or number)* [Z] (letter or number)*
        In other words, a token with any number of letters or numbers,
        including at least one capital Z.
        */
        #[regex(r"[a-zA-Z0-9]*[Z][a-zA-Z0-9]*", priority = 3)]
        TermWithZ(&'s str),
    }

    #[test]
    fn test_logos() {
        use LogosToken::*;
        let lex = LogosToken::lexer("hello 42world fooZfoo");
        let tokens: Vec<_> = lex.collect();
        assert_eq!(
            tokens,
            [
                Ok(WordExceptZ("hello")),
                Ok(Number("42")),
                Ok(WordExceptZ("world")),
                Ok(TermWithZ("fooZfoo")),
            ]
        )
    }
}