Error with overlapping token definitions
I'm getting a strange error, apparently when one regex can match the prefix of another regex. Maybe; I'm honestly not sure what the problem is. Here's a simplified case:
#[derive(Logos, Debug, PartialEq)]
#[logos(skip r".|[\r\n]")] // skip everything not recognized
pub enum LogosToken {
    // any letter except capital Z
    #[regex(r"[a-zA-Y]+", priority = 3)]
    WordExceptZ,
    // any number
    #[regex(r"[0-9]+", priority = 3)]
    Number,
    /*
    This expression is:
        (letter or number)* [Z] (letter or number)*
    In other words, a token with any number of letters or numbers,
    including at least one capital Z.
    */
    #[regex(r"[a-zA-Z0-9]*[Z][a-zA-Z0-9]*", priority = 3)]
    TermWithZ,
}

#[pg_extern]
fn test_logos() {
    let mut lex = LogosToken::lexer("hello 42world fooZfoo");
    while let Some(result) = lex.next() {
        let slice = lex.slice();
        println!("{:?} {:?}", slice, result);
    }
}
This generates:
"hello" Ok(WordExceptZ)
"42world" Err(())
"fooZfoo" Ok(TermWithZ)
If I replace the regex over TermWithZ with #[regex(r"Z", priority = 3)], I get:
"hello" Ok(WordExceptZ)
"42" Ok(Number)
"world" Ok(WordExceptZ)
"foo" Ok(WordExceptZ)
"Z" Ok(TermWithZ)
"foo" Ok(WordExceptZ)
The "42world" is getting recognized correctly as a number and word.
What I don't understand is why the first TermWithZ regex messes up the recognition of "42world". It doesn't contain a Z, so TermWithZ should ignore it completely and let the first two variants do their job.
After looking at this a bit more, I'm guessing I'm running into the "no backtracking" limitation.
Any workaround suggestions are welcome.
Hello! Can you please run in debug mode and print out the corresponding graph?
See https://logos.maciej.codes/debugging.html.
@jeertmans
{
1: ::<skip> (<skip>),
2: [80-BF] ⇒ 1,
3: [A0-BF][80-BF] ⇒ 1,
4: [80-BF][80-BF] ⇒ 1,
5: [80-9F][80-BF] ⇒ 1,
6: [90-BF][80-BF][80-BF] ⇒ 1,
7: [80-BF][80-BF][80-BF] ⇒ 1,
8: [80-8F][80-BF][80-BF] ⇒ 1,
10: ::WordExceptZ,
13: ::Number,
16: ::TermWithZ,
17: {
[0-9] ⇒ 17,
[A-Z] ⇒ 17,
[a-z] ⇒ 17,
_ ⇒ 16,
},
18: Z ⇒ 17,
19: {
[0-9] ⇒ 19,
[A-Z] ⇒ 19,
[a-z] ⇒ 19,
_ ⇒ 18,
},
22: {
[0-9] ⇒ 24,
[A-Y] ⇒ 19,
Z ⇒ 25,
[a-z] ⇒ 19,
_ ⇒ 13,
},
24: {
[0-9] ⇒ 24,
[A-Y] ⇒ 19,
Z ⇒ 25,
[a-z] ⇒ 19,
_ ⇒ 13,
},
25: {
[0-9] ⇒ 25,
[A-Z] ⇒ 25,
[a-z] ⇒ 25,
_ ⇒ 16,
},
27: {
[0-9] ⇒ 19,
[A-Y] ⇒ 29,
Z ⇒ 25,
[a-z] ⇒ 29,
_ ⇒ 10,
},
29: {
[0-9] ⇒ 19,
[A-Y] ⇒ 29,
Z ⇒ 25,
[a-z] ⇒ 29,
_ ⇒ 10,
},
32: {
[0-9] ⇒ 25,
[A-Z] ⇒ 25,
[a-z] ⇒ 25,
_ ⇒ 16,
},
33: {
[00-/] ⇒ 1,
[0-9] ⇒ 22,
[:-@] ⇒ 1,
[A-Y] ⇒ 27,
Z ⇒ 32,
[[-`] ⇒ 1,
[a-z] ⇒ 27,
[{-7F] ⇒ 1,
[C2-DF] ⇒ 2,
[E0] ⇒ 3,
[E1-EC] ⇒ 4,
[ED] ⇒ 5,
[EE-EF] ⇒ 4,
[F0] ⇒ 6,
[F1-F3] ⇒ 7,
[F4] ⇒ 8,
},
}
Hmm, so I think you are right that the error might come from the no-backtracking issue, but a perfect Logos implementation shouldn't have that problem.
First question: are you using Logos >=0.14.0? As it may have fixed some issues.
Second: did you try not setting any priority? Here you set the priority to 3 on all tokens, which doesn't make much sense: priority is only used when two or more patterns match the same slice, and they are then differentiated by their priority. If the number is the same for all of them, it doesn't help. So please try without any priority, and then adjust one priority at a time.
Last, patterns embedded in others, like TermWithZ containing both WordExceptZ and Number, are often a source of backtracking issues.
Yes, using 0.14.1.
I had to set the priorities > 3 because of the skip expression,
#[logos(skip r".|[\r\n]")]
The "." can match anything, so there was a conflict. Other than that, the priorities can all be the same.
I'm not sure I understand the third point. How else would you do it?
> Yes, using 0.14.1.

Ok, perfect.

> I had to set the priorities > 3 because of the skip expression, #[logos(skip r".|[\r\n]")]. The "." can match anything, so there was a conflict. Other than that, the priorities can all be the same.

Ok, seems legit; I wasn't aware of that.

> I'm not sure I understand the third point. How else would you do it?
Usually, you can break your logic down into unique, non-overlapping tokens, and then use callbacks and extras to handle more complex logic. Unfortunately, I don't have enough time to dig into this problem and really understand the root cause of why it doesn't work :-/
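A rough sketch of that suggestion, assuming the original example's token set (this is my own illustration, not Logos API): lex only non-overlapping tokens (words, numbers, a bare Z), keep their byte spans, and merge any unbroken run that contains a Z into a single TermWithZ in a post-processing pass:

```rust
use std::ops::Range;

// Hypothetical token type for the post-processing pass; TermWithZ is
// only ever produced by the merge step, never by the lexer itself.
#[derive(Debug, PartialEq, Clone)]
enum Tok {
    Word(String),
    Number(String),
    Z,
    TermWithZ(String),
}

/// Merge every unbroken run of Word/Number/Z tokens containing at
/// least one Z into a single TermWithZ. A run is unbroken when each
/// token starts at the byte offset where the previous one ended.
fn merge_z_terms(toks: Vec<(Tok, Range<usize>)>) -> Vec<Tok> {
    fn flush(run: &mut Vec<(Tok, Range<usize>)>, out: &mut Vec<Tok>) {
        if run.iter().any(|(t, _)| *t == Tok::Z) {
            // The run contains a Z: concatenate it into one token.
            let text: String = run
                .iter()
                .map(|(t, _)| match t {
                    Tok::Word(s) | Tok::Number(s) | Tok::TermWithZ(s) => s.as_str(),
                    Tok::Z => "Z",
                })
                .collect();
            out.push(Tok::TermWithZ(text));
        } else {
            // No Z anywhere in the run: emit the tokens unchanged.
            out.extend(run.iter().map(|(t, _)| t.clone()));
        }
        run.clear();
    }

    let mut out = Vec::new();
    let mut run: Vec<(Tok, Range<usize>)> = Vec::new();
    for (tok, span) in toks {
        let adjacent = run.last().map_or(true, |(_, prev)| prev.end == span.start);
        if !adjacent {
            flush(&mut run, &mut out);
        }
        run.push((tok, span));
    }
    flush(&mut run, &mut out);
    out
}
```

Feeding this pass the spanned tokens a simple lexer would produce for "hello 42world fooZfoo" yields Word("hello"), Number("42"), Word("world"), TermWithZ("fooZfoo"), which is the segmentation the original example is after.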
It's good to see a bit of activity on the no-backtracking issue. I'm not fully convinced that that's what the problem is here, though.
Can I bump this issue? It has come up again in my project.
@0x2a-42 @jeertmans
> Can I bump this issue? It has come up again in my project.

What do you mean by "bump"?
https://www.google.com/search?client=firefox-b-1-d&q=bump+a+post+definition
To "bump" a post means to bump it up on the priority list, that is, to bring it to your attention again.
Here is a simpler version of the problem:
#[derive(Logos, Debug, PartialEq)]
#[logos(skip r" ")]
pub enum LogosTester {
    #[regex(r"[a-z]+")]
    Word,
    #[regex(r"[0-9]+")]
    Number,
    // identifier followed by X
    #[regex(r"[a-z]([a-z0-9])*X")]
    Fieldname,
}

pub fn logos_test() {
    let mut lex = LogosTester::lexer("a1");
    while let Some(result) = lex.next() {
        println!("{:?}", result);
    }
}
Running this returns Err(()).
Here is the debug dump:
{
1: ::<skip> (<skip>),
3: ::Word,
6: ::Number,
7: {
[0-9] ⇒ 7,
_ ⇒ 6,
},
9: ::Fieldname,
10: X ⇒ 9,
11: {
[0-9] ⇒ 11,
[a-z] ⇒ 11,
_ ⇒ 10,
},
13: {
[0-9] ⇒ 11,
X ⇒ 9,
[a-z] ⇒ 13,
_ ⇒ 3,
},
15: {
⇒ 1,
[0-9] ⇒ 7,
[a-z] ⇒ 13,
},
}
@ccleve: Your example works with Herring, where the following minimized DFA is produced, which looks close enough to the Logos debug output.
flowchart LR
style start fill:#FFFFFF00, stroke:#FFFFFF00
start-->0;
0@{shape: circ}
0 -- "{0x20}" --> 1
0 -- "{'0'-'9'}" --> 2
0 -- "{'a'-'z'}" --> 3
1@{shape: dbl-circ}
2@{shape: dbl-circ}
2 -- "{'0'-'9'}" --> 2
3@{shape: dbl-circ}
3 -- "{'0'-'9'}" --> 4
3 -- "{'X'}" --> 5
3 -- "{'a'-'z'}" --> 3
4@{shape: circ}
4 -- "{'0'-'9', 'a'-'z'}" --> 4
4 -- "{'X'}" --> 5
5@{shape: dbl-circ}
skipped_regex_0[skipped regex]@{shape: rect}
1 .-> skipped_regex_0
Number_0[Number]@{shape: rect}
2 .-> Number_0
Word_0[Word]@{shape: rect}
3 .-> Word_0
Fieldname_0[Fieldname]@{shape: rect}
5 .-> Fieldname_0
Feel free to use that crate if it solves your problem. It probably will not be merged into Logos anytime soon, as it currently requires an additional LLVM optimization to reach performance similar to Logos.
The problem with the code generated by Logos seems to be that in its state 11, where the end of input is reached, the state machine still transitions to state 10, where it will eventually return an error instead of the longest match.
fn goto11_ctx10_x<'s>(lex: &mut Lexer<'s>) {
    while let Some(arr) = lex.read::<&[u8; 16]>() {
        // unrolled loop ...
    }
    while lex.test(pattern0) {
        lex.bump_unchecked(1);
    }
    goto10_ctx10_x(lex);
}
To fix this, Logos would have to keep track of the last accepting state and the offset at which it was visited.
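That bookkeeping can be sketched with a small hand-rolled state machine (my own illustration of maximal munch with rollback for the simplified Word/Number/Fieldname grammar above; this is not Logos-generated code):

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Token {
    Word,
    Number,
    Fieldname,
}

/// Lex one token from input[start..], returning the token and the
/// offset just past it, or None if no token matches at all.
fn next_token(input: &[u8], start: usize) -> Option<(Token, usize)> {
    let mut pos = start;
    // Longest accepting prefix seen so far: the key extra state.
    let mut last_accept: Option<(Token, usize)> = None;
    let mut state = 0u8;

    loop {
        let b = input.get(pos).copied();
        state = match (state, b) {
            // State 0: start.
            (0, Some(b'0'..=b'9')) => 1,
            (0, Some(b'a'..=b'z')) => 2,
            // State 1: inside a Number.
            (1, Some(b'0'..=b'9')) => 1,
            // State 2: a Word that is also a possible Fieldname prefix.
            (2, Some(b'a'..=b'z')) => 2,
            (2, Some(b'0'..=b'9')) => 3,
            (2, Some(b'X')) => 4,
            // State 3: Fieldname body after a digit (no longer a Word).
            (3, Some(b'a'..=b'z')) | (3, Some(b'0'..=b'9')) => 3,
            (3, Some(b'X')) => 4,
            // No transition (including end of input): instead of
            // failing, roll back to the last accepting state.
            _ => return last_accept,
        };
        pos += 1;
        match state {
            1 => last_accept = Some((Token::Number, pos)),
            2 => last_accept = Some((Token::Word, pos)),
            4 => return Some((Token::Fieldname, pos)),
            _ => {}
        }
    }
}
```

On the input "a1", the machine dies in the Fieldname-prefix state at end of input, but rolls back to the remembered (Word, offset 1) instead of returning an error; the caller can then resume at offset 1 and lex the Number. That rollback is exactly what the Logos code above is missing.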
The equivalent state 4 in the code generated by Herring looks like this.
State::S4 => {
    match lexer.next_byte() {
        Some(b) if LUT0[b as usize] & 1u8 > 0 => {
            state = State::S4;
            continue;
        }
        Some(88u8) => {
            state = State::S5;
            continue;
        }
        None => {
            lexer.offset -= 1;
            break;
        }
        _ => break,
    }
}
Here the None branch breaks out of the state-machine loop, and afterwards a Word token will be returned, since state 3 was the last accepting state that was visited.
State::S3 => {
    last_accept = LastAccept::Token(LogosTester::Word, lexer.offset);
    // ...

match last_accept {
    LastAccept::None => {
        use herring::Source;
        while !lexer.source.is_boundary(lexer.offset) {
            lexer.offset += 1;
        }
        return Some(Err(Default::default()));
    }
    LastAccept::Token(token, offset) => {
        lexer.offset = offset;
        return Some(Ok(token));
    }
    LastAccept::TokenCallback(callback, offset) => {
        lexer.offset = offset;
        return Some(callback(lexer));
    }
    LastAccept::Skip(offset) => {
        lexer.offset = offset;
    }
    LastAccept::SkipCallback(callback, offset) => {
        lexer.offset = offset;
        callback(lexer);
    }
}
> To "bump" a post means to bump it up on the priority list, that is, to bring it to your attention again. …
I got this, but there is no such thing on GitHub, as we don't have tasks or projects. I will pin the issue, so it gives more visibility, but I barely put enough time into this project to answer questions, and I cannot really work on fixing bugs (especially complex ones).
I understand, and thank you for all the work you've put in so far. It's a great project, and we're grateful for it. @jeertmans
@0x2a-42 Herring sounds terrific, and I'll plug it in and see how it does today.
Question: is it possible to get the actual code that Herring generates? I think it's better to generate actual text files at build time and check them into the project, the way LALRPOP does it. I've found it super useful to be able to examine the generated code, and in one small corner case I had to modify the generated LALRPOP code to make something work.
> I understand, and thank you for all the work you've put in so far. It's a great project, and we're grateful for it.
All credits are to @maciejhirsz, I did nothing but maintain the tool and write some docs, but thanks ;)
@ccleve: Thanks.
> Question: is it possible to get the actual code that Herring generates? …
Well, procedural macros are not really supposed to write files, so I would probably not add such a feature to the derive macro. The most sensible way to achieve this would be another input format (e.g., JSON) and an executable that generates code from it.
If you don't need an automated process, you can just use cargo expand to inspect the generated code. It requires a tiny bit of manual cleanup if you use other derive macros, such as Debug on the token enum, since cargo expand expands all macros.
@ccleve: Adding something like the logos-cli or a library that can be called from the build.rs file would also be a possibility.
@0x2a-42
Adding something like the logos-cli or a library that can be called from the build.rs file would also be a possibility.
Thanks. I was unaware of logos-cli, but I think the build.rs approach would be best. This is all I have to do with LALRPOP:
lalrpop::Configuration::new()
    .generate_in_source_tree()
    .process()
    .unwrap();
Fixed by #491
The following passes
mod issue_420 {
    use logos::Logos;

    #[derive(Logos, Debug, PartialEq)]
    #[logos(skip r".|[\r\n]")] // skip everything not recognized
    pub enum LogosToken<'s> {
        // any letter except capital Z
        #[regex(r"[a-zA-Y]+", priority = 3)]
        WordExceptZ(&'s str),
        // any number
        #[regex(r"[0-9]+", priority = 3)]
        Number(&'s str),
        /*
        This expression is:
            (letter or number)* [Z] (letter or number)*
        In other words, a token with any number of letters or numbers,
        including at least one capital Z.
        */
        #[regex(r"[a-zA-Z0-9]*[Z][a-zA-Z0-9]*", priority = 3)]
        TermWithZ(&'s str),
    }

    #[test]
    fn test_logos() {
        use LogosToken::*;
        let lex = LogosToken::lexer("hello 42world fooZfoo");
        let tokens: Vec<_> = lex.collect();
        assert_eq!(
            tokens,
            [
                Ok(WordExceptZ("hello")),
                Ok(Number("42")),
                Ok(WordExceptZ("world")),
                Ok(TermWithZ("fooZfoo")),
            ]
        )
    }
}