logos icon indicating copy to clipboard operation
logos copied to clipboard

Regex does not match expected input

Open chrisdesa opened this issue 3 years ago • 9 comments

The regex /\*(?:[^*]|\*+[^*/])+\*+/ should match any C-style block comment, and indeed it does seem to match such strings in the regex library. But in logos, it seems to cause an error. E.g. if I run the following code

use logos::Logos;
use regex::Regex;

#[derive(Logos, Debug, PartialEq)]
enum Token {
    // Tokens can be literal strings, of any length.
    #[token("fast")]
    Fast,

    #[token(".")]
    Period,

    // Or regular expressions.
    #[regex("[a-zA-Z]+")]
    Text,

    #[regex(r"/\*(?:[^*]|\*+[^*/])+\*+/")]
    Comment,

    #[regex(r"[ \n\t\f]+", logos::skip)]
    #[error]
    Error,
}

fn main() {
    let re = Regex::new(r"^/\*(?:[^*]|\*+[^*/])+\*+/$").unwrap();
    assert!(re.is_match("/* comment */"));

    let mut lex = Token::lexer("Create ridiculously /* comment */ fast Lexers.");

    while let Some(t) = lex.next() {
    	println!("{:?}: {}", t, lex.slice())
    }
}

This code outputs

Text: Create
Text: ridiculously
Error: /* comment *
Error: /
Fast: fast
Text: Lexers
Period: .

which is not what I would expect: I would expect the Comment token to match the /* comment */ in the string to be lexed.

chrisdesa avatar Nov 24 '20 20:11 chrisdesa

Your regex is a bit too eager for Logos, try: r"/\*([^*]|\*[^/])*\*/".

maciejhirsz avatar Nov 29 '20 09:11 maciejhirsz

r"/\*([^*]|\*[^/])*\*/"

Unfortunately this matches /* **/ hi */ as one huge comment -- a minimal fix is to replace \*[^/] with \**[^*/].

mb64 avatar Nov 29 '20 11:11 mb64

Your regex is a bit too eager for Logos, try: r"/\*([^*]|\*[^/])*\*/".

In addition to matching /* **/ hi */ in the way described above, this also does not match /* comment **/. And the proposed change by @mb64 fixes the issue (although I also need to add a + after the last \*), but then the original bug comes back! When I use the regular expression regex(r"/\*([^*]|\**[^*/])*\*+/" I get the same error as before (which is not that surprising because it is essentially the same regex).

Is the issue that Logos doesn't support regular languages with star height greater than 1?

chrisdesa avatar Nov 30 '20 04:11 chrisdesa

Is the issue that Logos doesn't support regular languages with star height greater than 1?

No, the problem is that logos works by optimally only ever reading any byte once, and it has very limited backtracking (unlike full blown regex), I thought the block comment was a solved case but apparently I'm wrong. I'll have to take a deeper dive to see what can be done there, it might also make sense for me to take a look at regex-automata as a replacement for the graph resolution Logos is doing atm, but that's more of another moonshot than a quick fix.

In either case, Logos has pretty powerful callbacks now, so you can handle block comments with them:

#[token("/*", |lex| {
    let len = lex.remainder().find("*/")?;
    lex.bump(len + 2); // include len of `*/`

    Some(())
})]
Comment,

This will be tiny bit slower than the generated code that macro produces, but shouldn't be much of an issue.

maciejhirsz avatar Dec 01 '20 10:12 maciejhirsz

Using your explanation, I think I was able to fix the problem: or at least get the C-style block comments to work without needing the callback! The issue seems to be the backtracking as you said, so the fix that seems to work is to add an Error case that just consumes the rest of the file if it sees an open block comment with no close. E.g., this code:

use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
enum Token {
    // Tokens can be literal strings, of any length.
    #[token("fast")]
    Fast,

    #[token(".")]
    Period,

    // Or regular expressions.
    #[regex("[a-zA-Z]+")]
    Text,

    #[regex(r"/\*([^*]|\**[^*/])*\*+/")]
    Comment,

    #[regex(r"[ \n\t\f]+", logos::skip)]
    #[regex(r"/\*([^*]|\*+[^*/])*\*?")]  // added error to eat unclosed comments
    #[error]
    Error,
}

fn main() {
    let mut lex = Token::lexer("Create even /* comment **/ lexers /* comment 2 */");

    while let Some(t) = lex.next() {
    	println!("{:?}: {}", t, lex.slice())
    }
}

correctly outputs

Text: Create
Text: even
Comment: /* comment **/
Text: lexers
Comment: /* comment 2 */

whereas if I comment out that added error line it gives, as before,

Text: Create
Text: even
Error: /* comment **
Error: /
Text: lexers
Error: /* comment 2 *
Error: /

Hopefully this helps with diagnosing the problem (if there even is a problem).

chrisdesa avatar Dec 01 '20 18:12 chrisdesa

@chrisdesa that's definitely helpful, thank you!

maciejhirsz avatar Dec 01 '20 18:12 maciejhirsz

@maciejhirsz if I understand the drawback to the callback solution is it does not support logos::skip, which is handy if you just want to ignore comments. I found this issue though has already pointed that out: https://github.com/maciejhirsz/logos/issues/235

evbo avatar Feb 17 '22 17:02 evbo

The title of this issue is a bit misleading, as there is neither a compilation nor runtime error discussed.

It would be better to change it to "wrong result" or something like that. The result is erroneous, but not in the strict sense of Error.

corneliusroemer avatar Dec 26 '23 12:12 corneliusroemer

done @corneliusroemer :-)

jeertmans avatar Dec 26 '23 13:12 jeertmans