Detect EOF in regular expressions

Open ebkalderon opened this issue 4 years ago • 1 comments

It would be nice if the \W anchor were supported in #[regex] strings, so that certain regular expressions could still succeed when EOF is encountered early. For example:

use logos::Logos;

#[derive(Logos)]
enum Token {
    #[end]
    End,
    #[error]
    Error,
    // Proposed syntax: the trailing alternative lets the match stop at EOF.
    #[regex = r"/\*([^*]|\*+[^*/]|[^\W])*(\*+/|\W)"]
    BlockComment,
}

The regex above would match both /* hello */ and /* hello, where the latter is immediately followed by #[end]. This would let the lexer produce the token successfully even if the input ends before the closing */ is seen, and we could check at a later stage whether the comment is properly terminated by calling lexer.slice().ends_with("*/").

See TokenKind::BlockComment { terminated: bool } in rustc_lexer for an example of this common lexing pattern.
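
To make the pattern concrete, here is a minimal sketch that uses only the standard library; it is not Logos code and not how rustc_lexer is implemented. `BlockComment` and `scan_block_comment` are hypothetical names chosen for illustration, and comments are assumed not to nest.

```rust
/// Result of scanning one block comment (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
struct BlockComment<'a> {
    /// The matched text, starting at the opening `/*`.
    slice: &'a str,
    /// True only if a closing `*/` was found before the input ended.
    terminated: bool,
}

/// Scans a non-nested block comment at the start of `input`.
/// Returns `None` if `input` does not start with `/*`.
fn scan_block_comment(input: &str) -> Option<BlockComment<'_>> {
    // Search for the closer only in the text after the opener.
    let body = input.strip_prefix("/*")?;
    match body.find("*/") {
        // Closer found: the token spans the opener (2 bytes), the body up
        // to `pos`, and the closer (2 bytes).
        Some(pos) => Some(BlockComment {
            slice: &input[..pos + 4],
            terminated: true,
        }),
        // EOF came first: still produce a token covering the rest of the
        // input, but flag it as unterminated.
        None => Some(BlockComment {
            slice: input,
            terminated: false,
        }),
    }
}
```

Calling `scan_block_comment("/* hello")` yields a token with `terminated: false` instead of an error. The flag is taken from the search result rather than from `slice.ends_with("*/")` because the three-character input `/*/` ends with `*/` without actually being closed.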

ebkalderon, Mar 14 '20 16:03

Might still add this to regex itself, but you will be able to handle this with callbacks next release, reference: #103.

maciejhirsz, Apr 04 '20 09:04

Closing this as too old, feel free to re-open if needed.

jeertmans, Feb 07 '24 11:02