logos Common regexes for the Logos Handbook

Not to go off-topic in #132, but it's pretty common for someone to end up reinventing a regex for something that can be done better, or to have trouble finding one that works (#124). This is doubly problematic since Logos doesn't support non-greedy matching (for good reasons). There is a lot of overlap when lexing programming languages on things like:

Comments
Quoted strings
Floating point numbers with scientific notation

I already started noodling on a book to supplement the API docs, and I plan on having a chapter with commonly used regexes so people can look things up quickly. If you can think of something that might be a common use case, please add it in the comments below so it can be included in the book.

Apr 25 '20 09:04 maciejhirsz

It's probably worth including an example for lexing (most of?) Rust's lexical structure, because that serves as a common baseline that users of logos probably understand. That said, here's a dump of the regexes I've been using for the tokens I think are likely to be more generally reusable:

line comment (excluding trailing newline): //[^\n]*
line comment (including trailing newline): //[^\n]*\n?
block comment (unnested): /\*(?:[^*]|\*[^/])*\*/
identifier (UAX#31): \p{XID_Start}\p{XID_Continue}*
identifier (Rust): [\p{XID_Start}_]\p{XID_Continue}*
identifier (traditional ASCII): [_a-zA-Z][_0-9a-zA-Z]*
binary integer: 0b_*[01][_01]*
octal integer: 0o_*[0-7][_0-7]*
decimal integer: [0-9][_0-9]*
decimal float: (?&digits)(?:e(?&digits)|\.(?&digits)(?:e(?&digits))?)
string (minimal escapes): "(?:[^"]|\\")*"
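
Since the integer patterns above allow `_` as a digit separator, the matched slice can't be handed to a number parser as-is. A minimal sketch of the conversion step, assuming the lexeme was already matched by one of the binary, octal, or decimal patterns (`parse_int` is a hypothetical helper name, not part of Logos):

```rust
/// Converts a lexeme matched by the binary, octal, or decimal integer
/// patterns into a value. Hypothetical helper; assumes the lexer has
/// already validated the overall shape of the lexeme.
fn parse_int(lexeme: &str) -> Option<u64> {
    // Pick the radix from the prefix, if there is one.
    let (radix, digits) = if let Some(rest) = lexeme.strip_prefix("0b") {
        (2, rest)
    } else if let Some(rest) = lexeme.strip_prefix("0o") {
        (8, rest)
    } else {
        (10, lexeme)
    };

    // Drop the `_` separators before handing the digits to the std parser.
    let cleaned: String = digits.chars().filter(|&c| c != '_').collect();

    // Returns None on overflow or when no digits are left.
    u64::from_str_radix(&cleaned, radix).ok()
}
```

In a Logos lexer this kind of function would typically be called on the matched slice from a callback; the wiring is left out here to keep the sketch dependency-free.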

Apr 25 '20 21:04 CAD97

I regret nothing.

[+-]?(([0-9][_0-9]*\.([0-9][_0-9]*)?([eE][+-]?([0-9][_0-9]*))?[fFdD]?)|(\.([0-9][_0-9]*)([eE][+-]?([0-9][_0-9]*))?[fFdD]?)|(([0-9][_0-9]*)([eE][+-]?([0-9][_0-9]*))[fFdD]?)|(([0-9][_0-9]*)([eE][+-]?([0-9][_0-9]*))?[fFdD]))

This matches Java-style floating-point literals.

So, it'll match stuff like:

2.2F
3.5F
.0
.2D
.2F
.2e2
-2e2

And reject stuff like:

1
.F
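
To make the four alternatives in that regex easier to follow, here is a rough hand-written equivalent using only the standard library. It is a sketch that checks a whole string rather than a lexer prefix, and the names `digits`, `exponent`, and `is_java_float` are made up for illustration:

```rust
/// Consumes `[0-9][_0-9]*` starting at `i`; returns the index just past it.
fn digits(b: &[u8], i: usize) -> Option<usize> {
    if !b.get(i)?.is_ascii_digit() {
        return None;
    }
    let mut end = i + 1;
    while matches!(b.get(end), Some(c) if c.is_ascii_digit() || *c == b'_') {
        end += 1;
    }
    Some(end)
}

/// Consumes `[eE][+-]?[0-9][_0-9]*` starting at `i`.
fn exponent(b: &[u8], i: usize) -> Option<usize> {
    if !matches!(b.get(i), Some(b'e' | b'E')) {
        return None;
    }
    let mut j = i + 1;
    if matches!(b.get(j), Some(b'+' | b'-')) {
        j += 1;
    }
    digits(b, j)
}

/// Whole-string check mirroring the four alternatives of the regex.
fn is_java_float(s: &str) -> bool {
    let b = s.as_bytes();
    let mut i = 0;
    if matches!(b.first(), Some(b'+' | b'-')) {
        i = 1;
    }

    // Optional integer part.
    let int_end = digits(b, i);
    let has_int = int_end.is_some();
    i = int_end.unwrap_or(i);

    // Optional dot, followed by an optional fraction.
    let has_dot = b.get(i) == Some(&b'.');
    let mut has_frac = false;
    if has_dot {
        i += 1;
        if let Some(j) = digits(b, i) {
            has_frac = true;
            i = j;
        }
    }

    // Optional exponent and optional f/F/d/D suffix.
    let mut has_exp = false;
    if let Some(j) = exponent(b, i) {
        has_exp = true;
        i = j;
    }
    let mut has_suffix = false;
    if matches!(b.get(i), Some(b'f' | b'F' | b'd' | b'D')) {
        has_suffix = true;
        i += 1;
    }
    if i != b.len() {
        return false; // trailing input that is not part of the literal
    }

    match (has_int, has_dot) {
        (true, true) => true,                    // 1.  1.5  1.5e3f
        (false, true) => has_frac,               // .5  .5e3  .5f
        (true, false) => has_exp || has_suffix,  // 1e3  1f  1e3f
        (false, false) => false,
    }
}
```

The final `match` is the part worth reading: a bare integer such as `1` is rejected because it has neither a dot, an exponent, nor a suffix, and `.F` is rejected because a leading dot requires fraction digits.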

Sep 04 '20 14:09 Jezza

I found out about subpatterns about 3 hours ago and in that time I cleaned up the regexes I use.

use logos::Logos;

#[derive(Logos)]
#[logos(subpattern decimal = r"[0-9][_0-9]*")]
#[logos(subpattern hex = r"[0-9a-fA-F][_0-9a-fA-F]*")]
#[logos(subpattern octal = r"[0-7][_0-7]*")]
#[logos(subpattern binary = r"[0-1][_0-1]*")]
#[logos(subpattern exp = r"[eE][+-]?[0-9][_0-9]*")]
enum TokenKind {
	#[regex(r"//.*\n?", logos::skip)]
	#[regex(r"[ \t\n\f]+", logos::skip)]
	#[error]
	Error,

	#[regex("(?&decimal)")]
	Integer,

	#[regex("0[xX](?&hex)")]
	HexInteger,

	#[regex("0[oO](?&octal)")]
	OctalInteger,

	#[regex("0[bB](?&binary)")]
	BinaryInteger,

	#[regex(r#"[+-]?(((?&decimal)\.(?&decimal)?(?&exp)?[fFdD]?)|(\.(?&decimal)(?&exp)?[fFdD]?)|((?&decimal)(?&exp)[fFdD]?)|((?&decimal)(?&exp)?[fFdD]))"#)]
	Float,

	#[regex(r"0[xX](((?&hex))|((?&hex)\.)|((?&hex)?\.(?&hex)))[pP][+-]?(?&decimal)[fFdD]?")]
	HexFloat,
}

This isn't all of the tokens I use, but I feel like a lot of these are very common. (With the exception of HexFloat...)

It should be a good starting point for a lot of people wanting to not care about numbers (Specifically floats).

Sep 04 '20 17:09 Jezza

block comment (unnested): /*(?:[^*]|\*[^/])*\*/

This regex has a few issues:

Missing \ before the first *
Fails to match /***/

Here is the version I am using: /\*([^*]|\*+[^*/])*\*+/ (test cases)
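
For anyone who wants to cross-check that regex, or who ends up scanning comments by hand instead, the same language can be recognized with a small loop. This is only a sketch using the standard library; `block_comment_len` is a hypothetical name, and how you would hook it into Logos is not shown:

```rust
/// Returns the byte length of the unnested block comment at the start of
/// `input`, or None if `input` does not start with a terminated comment.
/// Intended to accept the same strings as `/\*([^*]|\*+[^*/])*\*+/`.
fn block_comment_len(input: &str) -> Option<usize> {
    let bytes = input.as_bytes();
    if !bytes.starts_with(b"/*") {
        return None;
    }

    // `prev_star` is true when the previous byte inside the comment body
    // was `*`. The `*` of the opening `/*` deliberately does not count,
    // so `/*/` is not treated as a complete comment.
    let mut prev_star = false;
    let mut i = 2;
    while i < bytes.len() {
        match bytes[i] {
            b'/' if prev_star => return Some(i + 1), // found the closing `*/`
            b'*' => prev_star = true,
            _ => prev_star = false,
        }
        i += 1;
    }
    None // unterminated comment
}
```

Working on bytes is safe here because `/` and `*` are ASCII, so the returned length always falls on a character boundary. The loop stops at the first `*/`, which is the behaviour you would otherwise reach for a non-greedy `.*?` to get.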

Mar 03 '21 06:03 HactarCE

@HactarCE I'm getting an [Error] with that regex if I try to do something as simple as:

use logos::Logos;

#[derive(Logos, Debug, PartialEq, Copy, Clone)]
pub(crate) enum CommentLexer {
    #[regex(r"/\*([^*]|\*+[^*/])*\*+/")]
    Comment,
    #[error]
    Error,
}

and try to lex /* basic */ (or any comment really). Any idea what's wrong?

Edit: got it working via callbacks https://github.com/maciejhirsz/logos/issues/180#issuecomment-736401091

May 22 '21 07:05 Keats

\p{XID_Start}\p{XID_Continue}*

Maybe that's intentional, but this regex does not match identifiers starting with underscore. I personally use (\p{XID_Start}|_)\p{XID_Continue}*.

https://regex101.com/r/4gYRqp/1 https://regex101.com/r/z7glYD/1

Jul 28 '23 11:07 elenakrittik

Hello @elenakrittik, do you know if regex-syntax supports that kind of regex?

Aug 02 '23 15:08 jeertmans

Not sure which "kind" you are referring to. If you're talking about \p classes, then I think yes, because the lexer tests I've written for the above-mentioned regex (which include multiple languages and symbols, and combinations of them) pass just fine.

EDIT: Specifically, this test does pass:

#[test]
fn test_identifier() {
    test_eq!("x", Token::Identifier("x"));
    test_eq!("xyz", Token::Identifier("xyz"));
    test_eq!("XYZ", Token::Identifier("XYZ"));
    test_eq!("X1", Token::Identifier("X1"));
    test_eq!("X1X", Token::Identifier("X1X"));
    test_eq!("X_", Token::Identifier("X_"));
    test_eq!("X_X", Token::Identifier("X_X"));
    test_eq!("_X", Token::Identifier("_X"));
    test_eq!("X1_", Token::Identifier("X1_"));
    test_eq!("X_1", Token::Identifier("X_1"));
    test_eq!("X_1X", Token::Identifier("X_1X"));
    test_eq!("X1_X", Token::Identifier("X1_X"));
    test_eq!("X__X", Token::Identifier("X__X"));
    test_eq!("X_X1", Token::Identifier("X_X1"));

    test_eq!("你", Token::Identifier("你"));
    test_eq!("你好", Token::Identifier("你好"));
    test_eq!("你1", Token::Identifier("你1"));
    test_eq!("你1你", Token::Identifier("你1你"));
    test_eq!("你_", Token::Identifier("你_"));
    test_eq!("你_你", Token::Identifier("你_你"));
    test_eq!("_你", Token::Identifier("_你"));
    test_eq!("你1_", Token::Identifier("你1_"));
    test_eq!("你_1", Token::Identifier("你_1"));
    test_eq!("你_1你", Token::Identifier("你_1你"));
    test_eq!("你1_你", Token::Identifier("你1_你"));
    test_eq!("你__你", Token::Identifier("你__你"));
    test_eq!("你_你1", Token::Identifier("你_你1"));
    
    test_eq!("п", Token::Identifier("п"));
    test_eq!("привет", Token::Identifier("привет"));
    test_eq!("ПРИВЕТ", Token::Identifier("ПРИВЕТ"));
    test_eq!("П1", Token::Identifier("П1"));
    test_eq!("П1П", Token::Identifier("П1П"));
    test_eq!("П_", Token::Identifier("П_"));
    test_eq!("П_П", Token::Identifier("П_П"));
    test_eq!("_П", Token::Identifier("_П"));
    test_eq!("П1_", Token::Identifier("П1_"));
    test_eq!("П_1", Token::Identifier("П_1"));
    test_eq!("П_1П", Token::Identifier("П_1П"));
    test_eq!("П1_П", Token::Identifier("П1_П"));
    test_eq!("П__П", Token::Identifier("П__П"));
    test_eq!("П_П1", Token::Identifier("П_П1"));
}

Token::Identifier is defined as follows:

    #[regex(r"(\p{XID_Start}|_)\p{XID_Continue}*")]
    Identifier(&'a str),

test_eq is just a custom convenience macro that builds a Lexer from an input string, runs it, collects all Oks from it into a Vec and compares with another Vec constructed from the rest of macro arguments.
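
For readers who want something similar, a macro of that shape could look roughly like the sketch below. This is not the actual macro from my project: `lex` is a hypothetical std-only stand-in for building the Logos lexer (it splits on whitespace and rejects words starting with an ASCII digit), and `Token` is cut down to a single variant.

```rust
#[derive(Debug, PartialEq)]
enum Token<'a> {
    Identifier(&'a str),
}

/// Hypothetical stand-in for the real lexer: yields one `Result` per
/// whitespace-separated word, with `Err(())` for words starting with a digit.
fn lex<'a>(input: &'a str) -> impl Iterator<Item = Result<Token<'a>, ()>> + 'a {
    input.split_whitespace().map(|word| {
        if word.starts_with(|c: char| c.is_ascii_digit()) {
            Err(())
        } else {
            Ok(Token::Identifier(word))
        }
    })
}

/// Lexes `$input`, keeps only the `Ok` tokens, and compares them with the
/// expected tokens given as the remaining arguments.
macro_rules! test_eq {
    ($input:expr, $($expected:expr),+ $(,)?) => {{
        let actual: Vec<_> = lex($input).filter_map(Result::ok).collect();
        assert_eq!(actual, vec![$($expected),+]);
    }};
}
```

With the real lexer, the only line that should need to change is the call to `lex` inside the macro.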

Aug 02 '23 16:08 elenakrittik

Can this be caused by the fact that we use a previous version of regex-syntax? Can you try using logos from #320?

Aug 02 '23 17:08 jeertmans

Sorry for the delay.

Assuming this input:

someident
_someident

..this logos dependency:

logos = { git = "https://github.com/jeertmans/logos.git", branch = "bump-regex-syntax" }

..and this token definition:

    #[regex(r"\p{XID_Start}\p{XID_Continue}*")]
    Identifier(&'a str),

Running logos yields the following result (parse currently sorts all errors and tokens by their spans and then prints them in order):

    Running `target/debug/gdtk parse --file incremental.gd`
Identifier("someident")
Newline
error: Unknown character.
--> incremental.gd:10..11
Identifier("someident")

So it seems regex-syntax defers the job of transforming \ps to the actual regex "executor", and that "executor" mistakenly (or not?) does not count underscore as a valid XID_Start character.

Aug 10 '23 11:08 elenakrittik

That's definitely a bug because executing the following code

use regex_syntax::parse;

fn main() {
    let regex = r"(\p{XID_Start}|_)\p{XID_Continue}*";
    let hir = parse(regex).unwrap();
    
    println!("{hir:#?}");
}

on the Rust playground yields a seemingly correct HIR, which is used to construct Logos' MIR.

Aug 10 '23 12:08 jeertmans

Perhaps you wanted to test the original regex? In your snippet you're using an alternation to include underscore, but parsing \p{XID_Start} by itself still does not emit underscore as far as I can tell. (So maybe that is a bug, but in regex-syntax and not logos?)

Aug 10 '23 15:08 elenakrittik

Oh yeah, I didn't notice ...|_) was actually a fix.

But are you sure XID_Start includes the underscore? I cannot see that anywhere on https://unicode.org/reports/tr31/#Table_Lexical_Classes_for_Identifiers.

For example, the unicode-ident crate, specialized for XID_Start and XID_Continue, returns false when checking if _ matches XID_Start.

use unicode_ident::*;

fn main() {
    let b = is_xid_start('_');
    
    println!("{b}");
}

See rust playground.

Aug 10 '23 15:08 jeertmans

mistakenly (or not?)

I was not sure whether this is intended behaviour or a bug in some library in the supply chain. With your comment, it seems that underscore is indeed not part of the XID_Start group (which feels strange, since most, if not all, languages allow identifiers to start with an underscore). For that reason, I think it would be right to ask @CAD97 to edit their regex to include the underscore alternation so that fewer people coming across this thread fall into this trap.

Aug 10 '23 16:08 elenakrittik

Well maybe it’s worth creating an issue on the regex-syntax crate :) But if that’s handwritten in the Unicode rules, I don’t think that will ever change 😅

Aug 10 '23 20:08 jeertmans

Unicode® Standard Annex #31, Unicode Identifier and Pattern Syntax, calls out _ specifically in §2.4 Specific Character Adjustments (as well as in §1.2 Customization) as being XID_Continue but not XID_Start, while noting that it is commonly added to the Start class for identifiers. When representing the Rust identifier syntax, it is of course correct to include _ in Start. For a generic identifier syntax, though, it is better to use the unmodified character classes and point at UAX#31 so that language designers can make an informed decision. Implementors of an existing grammar are expected to know what the grammar they're implementing calls for.
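
To make the difference concrete, here is a small std-only sketch of the two choices. Note the loud assumption: it uses `char::is_alphabetic` and `char::is_alphanumeric` as rough approximations of XID_Start and XID_Continue, which they are not exactly; a real implementation would use the proper Unicode properties (for example through the regex classes above or the unicode-ident crate). `is_identifier` and its `underscore_start` flag are hypothetical names.

```rust
/// Approximate identifier check. With `underscore_start == false` this
/// mimics the unmodified `\p{XID_Start}\p{XID_Continue}*`; with `true`
/// it mimics the customized `(\p{XID_Start}|_)\p{XID_Continue}*`.
/// The std predicates only approximate the XID classes.
fn is_identifier(s: &str, underscore_start: bool) -> bool {
    let mut chars = s.chars();
    let first = match chars.next() {
        Some(c) => c,
        None => return false, // the empty string is never an identifier
    };

    // Start class: alphabetic, plus `_` only if the grammar opts in.
    let start_ok = first.is_alphabetic() || (underscore_start && first == '_');

    // Continue class: `_` is always allowed here.
    start_ok && chars.all(|c| c.is_alphanumeric() || c == '_')
}
```

The only place the two grammars differ is the first character, which is exactly the customization point the annex describes.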

Aug 10 '23 20:08 CAD97

Closing this as I feel this is now documented in the handbook. If needed, feel free to re-open this issue!

Feb 13 '24 09:02 jeertmans
