Common regexes for the Logos Handbook

Open maciejhirsz opened this issue 4 years ago • 5 comments

Not to go off-topic in #132, but it's pretty common for someone to end up reinventing a regex for something that can be done better, or to have trouble finding one that works (#124). This is doubly problematic since Logos doesn't support non-greedy matching (for good reasons). There is a lot of overlap when lexing programming languages on things like:

  • Comments
  • Quoted strings
  • Floating point numbers with scientific notation

I already started noodling on a book to supplement the API docs, and I plan on having a chapter with commonly used regexes so people can look things up quickly. If you can think of something that might be a common use case, please add it in the comments below so it can be included in the book.

maciejhirsz avatar Apr 25 '20 09:04 maciejhirsz

It's probably worth including an example for lexing (most of?) Rust's lexical structure, because that serves as a common baseline that users of logos probably understand. That said, here's a dump of the regex I've been using for the tokens I think are likely to be more generally reusable:

  • line comment (excluding trailing newline): //[^\n]*
  • line comment (including trailing newline): //[^\n]*\n?
  • block comment (unnested): /\*(?:[^*]|\*[^/])*\*/
  • identifier (UAX#31): \p{XID_Start}\p{XID_Continue}*
  • identifier (Rust): [\p{XID_Start}_]\p{XID_Continue}*
  • identifier (traditional ASCII): [_a-zA-Z][_0-9a-zA-Z]*
  • binary integer: 0b_*[01][_01]*
  • octal integer: 0o_*[0-7][_0-7]*
  • decimal integer: [0-9][_0-9]*
  • decimal float: (?&digits)(?:e(?&digits)|\.(?&digits)(?:e(?&digits))?)
  • string (minimal escapes): "(?:[^"]|\\")*"
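
The binary and octal patterns above share one shape: a prefix, optional underscores, one mandatory digit, then any mix of digits and underscores. As a rough illustration of what that shape accepts, here is a std-only sketch that does not use Logos at all. `match_prefixed_int` is a hypothetical helper name, not something from the thread or the library:

```rust
/// Returns the byte length of the longest prefix of `input` that matches
/// `<prefix>_*<digit>[_<digit>]*`, where `<digit>` is a digit in `radix`.
/// Hypothetical helper for illustration only; it is not part of Logos.
fn match_prefixed_int(input: &str, prefix: &str, radix: u32) -> Option<usize> {
    let rest = input.strip_prefix(prefix)?;
    let bytes = rest.as_bytes();
    let mut i = 0;

    // `_*`: any number of underscores before the first digit.
    while i < bytes.len() && bytes[i] == b'_' {
        i += 1;
    }

    // One mandatory digit, otherwise there is no match.
    if i >= bytes.len() || !(bytes[i] as char).is_digit(radix) {
        return None;
    }
    i += 1;

    // `[_<digit>]*`: digits and underscores in any order.
    while i < bytes.len() && (bytes[i] == b'_' || (bytes[i] as char).is_digit(radix)) {
        i += 1;
    }

    Some(prefix.len() + i)
}
```

For example, `match_prefixed_int("0b__1_0 rest", "0b", 2)` covers the seven bytes of `0b__1_0`, while `0b_` on its own is rejected because the mandatory digit is missing.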

CAD97 avatar Apr 25 '20 21:04 CAD97

I regret nothing.

[+-]?(([0-9][_0-9]*\.([0-9][_0-9]*)?([eE][+-]?([0-9][_0-9]*))?[fFdD]?)|(\.([0-9][_0-9]*)([eE][+-]?([0-9][_0-9]*))?[fFdD]?)|(([0-9][_0-9]*)([eE][+-]?([0-9][_0-9]*))[fFdD]?)|(([0-9][_0-9]*)([eE][+-]?([0-9][_0-9]*))?[fFdD]))

This matches Java-style floating-point literals.

So, it'll match stuff like:

2.2F
3.5F
.0
.2D
.2F
.2e2
-2e2

And reject stuff like:

1
.F
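
For readers who find the long regex hard to audit, the following is a hand-written, std-only sketch of the same four alternatives (digits with a dot, a leading dot, digits with an exponent, digits with a suffix). It checks whether a whole string is a literal rather than finding the longest prefix the way Logos would. The names `digits`, `exponent` and `is_java_float` are made up for this illustration:

```rust
/// Consumes `[0-9][_0-9]*` starting at `pos`.
/// Returns the position after the run, or `None` if no digit is at `pos`.
fn digits(b: &[u8], pos: usize) -> Option<usize> {
    if !b.get(pos)?.is_ascii_digit() {
        return None;
    }
    let mut i = pos + 1;
    while i < b.len() && (b[i] == b'_' || b[i].is_ascii_digit()) {
        i += 1;
    }
    Some(i)
}

/// Consumes `[eE][+-]?[0-9][_0-9]*` starting at `pos`, if present.
fn exponent(b: &[u8], pos: usize) -> Option<usize> {
    if !matches!(b.get(pos).copied(), Some(b'e' | b'E')) {
        return None;
    }
    let mut i = pos + 1;
    if matches!(b.get(i).copied(), Some(b'+' | b'-')) {
        i += 1;
    }
    digits(b, i)
}

/// Whole-string check mirroring the four alternatives of the regex above.
/// Illustrative sketch only; names are hypothetical.
fn is_java_float(s: &str) -> bool {
    let b = s.as_bytes();
    let mut i = 0;

    // Optional sign.
    if matches!(b.first().copied(), Some(b'+' | b'-')) {
        i = 1;
    }

    // Optional integer part.
    let int_end = digits(b, i);
    let has_int = int_end.is_some();
    i = int_end.unwrap_or(i);

    // Optional dot, followed by an optional fraction.
    let has_dot = b.get(i).copied() == Some(b'.');
    let mut has_frac = false;
    if has_dot {
        i += 1;
        if let Some(end) = digits(b, i) {
            has_frac = true;
            i = end;
        }
    }

    // Optional exponent.
    let mut has_exp = false;
    if let Some(end) = exponent(b, i) {
        has_exp = true;
        i = end;
    }

    // Optional type suffix.
    let mut has_suffix = false;
    if matches!(b.get(i).copied(), Some(b'f' | b'F' | b'd' | b'D')) {
        has_suffix = true;
        i += 1;
    }

    // With a dot, digits are needed on at least one side.
    // Without a dot, an exponent or a suffix is what makes it a float.
    let shape_ok = if has_dot {
        has_int || has_frac
    } else {
        has_int && (has_exp || has_suffix)
    };

    shape_ok && i == b.len()
}
```

The last condition is where `1` gets rejected: with no dot, no exponent and no suffix it is an integer, not a float.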

Jezza avatar Sep 04 '20 14:09 Jezza

I found out about subpatterns about 3 hours ago and in that time I cleaned up the regexes I use.

use logos::Logos;

#[derive(Logos)]
#[logos(subpattern decimal = r"[0-9][_0-9]*")]
#[logos(subpattern hex = r"[0-9a-fA-F][_0-9a-fA-F]*")]
#[logos(subpattern octal = r"[0-7][_0-7]*")]
#[logos(subpattern binary = r"[0-1][_0-1]*")]
#[logos(subpattern exp = r"[eE][+-]?[0-9][_0-9]*")]
enum TokenKind {
	#[regex(r"//.*\n?", logos::skip)]
	#[regex(r"[ \t\n\f]+", logos::skip)]
	#[error]
	Error,

	#[regex("(?&decimal)")]
	Integer,

	#[regex("0[xX](?&hex)")]
	HexInteger,

	#[regex("0[oO](?&octal)")]
	OctalInteger,

	#[regex("0[bB](?&binary)")]
	BinaryInteger,

	#[regex(r#"[+-]?(((?&decimal)\.(?&decimal)?(?&exp)?[fFdD]?)|(\.(?&decimal)(?&exp)?[fFdD]?)|((?&decimal)(?&exp)[fFdD]?)|((?&decimal)(?&exp)?[fFdD]))"#)]
	Float,

	#[regex(r"0[xX](((?&hex))|((?&hex)\.)|((?&hex)?\.(?&hex)))[pP][+-]?(?&decimal)[fFdD]?")]
	HexFloat,
}

These aren't all of the tokens I use, but I feel like a lot of them are very common. (With the exception of HexFloat...)

It should be a good starting point for anyone who doesn't want to think about number literals (floats in particular).

Jezza avatar Sep 04 '20 17:09 Jezza

  • block comment (unnested): /*(?:[^*]|\*[^/])*\*/

This regex has a few issues:

  • Missing \ before the first *
  • Fails to match /***/

Here is the version I am using: /\*([^*]|\*+[^*/])*\*+/ (test cases)
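
To make the difference concrete, here is a small std-only sketch of the state machine that the regex above describes: after the opening `/*`, it only needs to remember whether the previous character was a `*`. This is an illustration written for this thread, not code from Logos, and `match_block_comment` is a hypothetical name:

```rust
/// Returns the byte length of an unnested block comment at the start of
/// `input`, following `/\*([^*]|\*+[^*/])*\*+/`.
/// Hypothetical helper for illustration only.
fn match_block_comment(input: &str) -> Option<usize> {
    let rest = input.strip_prefix("/*")?;

    // True while the most recent byte was part of a run of `*`.
    let mut after_star = false;

    for (i, byte) in rest.bytes().enumerate() {
        match (after_star, byte) {
            // `*+/` closes the comment: 2 bytes of `/*` plus everything up to here.
            (true, b'/') => return Some(2 + i + 1),
            // Start or continue a run of stars.
            (_, b'*') => after_star = true,
            // Any other byte resets the run.
            _ => after_star = false,
        }
    }

    // Ran out of input without seeing `*/`.
    None
}
```

With this formulation `/***/` is accepted in full (5 bytes), because a run of stars of any length may precede the closing `/`. The earlier pattern consumes stars in pairs via `\*[^/]` and is then left without a `*` to pair with the final `/`.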

HactarCE avatar Mar 03 '21 06:03 HactarCE

@HactarCE I'm getting an [Error] with that regex if I try to do something as simple as:

#[derive(Logos, Debug, PartialEq, Copy, Clone)]
pub(crate) enum CommentLexer {
    #[regex(r"/\*([^*]|\*+[^*/])*\*+/")]
    Comment,
    #[error]
    Error,
}

and try to lex /* basic */ (or any comment really). Any idea what's wrong?

Edit: got it working via callbacks https://github.com/maciejhirsz/logos/issues/180#issuecomment-736401091

Keats avatar May 22 '21 07:05 Keats

  • \p{XID_Start}\p{XID_Continue}*

Maybe that's intentional, but this regex does not match identifiers starting with underscore. I personally use (\p{XID_Start}|_)\p{XID_Continue}*.

https://regex101.com/r/4gYRqp/1 https://regex101.com/r/z7glYD/1

elenakrittik avatar Jul 28 '23 11:07 elenakrittik

Hello @elenakrittik, do you know if regex-syntax supports that kind of regexes?

jeertmans avatar Aug 02 '23 15:08 jeertmans

Not sure which "kind" you are referring to. If you're talking about \p classes, then I think yes, because the lexer tests I've written for the above-mentioned regex (which include multiple languages and symbols, and combinations of them) pass just fine.

EDIT: Specifically, this test does pass:

#[test]
fn test_identifier() {
    test_eq!("x", Token::Identifier("x"));
    test_eq!("xyz", Token::Identifier("xyz"));
    test_eq!("XYZ", Token::Identifier("XYZ"));
    test_eq!("X1", Token::Identifier("X1"));
    test_eq!("X1X", Token::Identifier("X1X"));
    test_eq!("X_", Token::Identifier("X_"));
    test_eq!("X_X", Token::Identifier("X_X"));
    test_eq!("_X", Token::Identifier("_X"));
    test_eq!("X1_", Token::Identifier("X1_"));
    test_eq!("X_1", Token::Identifier("X_1"));
    test_eq!("X_1X", Token::Identifier("X_1X"));
    test_eq!("X1_X", Token::Identifier("X1_X"));
    test_eq!("X__X", Token::Identifier("X__X"));
    test_eq!("X_X1", Token::Identifier("X_X1"));

    test_eq!("你", Token::Identifier("你"));
    test_eq!("你好", Token::Identifier("你好"));
    test_eq!("你1", Token::Identifier("你1"));
    test_eq!("你1你", Token::Identifier("你1你"));
    test_eq!("你_", Token::Identifier("你_"));
    test_eq!("你_你", Token::Identifier("你_你"));
    test_eq!("_你", Token::Identifier("_你"));
    test_eq!("你1_", Token::Identifier("你1_"));
    test_eq!("你_1", Token::Identifier("你_1"));
    test_eq!("你_1你", Token::Identifier("你_1你"));
    test_eq!("你1_你", Token::Identifier("你1_你"));
    test_eq!("你__你", Token::Identifier("你__你"));
    test_eq!("你_你1", Token::Identifier("你_你1"));
    
    test_eq!("п", Token::Identifier("п"));
    test_eq!("привет", Token::Identifier("привет"));
    test_eq!("ПРИВЕТ", Token::Identifier("ПРИВЕТ"));
    test_eq!("П1", Token::Identifier("П1"));
    test_eq!("П1П", Token::Identifier("П1П"));
    test_eq!("П_", Token::Identifier("П_"));
    test_eq!("П_П", Token::Identifier("П_П"));
    test_eq!("_П", Token::Identifier("_П"));
    test_eq!("П1_", Token::Identifier("П1_"));
    test_eq!("П_1", Token::Identifier("П_1"));
    test_eq!("П_1П", Token::Identifier("П_1П"));
    test_eq!("П1_П", Token::Identifier("П1_П"));
    test_eq!("П__П", Token::Identifier("П__П"));
    test_eq!("П_П1", Token::Identifier("П_П1"));
}

Token::Identifier is defined as follows:

    #[regex(r"(\p{XID_Start}|_)\p{XID_Continue}*")]
    Identifier(&'a str),

test_eq is just a custom convenience macro that builds a Lexer from an input string, runs it, collects all Oks from it into a Vec and compares with another Vec constructed from the rest of macro arguments.

elenakrittik avatar Aug 02 '23 16:08 elenakrittik

Can this be caused by the fact that we use a previous version of regex-syntax? Can you try using logos from #320?

jeertmans avatar Aug 02 '23 17:08 jeertmans

Sorry for the delay.

Assuming this input:

someident
_someident

..this logos dependency:

logos = { git = "https://github.com/jeertmans/logos.git", branch = "bump-regex-syntax" }

..and this token definition:

    #[regex(r"\p{XID_Start}\p{XID_Continue}*")]
    Identifier(&'a str),

Running logos yields the following result (parse currently sorts all errors and tokens by their spans and then prints them in order):

    Running `target/debug/gdtk parse --file incremental.gd`
Identifier("someident")
Newline
error: Unknown character.
--> incremental.gd:10..11
Identifier("someident")

So it seems regex-syntax defers the job of transforming \ps to the actual regex "executor", and that "executor" mistakenly (or not?) does not count underscore as a valid XID_Start character.

elenakrittik avatar Aug 10 '23 11:08 elenakrittik

That's definitely a bug because executing the following code

use regex_syntax::parse;

fn main() {
    let regex = r"(\p{XID_Start}|_)\p{XID_Continue}*";
    let hir = parse(regex).unwrap();
    
    println!("{hir:#?}");
}

on the Rust playground yields a seemingly correct HIR, which is used to construct Logos' MIR.

jeertmans avatar Aug 10 '23 12:08 jeertmans

Perhaps you wanted to test the original regex? In your snippet you're using an alternation to include underscore, but parsing \p{XID_Start} by itself still does not emit underscore as far as I can tell. (So maybe that is a bug, but in regex-syntax and not logos?)

elenakrittik avatar Aug 10 '23 15:08 elenakrittik

Oh yeah, I didn't notice ...|_) was actually a fix.

But are you sure XID_Start includes the underscore? I cannot see that anywhere on https://unicode.org/reports/tr31/#Table_Lexical_Classes_for_Identifiers.

For example, the unicode-ident crate, specialized for XID_Start and XID_Continue, returns false when checking if _ matches XID_Start.

use unicode_ident::*;

fn main() {
    let b = is_xid_start('_');
    
    println!("{b}");
}

See rust playground.

jeertmans avatar Aug 10 '23 15:08 jeertmans

mistakenly (or not?)

I was not sure whether this is intended behaviour or a bug in some library in the supply chain. With your comment it seems that underscore is indeed not part of the XID_Start class (which feels strange, since most, if not all, languages allow identifiers to start with underscore). For that reason, I think it would be right to ask @CAD97 to edit their regex to include the underscore alternation, so that fewer people coming across this thread fall into this trap.

elenakrittik avatar Aug 10 '23 16:08 elenakrittik

Well maybe it’s worth creating an issue on the regex-syntax crate :) But if that’s handwritten in the Unicode rules, I don’t think that will ever change 😅

jeertmans avatar Aug 10 '23 20:08 jeertmans

Unicode® Standard Annex #31 UNICODE IDENTIFIER AND PATTERN SYNTAX §2.4 Specific Character Adjustments (as well as §1.2 Customization) calls out _ specifically as being XID_Continue but not XID_Start, while noting that it is commonly added to the Start class for identifiers. When representing Rust's identifier syntax, it is of course correct to include _ in Start. But for generic identifier syntax, it is better to use the unmodified character classes and point at UAX#31 so that language designers can make an informed decision. Implementors of an existing grammar are expected to know what the grammar they're implementing calls for.
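
The distinction can be sketched without any Unicode crate. The example below is an approximation under a loud assumption: `char::is_alphabetic` and `char::is_alphanumeric` from the standard library stand in for XID_Start and XID_Continue, which they resemble but do not equal. The function names are hypothetical. The point is only that the leading underscore is a per-language choice layered on top of the unmodified classes:

```rust
/// ASSUMPTION: rough stand-in for `\p{XID_Start}` using the standard library.
/// The real property differs in edge cases; like the real one, it excludes `_`.
fn xid_start_approx(c: char) -> bool {
    c.is_alphabetic()
}

/// ASSUMPTION: rough stand-in for `\p{XID_Continue}`, which does include `_`.
fn xid_continue_approx(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Whole-string identifier check.
/// `allow_leading_underscore = false` mirrors `\p{XID_Start}\p{XID_Continue}*`;
/// `true` mirrors `(\p{XID_Start}|_)\p{XID_Continue}*`.
fn is_identifier(s: &str, allow_leading_underscore: bool) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        Some(c) if xid_start_approx(c) || (allow_leading_underscore && c == '_') => {
            // Every remaining character must be a continue character.
            chars.all(xid_continue_approx)
        }
        _ => false,
    }
}
```

With the unmodified start class, `_someident` is rejected, which matches the "Unknown character" error reported earlier in the thread. Opting in to the underscore accepts it.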

CAD97 avatar Aug 10 '23 20:08 CAD97

Closing this as I feel this is now documented in the handbook. If needed, feel free to re-open this issue!

jeertmans avatar Feb 13 '24 09:02 jeertmans