Common regexes for the Logos Handbook
Not to go off-topic in #132, but it's pretty common that someone ends up reinventing a regex for something that can be done better, or has trouble finding one that works (#124). This is doubly problematic since Logos doesn't support non-greedy matching (for good reasons). There is a lot of overlap when lexing programming languages, on things like:
- Comments
- Quoted strings
- Floating point numbers with scientific notation
I already started noodling on a book to supplement the API docs, and I plan on having a chapter with commonly used regexes so people can look things up quickly. If you can think of something that might be a common use case, please add it in the comments below so it can be included in the book.
It's probably worth including an example for lexing (most of?) Rust's lexical structure, because that serves as a common baseline that users of logos probably understand. That said, here's a dump of the regexes I've been using for the tokens I think are likely to be more generally reusable:
- line comment (excluding trailing newline):
//[^\n]*
- line comment (including trailing newline):
//[^\n]*\n?
- block comment (unnested):
/\*(?:[^*]|\*[^/])*\*/
- identifier (UAX#31):
\p{XID_Start}\p{XID_Continue}*
- identifier (Rust):
[\p{XID_Start}_]\p{XID_Continue}*
- identifier (traditional ASCII):
[_a-zA-Z][_0-9a-zA-Z]*
- binary integer:
0b_*[01][_01]*
- octal integer:
0o_*[0-7][_0-7]*
- decimal integer:
[0-9][_0-9]*
- decimal float:
(?&digits)(?:e(?&digits)|\.(?&digits)(?:e(?&digits))?)
- string (minimal escapes):
"(?:[^"]|\\")*"
I regret nothing.
[+-]?(([0-9][_0-9]*\.([0-9][_0-9]*)?([eE][+-]?([0-9][_0-9]*))?[fFdD]?)|(\.([0-9][_0-9]*)([eE][+-]?([0-9][_0-9]*))?[fFdD]?)|(([0-9][_0-9]*)([eE][+-]?([0-9][_0-9]*))[fFdD]?)|(([0-9][_0-9]*)([eE][+-]?([0-9][_0-9]*))?[fFdD]))
This matches Java-style floating-point literals.
So, it'll match stuff like:
2.2F
3.5F
.0
.2D
.2F
.2e2
-2e2
And reject stuff like:
1
.F
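For anyone who wants to sanity-check the accept/reject lists above without pulling in a regex engine, here is a minimal hand-written sketch in plain Rust (std only) that mirrors the four alternatives of the regex. The names eat_digits, eat_exp and is_java_float are made up for this illustration and are not part of Logos. It checks a whole string rather than finding a token inside a larger input.

```rust
/// Consumes `[0-9][_0-9]*` starting at byte `i`; returns the index just past it.
fn eat_digits(s: &[u8], i: usize) -> Option<usize> {
    if !s.get(i)?.is_ascii_digit() {
        return None;
    }
    let mut j = i + 1;
    while j < s.len() && (s[j].is_ascii_digit() || s[j] == b'_') {
        j += 1;
    }
    Some(j)
}

/// Consumes `[eE][+-]?[0-9][_0-9]*` starting at byte `i`.
fn eat_exp(s: &[u8], i: usize) -> Option<usize> {
    match s.get(i).copied()? {
        b'e' | b'E' => {}
        _ => return None,
    }
    let mut j = i + 1;
    if matches!(s.get(j).copied(), Some(b'+' | b'-')) {
        j += 1;
    }
    eat_digits(s, j)
}

/// Hypothetical helper: true if `text` as a whole has the shape of the
/// Java-style float regex (sign, digits, dot, fraction, exponent, suffix).
fn is_java_float(text: &str) -> bool {
    let s = text.as_bytes();
    // Optional leading sign.
    let start = if matches!(s.first().copied(), Some(b'+' | b'-')) { 1 } else { 0 };

    // Optional integer part.
    let int_end = eat_digits(s, start);
    let has_int = int_end.is_some();
    let mut j = int_end.unwrap_or(start);

    // Optional dot, optionally followed by a fraction.
    let mut has_dot = false;
    let mut has_frac = false;
    if s.get(j).copied() == Some(b'.') {
        has_dot = true;
        j += 1;
        if let Some(k) = eat_digits(s, j) {
            has_frac = true;
            j = k;
        }
    }

    // Optional exponent.
    let mut has_exp = false;
    if let Some(k) = eat_exp(s, j) {
        has_exp = true;
        j = k;
    }

    // Optional type suffix.
    let mut has_suffix = false;
    if matches!(s.get(j).copied(), Some(b'f' | b'F' | b'd' | b'D')) {
        has_suffix = true;
        j += 1;
    }

    // The four alternatives of the regex, in order:
    //   digits "." [digits]   |   "." digits   |   digits exp   |   digits suffix
    let shape_ok = (has_int && has_dot)
        || (has_dot && has_frac)
        || (has_int && (has_exp || has_suffix));

    shape_ok && j == s.len()
}

fn main() {
    for ok in ["2.2F", "3.5F", ".0", ".2D", ".2F", ".2e2", "-2e2"] {
        assert!(is_java_float(ok), "should accept {ok}");
    }
    for bad in ["1", ".F"] {
        assert!(!is_java_float(bad), "should reject {bad}");
    }
}
```

The final shape check is what rejects a bare integer like 1: with no dot, no exponent and no suffix, none of the four alternatives applies.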
I found out about subpatterns about 3 hours ago and in that time I cleaned up the regexes I use.
use logos::Logos;
#[derive(Logos)]
#[logos(subpattern decimal = r"[0-9][_0-9]*")]
#[logos(subpattern hex = r"[0-9a-fA-F][_0-9a-fA-F]*")]
#[logos(subpattern octal = r"[0-7][_0-7]*")]
#[logos(subpattern binary = r"[0-1][_0-1]*")]
#[logos(subpattern exp = r"[eE][+-]?[0-9][_0-9]*")]
enum TokenKind {
    #[regex(r"//.*\n?", logos::skip)]
    #[regex(r"[ \t\n\f]+", logos::skip)]
    #[error]
    Error,
    #[regex("(?&decimal)")]
    Integer,
    #[regex("0[xX](?&hex)")]
    HexInteger,
    #[regex("0[oO](?&octal)")]
    OctalInteger,
    #[regex("0[bB](?&binary)")]
    BinaryInteger,
    #[regex(r#"[+-]?(((?&decimal)\.(?&decimal)?(?&exp)?[fFdD]?)|(\.(?&decimal)(?&exp)?[fFdD]?)|((?&decimal)(?&exp)[fFdD]?)|((?&decimal)(?&exp)?[fFdD]))"#)]
    Float,
    #[regex(r"0[xX](((?&hex))|((?&hex)\.)|((?&hex)?\.(?&hex)))[pP][+-]?(?&decimal)[fFdD]?")]
    HexFloat,
}
This isn't all of the tokens I use, but I feel like a lot of these are very common (with the exception of HexFloat...).
It should be a good starting point for a lot of people who don't want to care about numbers (specifically floats).
- block comment (unnested):
/*(?:[^*]|\*[^/])*\*/
This regex has a few issues:
- Missing \ before the first *
- Fails to match /***/
Here is the version I am using: /\*([^*]|\*+[^*/])*\*+/
(test cases)
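To make it easier to see what the corrected regex accepts, here is a small std-only Rust sketch of the same language. The function name block_comment_len is hypothetical and nothing here is Logos API. It relies on the observation that in /\*([^*]|\*+[^*/])*\*+/ every run of stars inside the body must be followed by something other than /, so the match always ends at the first */ after the opener.

```rust
/// Hypothetical helper: returns the byte length of an unnested block comment
/// at the start of `input`, or `None` if there is no complete comment there.
fn block_comment_len(input: &str) -> Option<usize> {
    // The comment must open with "/*".
    let body = input.strip_prefix("/*")?;
    // An unnested comment ends at the first "*/" after the opener.
    let end = body.find("*/")?;
    // 2 bytes for the opener, `end` bytes of body, 2 bytes for the closer.
    Some(2 + end + 2)
}

fn main() {
    // The case the earlier regex failed on.
    assert_eq!(block_comment_len("/***/"), Some(5));
    // The opener's star cannot double as the closer's star.
    assert_eq!(block_comment_len("/*/"), None);
    // Only the comment itself is measured, not what follows it.
    assert_eq!(block_comment_len("/* basic */ rest"), Some(11));
}
```

A helper like this can also serve as a reference when writing test cases for the regex, since both should agree on every input.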
@HactarCE I'm getting an [Error] with that regex if I try to do something as simple as:
#[derive(Logos, Debug, PartialEq, Copy, Clone)]
pub(crate) enum CommentLexer {
    #[regex(r"/\*([^*]|\*+[^*/])*\*+/")]
    Comment,
    #[error]
    Error,
}
and try to lex /* basic */ (or any comment really).
Any idea what's wrong?
Edit: got it working via callbacks https://github.com/maciejhirsz/logos/issues/180#issuecomment-736401091
\p{XID_Start}\p{XID_Continue}*
Maybe that's intentional, but this regex does not match identifiers starting with an underscore. I personally use (\p{XID_Start}|_)\p{XID_Continue}*.
https://regex101.com/r/4gYRqp/1 https://regex101.com/r/z7glYD/1
Hello @elenakrittik, do you know if regex-syntax supports that kind of regex?
Not sure which "kind" you are referring to. If you're talking about the \p classes, then I think yes, because the lexer tests I've written for the above-mentioned regex (which include multiple languages and symbols, and combinations of them) pass just fine.
EDIT: Specifically, this test does pass:
#[test]
fn test_identifier() {
    test_eq!("x", Token::Identifier("x"));
    test_eq!("xyz", Token::Identifier("xyz"));
    test_eq!("XYZ", Token::Identifier("XYZ"));
    test_eq!("X1", Token::Identifier("X1"));
    test_eq!("X1X", Token::Identifier("X1X"));
    test_eq!("X_", Token::Identifier("X_"));
    test_eq!("X_X", Token::Identifier("X_X"));
    test_eq!("_X", Token::Identifier("_X"));
    test_eq!("X1_", Token::Identifier("X1_"));
    test_eq!("X_1", Token::Identifier("X_1"));
    test_eq!("X_1X", Token::Identifier("X_1X"));
    test_eq!("X1_X", Token::Identifier("X1_X"));
    test_eq!("X__X", Token::Identifier("X__X"));
    test_eq!("X_X1", Token::Identifier("X_X1"));
    test_eq!("你", Token::Identifier("你"));
    test_eq!("你好", Token::Identifier("你好"));
    test_eq!("你1", Token::Identifier("你1"));
    test_eq!("你1你", Token::Identifier("你1你"));
    test_eq!("你_", Token::Identifier("你_"));
    test_eq!("你_你", Token::Identifier("你_你"));
    test_eq!("_你", Token::Identifier("_你"));
    test_eq!("你1_", Token::Identifier("你1_"));
    test_eq!("你_1", Token::Identifier("你_1"));
    test_eq!("你_1你", Token::Identifier("你_1你"));
    test_eq!("你1_你", Token::Identifier("你1_你"));
    test_eq!("你__你", Token::Identifier("你__你"));
    test_eq!("你_你1", Token::Identifier("你_你1"));
    test_eq!("п", Token::Identifier("п"));
    test_eq!("привет", Token::Identifier("привет"));
    test_eq!("ПРИВЕТ", Token::Identifier("ПРИВЕТ"));
    test_eq!("П1", Token::Identifier("П1"));
    test_eq!("П1П", Token::Identifier("П1П"));
    test_eq!("П_", Token::Identifier("П_"));
    test_eq!("П_П", Token::Identifier("П_П"));
    test_eq!("_П", Token::Identifier("_П"));
    test_eq!("П1_", Token::Identifier("П1_"));
    test_eq!("П_1", Token::Identifier("П_1"));
    test_eq!("П_1П", Token::Identifier("П_1П"));
    test_eq!("П1_П", Token::Identifier("П1_П"));
    test_eq!("П__П", Token::Identifier("П__П"));
    test_eq!("П_П1", Token::Identifier("П_П1"));
}
Token::Identifier is defined as follows:
#[regex(r"(\p{XID_Start}|_)\p{XID_Continue}*")]
Identifier(&'a str),
test_eq is just a custom convenience macro that builds a Lexer from an input string, runs it, collects all the Ok values from it into a Vec and compares that with another Vec constructed from the rest of the macro arguments.
Can this be caused by the fact that we use a previous version of regex-syntax? Can you try using logos from #320?
Sorry for the delay.
Assuming this input:
someident
_someident
..this logos dependency:
logos = { git = "https://github.com/jeertmans/logos.git", branch = "bump-regex-syntax" }
..and this token definition:
#[regex(r"\p{XID_Start}\p{XID_Continue}*")]
Identifier(&'a str),
Running logos yields the following result (parse currently sorts all errors and tokens by their spans and then prints them in order):
Running `target/debug/gdtk parse --file incremental.gd`
Identifier("someident")
Newline
error: Unknown character.
--> incremental.gd:10..11
Identifier("someident")
So it seems regex-syntax defers the job of expanding the \p classes to the actual regex "executor", and that "executor" mistakenly (or not?) does not count underscore as a valid XID_Start character.
That's definitely a bug because executing the following code
use regex_syntax::parse;
fn main() {
    let regex = r"(\p{XID_Start}|_)\p{XID_Continue}*";
    let hir = parse(regex).unwrap();
    println!("{hir:#?}");
}
on the Rust playground yields a seemingly correct HIR, which is used to construct Logos' MIR.
Perhaps you wanted to test the original regex? In your snippet you're using an alternation to include underscore, but parsing \p{XID_Start} by itself still does not emit underscore as far as I can tell. (So maybe that is a bug, but in regex-syntax and not logos?)
Oh yeah, I didn't notice ...|_) was actually a fix.
But are you sure XID_Start includes the underscore? I cannot see that anywhere on https://unicode.org/reports/tr31/#Table_Lexical_Classes_for_Identifiers.
For example, the unicode-ident crate, specialized for XID_Start and XID_Continue, returns false when checking if _ matches XID_Start.
use unicode_ident::*;
fn main() {
    let b = is_xid_start('_');
    println!("{b}");
}
See rust playground.
mistakenly (or not?)
I was not sure whether this is intended behaviour or a bug in some library in the supply chain. With your comment, it seems that underscore indeed is not part of the XID_Start group (which feels strange, since most, if not all, languages allow identifiers to start with an underscore). For that reason, I think it would be right to ask @CAD97 to edit their regex to include the underscore alternation, so that fewer people coming across this thread fall into this trap.
Well maybe it’s worth creating an issue on the regex-syntax crate :) But if that’s handwritten in the Unicode rules, I don’t think that will ever change 😅
Unicode® Standard Annex #31 UNICODE IDENTIFIER AND PATTERN SYNTAX §2.4 Specific Character Adjustments calls _ out specifically as being XID_Continue but not XID_Start, while noting that it is commonly included in the Start class for identifiers; §1.2 Customization says the same. When representing the Rust identifier syntax, it is of course correct to include _ in Start. But for generic identifier syntax, it is better to use the unmodified character classes and point at UAX#31, so that language designers can make an informed decision. Implementors of an existing grammar are expected to know what the grammar they're implementing calls for.
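To illustrate the "Start plus underscore" customization discussed above, here is a rough std-only Rust sketch. Big caveat: char::is_alphabetic and char::is_alphanumeric are only used as approximate stand-ins for XID_Start and XID_Continue (std does not expose those properties), and the function names are made up for this example. For real code, the \p{XID_Start} / \p{XID_Continue} classes or the unicode-ident crate mentioned earlier are the accurate sources.

```rust
/// ASSUMPTION: rough stand-in for XID_Start. Like XID_Start, it rejects '_'.
fn approx_xid_start(c: char) -> bool {
    c.is_alphabetic()
}

/// ASSUMPTION: rough stand-in for XID_Continue. Like XID_Continue, it accepts '_'.
fn approx_xid_continue(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Hypothetical helper: checks a whole string against
/// `Start Continue*`, optionally adding '_' to the Start class.
fn is_identifier(text: &str, allow_leading_underscore: bool) -> bool {
    let mut chars = text.chars();
    let first = match chars.next() {
        Some(c) => c,
        None => return false, // the empty string is never an identifier
    };
    let start_ok =
        approx_xid_start(first) || (allow_leading_underscore && first == '_');
    start_ok && chars.all(approx_xid_continue)
}

fn main() {
    // Unmodified classes: a leading underscore is rejected.
    assert!(is_identifier("someident", false));
    assert!(!is_identifier("_someident", false));
    // Customized Start class (Rust-like): a leading underscore is accepted.
    assert!(is_identifier("_someident", true));
    // An underscore after the first character is fine either way.
    assert!(is_identifier("你_1你", false));
}
```

The boolean flag is just a way to show both variants side by side; a lexer for a concrete language would normally hard-code whichever Start class its grammar specifies.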
Closing this as I feel this is now documented in the handbook. If needed, feel free to re-open this issue!