Any (rudimentary) documentation and/or examples?
I'm mostly interested in what the patterns look like and how the "merge with EBNF" is (planned to be) realized.
Hi, not yet. This is actually the part of the project I am currently working on. There is a lot to figure out here, and the various choices have a lot of implications.
The general idea is that I'm starting with something like a traditional EBNF-like syntax:
Newline = "\r\n" | "\r" | "\n" ;
Space = " " | "\t" ;
Whitespace = Newline | Space ;
Assignment = "=" ;
Semicolon = ";" ;
Identifier = /[A-Z][[:alnum:]_]*/ ;
NestedPattern = "/", Pattern, "/" ;
Expr = NestedPattern | Identifier ;
ExprList = (ExprList, ",")? , Expr ;
Definition = Identifier , Whitespace+, Assignment, Whitespace+, ExprList, Semicolon ;
Grammar = (Definition | Expr | Whitespace+)+, EOF ;
You'll notice the above is pretty vanilla EBNF, except for the definition of Identifier:
Identifier = /[A-Z][[:alnum:]_]*/ ;
As you can guess, the text between / and /; is a regexp. There are some implications here (at the least, your regexp cannot contain a literal unescaped /;\n) which I'm still evaluating.
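To make the delimiter implication concrete, here is a minimal Python sketch of how a scanner might find the end of a /.../ pattern literal. It is not Zorex's implementation: the function name `read_pattern_literal` is hypothetical, and the rules it enforces (a backslash escapes the next character, an unescaped / closes the literal, an unescaped newline is an error) are assumptions for illustration, since the exact restrictions are still being evaluated.

```python
from typing import Tuple


def read_pattern_literal(src: str, start: int = 0) -> Tuple[str, int]:
    """Return (pattern body, index just past the closing '/').

    Assumed rules (hypothetical, not Zorex's actual behavior):
    a backslash escapes the next character, an unescaped '/' ends
    the literal, and an unescaped newline is rejected.
    """
    if start >= len(src) or src[start] != "/":
        raise ValueError("pattern literal must start with '/'")
    i = start + 1
    while i < len(src):
        ch = src[i]
        if ch == "\\":
            # Skip the escaped character so that '\/' does not close the literal.
            i += 2
            continue
        if ch == "\n":
            raise ValueError("unescaped newline inside pattern literal")
        if ch == "/":
            return src[start + 1 : i], i + 1
        i += 1
    raise ValueError("unterminated pattern literal")


# Usage: scan the Identifier definition's right-hand side.
body, end = read_pattern_literal("/[A-Z][[:alnum:]_]*/ ;")
print(body)
```

The point of the sketch is only that the regexp author has to escape whatever the outer grammar uses as a terminator, which is the trade-off mentioned above.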
The other thought is that the above, on its own, is not actually a valid input. Instead, you must end an input with a main expression:
Newline = "\r\n" | "\r" | "\n" ;
Space = " " | "\t" ;
Whitespace = Newline | Space ;
Assignment = "=" ;
Semicolon = ";" ;
Identifier = /[A-Z][[:alnum:]_]*/ ;
NestedPattern = "/", Pattern, "/" ;
Expr = NestedPattern | Identifier ;
ExprList = (ExprList, ",")? , Expr ;
Definition = Identifier , Whitespace+, Assignment, Whitespace+, ExprList, Semicolon ;
Grammar = (Definition | Expr | Whitespace+)+, EOF ;
+Grammar;
The above effectively says that Grammar is the main parser entrypoint expression.
Since it is an expression, a valid program would also be just:
/[A-Z][[:alnum:]_]*/
And here you start to get an idea of how one could start with regular expressions and begin to break those out into more EBNF-like definitions as your regexp gets more complex.
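As a rough illustration of that progression (not of Zorex itself), the following Python sketch first matches an identifier with a single regexp and then breaks the same pattern out into named pieces combined in sequence, much like the EBNF `,` operator. The helper names `regex`, `seq`, and `matches_fully` are hypothetical, and because Python's `re` module does not support POSIX classes such as `[[:alnum:]]`, the ASCII class `[A-Za-z0-9_]` is used as a stand-in.

```python
import re
from typing import Callable, Optional

# A parser takes (text, position) and returns the end position, or None on failure.
Parser = Callable[[str, int], Optional[int]]


def regex(pattern: str) -> Parser:
    """Leaf parser backed by a regular expression."""
    compiled = re.compile(pattern)

    def parse(text: str, pos: int) -> Optional[int]:
        m = compiled.match(text, pos)
        return m.end() if m else None

    return parse


def seq(*parts: Parser) -> Parser:
    """Run parsers one after another, like `A, B` in the grammar above."""

    def parse(text: str, pos: int) -> Optional[int]:
        for part in parts:
            result = part(text, pos)
            if result is None:
                return None
            pos = result
        return pos

    return parse


def matches_fully(parser: Parser, text: str) -> bool:
    """True if the parser consumes the whole input."""
    return parser(text, 0) == len(text)


# Stage 1: the whole thing is one regexp (ASCII stand-in for [[:alnum:]_]).
identifier_v1 = regex(r"[A-Z][A-Za-z0-9_]*")

# Stage 2: the same pattern broken out into named definitions.
upper = regex(r"[A-Z]")
tail = regex(r"[A-Za-z0-9_]*")
identifier_v2 = seq(upper, tail)

print(matches_fully(identifier_v1, "ExprList"), matches_fully(identifier_v2, "ExprList"))
```

Both stages accept the same inputs here; the second just gives each piece a name that other definitions could reuse.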
And of course, because Zorex is built on a generalized LL parser, you're not restricted to regexp at all: you can move on to parsing full left- and right-recursive context-free grammars, as well as some context-sensitive ones.
However, this is all still at a very early stage. I haven't figured out a number of important things, so this is more of an experiment at this point.
I'm curious, do you have a use case for something like this? Are you looking for something that could do this, or just looking for a generic Zig regexp engine?
Thanks for a thorough answer and the outlook!
> I'm curious, do you have a use case for something like this? Are you looking for something that could do this, or just looking for a generic Zig regexp engine?
No, I don't have any use case per se for this. I'm mainly curious how you're going to balance the three main components: the syntax restrictions of Zig, the regex syntax requirements, and the LL parser requirements. If this could be generalized or partially reused, I'd be interested in the SLOC to get a glimpse of how much complexity and effort is needed for such a tool. And lastly, I'm also curious what the performance will look like in practice and what the practical limitations are.
So, all in all I'm interested in pretty much the whole idea (which I find kind of novel in the world of AOT compiled languages) and all its implications :wink:.