npeg
Input wanted: design and implement a proper API for code blocks
Here's a request to all NPeg users: I'm looking for some ideas and feedback for the proper design of the API for grammar code blocks. At this time the available functions feel ad hoc and scattered, and I don't dare to call this v1.0 before a better interface is in place.
Here is a list of things a code block capture can do, and how this works now:
- Access captures: when a code block executes, all closed captures that were made inside the rule, and the code block capture itself, are available to the code block as the variable `capture`, which is of type `seq[Capture]`. The `Capture` object has a few public members that can be used from the code block: `s` is the captured string and `si` is the index into the original subject where the capture was matched. The capture strings `s` are also accessible through a sugar macro that rewrites the code block to use `$n`:
  - `capture[0]` or `$0` is the total capture of the rule
  - `capture[1..]` or `$1` and up are the explicit string captures inside the rule
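  For example, a minimal sketch of how these appear inside a code block (this grammar and subject are invented for illustration):

  ```nim
  import npeg

  let p = peg "pair":
    pair <- >+Alpha * '=' * >+Digit:
      echo $0             # total capture of the rule: "a=1"
      echo $1, " ", $2    # explicit captures: "a" and "1"
      echo capture[2].si  # subject index where "1" was matched: 2

  doAssert p.match("a=1").ok
  ```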
- Force a rule to fail:
  - code blocks have access to a proc `fail()` which will cause the match to fail. This can be used to validate matches with Nim code, for example to check integer ranges, enum values, or symbols that were captured earlier.
  - code blocks have access to a proc `validate(b: bool)`, which is nothing more than `if not b: fail()`. For example:

    ```nim
    uint8 <- >Digit[1..3]:
      let v = parseInt($1)
      validate v >= 0 and v <= 255
    ```
- Perform matching in Nim code: the internal match state is implicitly available to the code block, which can (ab)use it to do the matching in Nim code: the code block can see and adjust the match index, and indicate success or failure. For example:
  ```nim
  let p = peg foo:
    foo <- "ba" * die * "da"
    die <- 0:
      if si <= s.len-3 and cast[string](s[si..si+2]) == "die":
        si += 3
      else:
        fail()

  echo p.match("badieda")
  ```
All of the above features were introduced in NPeg one at a time, growing the API without a proper overall design.
My question: does the above feel as kludgy as I think it is, or is it fine as it is now? Any ideas for improvement or what this should look like?
I'd like a hook which allows code execution when a rule is looked up. This way you can control when your code will be executed.
```nim
let p = peg foo:
  onLookUp:
    # (...)
  onMatch:
    # (...)
```
Yeah, this is one of my nitpicks as well. I usually solve this with always-match rules in the grammar like so:
```nim
foo <- fooStart * fooBody * fooEnd
fooStart <- 0:
  echo "start"
fooEnd <- 0:
  echo "done"
```
I do like your idea, but I'm afraid it will add a lot of noise for the cases where only the match code block is required. Maybe with some macro magic we can make it so that without explicit hooks the block acts like the onMatch, keeping backwards compatibility, and only when hooks are defined, they are used. Let me see how that would fit in.
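For instance, something like this could stay backwards compatible (a hypothetical sketch of the combined behaviour, using the `onLookUp`/`onMatch` names proposed above; nothing like this is implemented):

```nim
let p = peg foo:
  foo <- bar * "a":
    # no hooks: the whole block behaves as the implicit onMatch, as today
    echo "foo matched"
  bar <- "b":
    # hooks present: each block runs at its own point in the match
    onLookUp:
      echo "trying bar"
    onMatch:
      echo "bar matched"
```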
Suggestions:
- Rather than expose parser state as multiple variables, expose it as a single variable. This would be consistent with how the capture data is presented. Alternately, the parser state could be represented as a variable, and the capture would be a field in the parser state.
- For captures in code blocks, only create the captured string on demand.
- Use more descriptive names than `s`, `si`, etc. for variables and members.
- Allow the code block to optionally use return statements to indicate whether a match succeeded or failed. That being said, such a change might require that all code blocks have a return statement.
Note that I haven't really evaluated the feasibility for the above suggestions. It's been a while since I've looked at Npeg's implementation, and I wanted to get down these thoughts while they were fresh in my mind.
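To make the first suggestion concrete, a hypothetical sketch of what a single state variable could look like inside a code block (all names here are invented; nothing like this exists in NPeg yet):

```nim
# hypothetical: all parser state behind one injected `parser` variable
number <- +Digit:
  echo parser.subject            # instead of the injected `s`
  echo parser.index              # instead of the injected `si`
  if parseInt(parser.capture[0].s) > 255:
    fail()
```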
Quoting Varriount (2019-12-03 15:59:48)
> - Rather than expose parser state as multiple variables, expose it as a single variable. This would be consistent with how the capture data is presented. Alternately, the parser state could be represented as a variable, and the capture would be a field in the parser state.
Yeah, I have moved stuff to and from that model a few times. The state itself is in an object, but everything is moved to local variables during the match for performance reasons. I'm also not sure how to expose the subject string in a different variable, as that would require a copy of the subject, which is potentially large. I shall do some tests with shallow() to see if that helps.
> - Use more descriptive names than `s`, `si`, etc. for variables and members.
Sure, the current variable names are bad, but they are short because of the implementation. They also have a high chance of clashing with code in the blocks; this already happens to me too often.
> - Allow the code block to optionally use return statements to indicate whether a match succeeded or failed. That being said, such a change might require that all code blocks have a return statement.
Nope, it will just return the default value for the type then. I did some tests with making the code blocks procs, but I can not find a way to keep the compiler happy with captures and closures and GC safety. I already do some nasty tricks with injects and identifiers to pass stuff around, and I was hoping someone who understands this all better would one day point out how to actually do that properly.
> Note that I haven't really evaluated the feasibility for the above suggestions. It's been a while since I've looked at Npeg's implementation, and I wanted to get down these thoughts while they were fresh in my mind.
No problem, any input is valuable. It will probably take me a few tries to get this right anyway; all of the above are good points. Thanks!
-- :wq ^X^Cy^K^X^C^C^C^C
> Yeah, I have moved stuff to and from that model a few times. The state itself is in an object, but everything is moved to local variables during the match for performance reasons. I'm also not sure how to expose the subject string in a different variable, as that would require a copy of the subject, which is potentially large. I shall do some tests with shallow() to see if that helps.
Could a "subject" procedure be made for the capture object, allocates the captured string when accessed? Due to UFCS, it would still look like attribute access:
```nim
let s = capture.subject
```
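A sketch of what such an accessor could look like, assuming `Capture` were changed to store a reference to the subject plus the match bounds instead of a pre-built string (the `src` and `len` fields are invented here):

```nim
# hypothetical: build the captured string only when it is asked for
proc subject(c: Capture): string =
  c.src[c.si ..< c.si + c.len]
```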
> Nope, it will just return the default value for the type then. I did some tests with making the code blocks procs, but I can not find a way to keep the compiler happy with captures and closures and GC safety.
What about passing in the parser state and capture object as parameters, and marking the function as both inline and giving it a regular calling convention?
```nim
proc match(s: string) =
  proc inner(parser: Parser, capture: Capture): bool {.inline, nimcall.} =
    discard
```

You might be able to get rid of the `inline` pragma, since compilers are usually pretty good about inlining small functions anyway.
> What about passing in the parser state and capture object as parameters, and marking the function as both inline and giving it a regular calling convention?
Seems like a sane approach, I'll give it another try. Thanks for pointing that out!
-- :wq ^X^Cy^K^X^C^C^C^C
I suggest using the return value of the code block to pass a parsed value to the upper rule/capture. Add a compiler flag to prevent code blocks from executing when they fail.
This is essential for anyone to construct a custom AST without using hacks to catch the proper tree context. Without this, code blocks are useless when trying to construct an AST.
Use a tree structure to cache captures (instead of the list you are using), so you can easily remove entire failed branches and detect the tree context. Execute code blocks when the full match is completed and ok.
A better alternative to the compiler flag could be: pass an object to the code block where you can set two callback functions:
- validate: executed on every match, like now, and where you can use `fail()` and `validate()`.
- commit: executed on commit, when the full match is done. You can set the result value to pass a parsed object to the upper rule, like a custom object, instead of using `push()` with limited strings.

And, add an option to the parser to disable backtracking for languages that don't require it. With this option, you can execute both callbacks in the same pass.
> I suggest using the return value of the code block to pass a parsed value to the upper rule/capture.
Do you mean when captures are nested? Because apart from that there is no apparent upper/lower or parent/child relationship between rules at this time.
> Add a compiler flag to prevent code blocks from executing when they fail.
But what if we want to use a code block to do the validation and pass/fail from its result?
> This is essential for anyone to construct a custom AST without using hacks to catch the proper tree context. Without this, code blocks are useless when trying to construct an AST.
Let's call it "suboptimal", not useless - see the examples with perfectly functional parsers.
> Use a tree structure to cache captures (instead of the list you are using), so you can easily remove entire failed branches and detect the tree context.
But what would dictate the tree structure? The input is just a linear stream, where do the relationships come from?
> Execute code blocks when the full match is completed and ok.
See above: code blocks have multiple uses, one of those is using Nim code to do manual parsing and let the parser fail or succeed from there.
I feel you are looking for a way to let NPeg have a sense of AST in its parser internally, but at this time the parsing is just sequential: seq-of-chars go in, seq-of-captures comes out. Structure is added by the user of NPeg by building AST in captures. Note that NPeg did have this kind of functionality in the past (see below), but I decided to deprecate this - it proved not flexible enough because the order of nodes in the stream often does not match the order of the nodes as they should end up in the AST tree, thus the user ends up with a generic tree where things are dangling in the wrong place.
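For reference, the kind of user-side AST building meant here typically keeps an explicit stack that the code blocks push to and fold as rules complete (a minimal sketch, not from this thread; all names are invented):

```nim
import npeg

type Node = ref object
  op: string        # "+" for interior nodes, "" for leaves
  val: string       # digit string for leaves
  kids: seq[Node]

var stack: seq[Node]

let p = peg "exp":
  number <- +Digit:
    stack.add Node(val: $0)
  addend <- '+' * number:
    # fold the two topmost nodes into a '+' node
    let (r, l) = (stack.pop, stack.pop)
    stack.add Node(op: "+", kids: @[l, r])
  exp <- number * *addend * !1

doAssert p.match("1+2+3").ok  # stack now holds one root: (1+2)+3
```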
Documentation of deprecated AST (Abstract Syntax Tree) captures
Note: AST captures is an experimental feature, the implementation or API might change in the future.
NPeg has a simple mechanism for storing captures in a tree data structure, allowing building of abstract syntax trees straight from the parser.
The basic AST node has the following layout:
```nim
ASTNode* = ref object
  id*: string          # user assigned AST node ID
  val*: string         # string capture
  kids*: seq[ASTNode]  # child nodes
```
To parse a subject and capture strings into a tree of `ASTNode`s, use the `A(id, p)` operator:

- The `A(id, p)` operator creates a new `ASTNode` with the given identifier
- The first string capture (`>`) inside pattern `p` will be assigned to the `val` field of the AST node
- All nested `ASTNode`s in pattern `p` will be added to the node's `kids` seq.
The following snippet shows an example of creating an AST from arithmetic expressions while properly handling operator precedence:
```nim
type Kind* = enum kInt, kAdd, kSub, kMul, kDiv

let s = peg "line":
  line <- exp * !1
  number <- A(kInt, >+Digit)
  add <- A(kAdd, term * '+' * exp)
  sub <- A(kSub, term * '-' * exp)
  mul <- A(kMul, factor * '*' * term)
  divi <- A(kDiv, factor * '/' * term)
  exp <- add | sub | term
  term <- mul | divi | factor
  factor <- number | '(' * exp * ')'

let r = s.match("1+2*(3+4+9)+5*6")
let ast = r.capturesAST()
echo ast
```
This will generate an AST tree with the following layout:
```
      +
     / \
    1   +
       / \
      /   \
     /     *
    *     / \
   / \   5   6
  2   +
     / \
    3   +
       / \
      4   9
```
> A better alternative to the compiler flag could be: pass an object to the code block where you can set two callback functions. 1. validate: executed on every match, like now, and where you can use `fail()` and `validate()`. 2. commit: executed on commit, when the full match is done.
That's not a bad idea, and with a bit of syntactic sugar this could be nicely wrapped up in some blocks instead of explicit callbacks. Something like:
```nim
import npeg

let p = peg "flop":
  flop <- one * two | one * three
  one <- '1':
    match:
      echo "one?"
    commit:
      echo "one!"
  two <- '2':
    echo "two"
  three <- '3':
    echo "three"
```
> And, add an option to the parser to disable backtracking for languages that don't require it. With this option, you can execute both callbacks in the same pass.
> You can set the result value to pass a parsed object to the upper rule, like a custom object, instead of using `push()` with limited strings.
This is still a bit of a mismatch with how NPeg works at this time: there is no concept of an "upper rule" - all rules are sequential and there is no tree representation. In the message above you can see this could work by making the tree explicit in the syntax, but in practice this proves of limited use because the generated AST can only mirror the exact order of the input stream.
I didn't know about the `A` operator, but I don't believe it's a good solution, because it doesn't allow for arbitrary code execution and custom ASTs.
After analyzing the NPeg code I believe I have a more concrete proposal with no impact on the current spec. But you have to fill and check the details.
- First, remove `capStack` and `CapFrame`. You need a stack at the rule level instead, with `ruleStack` and `RuleFrame`. `fixCaptures` and `collectCaptures` are code hacks because you're squishing a square into a circle. The `RuleFrame` stores a sequence of `captures: seq[Capture]` and the context capture `ctx: Capture`. It also has `error: string` and `ok: bool` fields for error handling.
- You need something like `opRuleOpen` and `opRuleClose` to replace those `opCapOpen` and `opCapClose`. And just have an `opCapture` to replace the execution of a captured rule.
- I propose a set of context functions with the signatures below. Deprecate the old ones, but they should still work.

  ```nim
  template fail(msg = "")
  template check(expr: bool, msg = "")
  template capture[T](s: T)
  ```

  - `fail` sets the `RuleFrame.error` and `RuleFrame.ok`.
  - `check` conditionally sets the `RuleFrame.error` and `RuleFrame.ok`.
  - `capture` sets the `RuleFrame.ctx`.
- The `Capture` can accept values of type `T` or `seq[T]`.
- Expose captured parameters with internal casts, e.g. `%x` or something else.
- Expose a new method `ast()` for the parser that returns the top-level capture (`T` or `seq[T]`).

Note that the `ruleStack` is somewhat redundant, since its boundaries map to rule calls, so we could use the program stack instead. However, since there are inlined rules, it's better to have it.
Grabbing the doc example:
```nim
type MyNode* = object
  key*: string
  val*: int

let parser = peg("pairs"):
  pairs <- >pair * *(',' * >pair) * !1:
    # do something with the captured result. Ideally it should be of type seq[MyNode]
  word <- +Alpha
  number <- +Digit
  pair <- >word * '=' * >number:
    let node = MyNode(key: %1, val: parseInt(%2))
    capture(node)
```
The algorithm execution goes like this:
- At `opRuleOpen`, push a `RuleFrame` to `ruleStack`.
- At each `opCapture` in the rule frame, grab the captured result (evaluated from the rule) and add it to the `RuleFrame.captures`, but only if the rule has not failed.
- At `opRuleClose`, execute the code block, where the `%x` vars are mapped to `RuleFrame.captures`. Pop the `RuleFrame` and return `RuleFrame.ctx`. If `ctx` has no value, return the `RuleFrame.captures`. If the rule has no captures, it's considered a PEG leaf; return the respective string value.
Improvements:
- It's compatible with the current API.
- Failed rules will be automatically discarded. The final custom AST is correct.
- Outputs a validated and custom AST model.
- Seamless mapping between PEG grammar and internal structures. No `fixCaptures` and `collectCaptures` code hacks; you just pass the captures as they are. I expect performance improvements, since `fixCaptures` looks like an expensive execution.
Challenges:
- Providing the correct cast type for `%x` parameters may be challenging. Ideally it should return the expected type, but that may not be possible.
- Ideally, a rule of "ordered choices" should return an enum with the expected types, but that may not be possible.
- I propose a first version where casts have to be done by the user.
Thanks for giving this a good think. I'll have to go through this a few times and properly digest it, I'll probably be back with questions.
@shumy-tools One thing I would like to point out is that what one might consider the optimal layout/ruleset for a grammar (in terms such as readability, performance, etc.) may not match up with their AST (also, there are uses for NPeg other than AST building). I don't imagine this to be too common an occurrence, but it is possible.
@zevv Hi, I've been running into issues with limitations around code blocks: specifically because they're executed regardless of whether they're matched or not. A particular example I had trouble with was implementing a URI parser (not just validator), where I had quite a lot of backtracking due to some optional blocks. I don't know if the backtracking could be eliminated by proper use of precedence rules (I've yet to understand those) but figured I'd share what I ended up going with to sort of highlight what I find difficult to do in the current model.
Example of a URI PEG
Note the gross trick this PEG is pulling: it checks whether optional captures were matched by checking the length of `capture`, but as this is only unambiguous with a single optional capture, it has to break the rules down into pairs of two. It then resets fields of the global object to their default value if the optional captures were not matched.
let parser = peg("uri", url: Url):
unreserved <- {'a'..'z', '0'..'9', '-', '.', '_', '~'}
gendelims <- {':', '/', '?', '#', '[', ']', '@'}
subdelims <- {'!', '$', '&', '\'', '(', ')', '*', '+', ',', ';', '='}
reserved <- gendelims | subdelims
percentenc <- "%" * Xdigit[2]
pathchars <- unreserved | percentenc | subdelims
decpiece <- Digit |
("1" * Digit * Digit) |
("2" * {'0'..'4'} * Digit) |
("25" * {'0'..'5'})
ipv4addr <- decpiece * "." * decpiece * "." * decpiece * "." * decpiece
ipv6piece <- {'0'..'9', 'a'..'f'}[1..4]
ipv6addr <- (ipv6piece * ":")[7] * ipv6piece |
(ipv6piece * ":")[0..6] * ":" * (":" * ipv6piece)[1] |
(ipv6piece * ":")[0..5] * ":" * (":" * ipv6piece)[2] |
(ipv6piece * ":")[0..4] * ":" * (":" * ipv6piece)[3] |
(ipv6piece * ":")[0..3] * ":" * (":" * ipv6piece)[4] |
(ipv6piece * ":")[0..2] * ":" * (":" * ipv6piece)[5] |
(ipv6piece * ":")[0..1] * ":" * (":" * ipv6piece)[6] |
":" * (":" * ipv6piece)[7]
scheme <- Alpha * *(Alpha | Digit | '+' | '-' | '.'):
url.scheme = $0
userinfo <- *(pathchars | ":"):
url.userinfo = $0
host <- ipv4addr | "[" * ipv6addr * "]" | *pathchars:
url.host = $0
port <- *Digit:
url.port = parseInt($0)
hostport <- host * ?(":" * >port):
if capture.len != 2:
url.port = 0
authority <- ?(>userinfo * "@") * hostport:
url.authority = $0
if capture.len != 2:
url.userinfo = ""
path <- (?("/") * *(+pathchars * *("/" * *pathchars))):
url.path = $0
relative <- ("//" * >authority * path) | path:
if capture.len != 2:
url.authority = ""
url.host = ""
query <- *(pathchars | ":" | "@" | "/" | "?"):
url.query = $0
fragment <- *(pathchars | ":" | "@" | "/" | "?"):
url.fragment = $0
schemerelative <- ?(>scheme * ":") * relative:
if capture.len != 2: url.scheme = ""
schemerelativequery <- schemerelative * ?("?" * >query):
if capture.len != 2: url.query = ""
uri <- schemerelativequery * ?("#" * >fragment):
if capture.len != 2: url.fragment = ""
I've been thinking a little bit about how to solve this, because the simple "don't execute code blocks for stuff that isn't matched" isn't a great solution: since one purpose of code blocks is to further restrict matches, it would lead to a bunch of implicit ordering rules.
My second idea might work. Essentially, I'd like it if code blocks and captures always returned a type: when there are no captures, this is simply a string (`$0`); when there are captures, this varies from `Option[string]` to `seq[string]` to others. Then, captures are treated as a fixed set of parameters to a code block. This then allows for propagating matched and parsed captures up to the main rule, where they can safely be returned. This would break all existing nPEGs.
```nim
let parser = peg:
  rule <- expression: Type =
    # arbitrary code execution
    return capture
```
This would still allow the above example - mutating a global-ish variable - but it'd be an antipattern, better solved by a sequence of returned types. For an example of what this would look like, I rewrote below the original PEG I had problems with.
Example with code blocks returning types
```nim
import questionable

let parser = peg("uri"):
  unreserved <- {'a'..'z', '0'..'9', '-', '.', '_', '~'}
  gendelims <- {':', '/', '?', '#', '[', ']', '@'}
  subdelims <- {'!', '$', '&', '\'', '(', ')', '*', '+', ',', ';', '='}
  reserved <- gendelims | subdelims
  percentenc <- "%" * Xdigit[2]
  pathchars <- unreserved | percentenc | subdelims
  decpiece <- Digit |
              ("1" * Digit * Digit) |
              ("2" * {'0'..'4'} * Digit) |
              ("25" * {'0'..'5'})
  ipv4addr <- decpiece * "." * decpiece * "." * decpiece * "." * decpiece
  ipv6piece <- {'0'..'9', 'a'..'f'}[1..4]
  ipv6addr <- (ipv6piece * ":")[7] * ipv6piece |
              (ipv6piece * ":")[0..6] * ":" * (":" * ipv6piece)[1] |
              (ipv6piece * ":")[0..5] * ":" * (":" * ipv6piece)[2] |
              (ipv6piece * ":")[0..4] * ":" * (":" * ipv6piece)[3] |
              (ipv6piece * ":")[0..3] * ":" * (":" * ipv6piece)[4] |
              (ipv6piece * ":")[0..2] * ":" * (":" * ipv6piece)[5] |
              (ipv6piece * ":")[0..1] * ":" * (":" * ipv6piece)[6] |
              ":" * (":" * ipv6piece)[7]
  scheme <- Alpha * *(Alpha | Digit | '+' | '-' | '.')
  userinfo <- *(pathchars | ":")
  host <- ipv4addr | "[" * ipv6addr * "]" | *pathchars
  port <- >*Digit: int =
    return parseInt($1)
  authority <- ?(>userinfo * "@") * >host * ?(":" * >port): (string, ?string, string, ?int) =
    return ($0, $1, $2, $3)
  path <- (?("/") * *(+pathchars * *("/" * *pathchars)))
  query <- *(pathchars | ":" | "@" | "/" | "?")
  fragment <- *(pathchars | ":" | "@" | "/" | "?")
  uri <- ?(>scheme * ":") * ?("//" * >authority) * >path * ?("?" * >query) * ?("#" * >fragment): Url =
    assert $1 is ?string and $2 is ?(string, ?string, string, ?int)
    assert $3 is string and $4 is ?string and $5 is ?string
    return Url(scheme: $1 |? "", authority: $2.?0 |? "",
               userinfo: $2.?1 |? "", host: $2.?2, port: $2.?3 |? 0,
               path: $3, query: $4 |? "", fragment: $5 |? "")
```
To go into more detail:

Instead of having a `capture: seq[string]`, captures captured by `>` are provided as a fixed-size tuple, e.g. `capture: (T, U, V)`. This is made possible by the following changes to capture rules:
- Matches within `?`, `|`, or `*` expressions as `Option[T]`
- Matches on a `()` grouping as `string`
- Sequence concatenation occurs as normal.
- While captures return types, they only return them if individually captured, and this doesn't affect matching.
- Matches on an existing rule, e.g. `>rule`, as `T`, where `T` is the type of the rule.
  - However, the rule still consumes what it would normally consume.
  - E.g. you can have a rule consume a string and return an int.
- All other matches as `string` or `char`, as appropriate.
I'm not sure how much I like my second proposal. It's pretty syntactically dense, albeit with mostly standard syntax and significantly aided by questionable. The idea of rules having a type that can then be captured has stuck in my head though; I don't know what the nicest, cleanest way to do it would be. Sorry if this is a bit of a brain dump - I don't have all the semantics of this approach thought through, but wanted to get them down somewhere.
First, I would like to thank you for creating the powerful npeg project. It is really intuitive and easy to understand.
After using it for a while, I ran into this situation, which led me to this issue. Here is my immature idea: referring to the above discussion, I think it is indeed necessary to represent each capture as a `tuple`, or `seq[seq[T]]`.
```nim
import npeg

type
  TokenKind = enum
    Id, Ellipsis, Comma
  Token = ref object
    kind: TokenKind
    str: string
  Context = ref object

proc `==`(t: Token, tk: TokenKind): bool = t.kind == tk

let lexer = peg(input, tokens: seq[Token]):
  id <- +Alpha:
    tokens.add Token(kind: Id, str: $0)
  ellipsis <- "...":
    tokens.add Token(kind: Ellipsis, str: $0)
  comma <- *Blank * >',' * *Blank:
    tokens.add Token(kind: Comma, str: $1)
  args <- id * *(comma * id) * ?(comma * ellipsis)
  input <- '(' * ?args * ')' * !1

let parser = peg(input, Token, s: seq[string]):
  arg <- [Id]:
    s.add ($0).str
  comma <- [Comma]:
    s.add ($0).str
  ellipsis <- [Ellipsis]:
    s.add ($0).str
  args <- arg * *(comma * arg)
  argList <- args * ?(comma * ellipsis)
  input <- argList * !1

var tokens: seq[Token]
doAssert lexer.match("(a, ...)", tokens).ok

var s: seq[string]
doAssert parser.match(tokens, s).ok
```
In the lexer, I do not want to re-analyze the indeterminate-length result of `id * *(comma * id)`. Therefore, I capture each segment in `id` and `comma`. However, because there is still `?(comma * ellipsis)` later, a backtrack will be generated before encountering `...`. This causes two commas to appear between `a` and `...`, which makes the input fail in the parser. If `*(comma * id)` and `?(comma * ellipsis)` could be represented as `seq[string]`, this condition would be avoided.
In addition, in the `lexer`, the length of the string can be changed to represent the match result when encountering the pattern `*(comma * id)`. However, in the `parser`, because the input is of type `Token`, it is impossible to represent the range of the match without using `seq[Token]`.
This might help with accessing the indeterminate-length result, but not with the issues generated by backtracking.