npeg
Input wanted: design and implement a proper API for code blocks
Here's a request to all NPeg users: I'm looking for some ideas and feedback for the proper design of the API for grammar code blocks. At this time the available functions feel ad hoc and scattered, and I don't dare to call this v1.0 before a better interface is in place.
Here is a list of things a code block capture can do, and how this works now:
- Access captures: when a code block executes, all closed captures that were made inside the rule, and the code block capture itself, are available to the code block as the variable `capture`, which is of type `seq[Capture]`. The `Capture` object has a few public members that can be used from the code block: `s` is the captured string and `si` is the index into the original subject where the capture was matched. The capture strings `s` are also accessible through a sugar macro that rewrites the code block to use `$n`:
  - `capture[0]` or `$0` is the total capture of the rule
  - `capture[1..]` or `$1` and up are the explicit string captures inside the rule
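  For example, a minimal sketch of how these appear inside a code block (this grammar and subject are invented for illustration):

  ```nim
  import npeg

  let p = peg "pair":
    pair <- >+Alpha * '=' * >+Digit:
      echo $0             # total capture of the rule: "a=1"
      echo $1, " ", $2    # explicit captures: "a" and "1"
      echo capture[2].si  # subject index where "1" was matched: 2

  doAssert p.match("a=1").ok
  ```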
- Force a rule to fail:
  - code blocks have access to a proc `fail()` which will cause the match to fail. This can be used to validate matches with Nim code, for example to check integer ranges, enum values, or symbols that were captured earlier.
  - code blocks have access to a proc `validate(b: bool)`, which is nothing more than `if not b: fail()`. For example:

    ```nim
    uint8 <- >Digit[1..3]:
      let v = parseInt($1)
      validate v >= 0 and v <= 255
    ```
- Perform matching in Nim code: the internal match state is implicitly available to the code block, which can (ab)use it to do the matching in Nim code: the code block can see and adjust the match index, and indicate success or failure. For example:
  ```nim
  let p = peg foo:
    foo <- "ba" * die * "da"
    die <- 0:
      if si <= s.len-3 and cast[string](s[si..si+2]) == "die":
        si += 3
      else:
        fail()

  echo p.match("badieda")
  ```
All of the above features were introduced in NPeg one at a time, growing the API without a proper overall design.
My question: does the above feel as kludgy as I think it is, or is it fine as it is now? Any ideas for improvement or what this should look like?
I'd like a hook which allows code execution when a rule is looked up. This way you can control when your code will be executed.
```nim
let p = peg foo:
  onLookUp:
    # (...)
  onMatch:
    # (...)
```
Yeah, this is one of my nitpicks as well. I usually solve this with always-match rules in the grammar like so:
```nim
foo <- fooStart * fooBody * fooEnd
fooStart <- 0:
  echo "start"
fooEnd <- 0:
  echo "done"
```
I do like your idea, but I'm afraid it will add a lot of noise for the cases where only the match code block is required. Maybe with some macro magic we can make it so that without explicit hooks the block acts like the onMatch, keeping backwards compatibility, and only when hooks are defined, they are used. Let me see how that would fit in.
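For instance, something like this could stay backwards compatible (a hypothetical sketch of the combined behaviour, using the `onLookUp`/`onMatch` names proposed above; nothing like this is implemented):

```nim
let p = peg foo:
  foo <- bar * "a":
    # no hooks: the whole block behaves as the implicit onMatch, as today
    echo "foo matched"
  bar <- "b":
    # hooks present: each block runs at its own point in the match
    onLookUp:
      echo "trying bar"
    onMatch:
      echo "bar matched"
```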
Suggestions:
- Rather than expose parser state as multiple variables, expose it as a single variable. This would be consistent with how the capture data is presented. Alternately, the parser state could be represented as a variable, and the capture would be a field in the parser state.
- For captures in code blocks, only create the captured string on demand.
- Use more descriptive names than `s`, `si`, etc. for variables and members.
- Allow the code block to optionally use return statements to indicate whether a match succeeded or failed. That being said, such a change might require that all code blocks have a return statement.
Note that I haven't really evaluated the feasibility for the above suggestions. It's been a while since I've looked at Npeg's implementation, and I wanted to get down these thoughts while they were fresh in my mind.
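To make the first suggestion concrete, a hypothetical sketch of what a single state variable could look like inside a code block (all names here are invented; nothing like this exists in NPeg yet):

```nim
# hypothetical: all parser state behind one injected `parser` variable
number <- +Digit:
  echo parser.subject            # instead of the injected `s`
  echo parser.index              # instead of the injected `si`
  if parseInt(parser.capture[0].s) > 255:
    fail()
```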
Quoting Varriount (2019-12-03 15:59:48)
> - Rather than expose parser state as multiple variables, expose it as a single variable. This would be consistent with how the capture data is presented. Alternately, the parser state could be represented as a variable, and the capture would be a field in the parser state.
Yeah, I have moved stuff to and from that model a few times. The state itself is in an object, but everything is moved to local variables during the match for performance reasons. I'm also not sure how to expose the subject string in a different variable, as that would require a copy of the subject, which is potentially large. I shall do some tests with shallow() to see if that helps.
> - Use more descriptive names than `s`, `si`, etc. for variables and members.
Sure, the current variable names are bad, but they are short because of the implementation. They also have a high chance of clashing with code in the blocks; this already happens to me too often.
> - Allow the code block to optionally use return statements to indicate whether a match succeeded or failed. That being said, such a change might require that all code blocks have a return statement.
Nope, it will just return the default value for the type then. I did some tests with making the code blocks procs, but I can not find a way to keep the compiler happy with captures and closures and GC safety. I already do some nasty tricks with injects and identifiers to pass stuff around, and I was hoping someone who understands this all better would one day point out how to actually do that properly.
> Note that I haven't really evaluated the feasibility for the above suggestions. It's been a while since I've looked at Npeg's implementation, and I wanted to get down these thoughts while they were fresh in my mind.
No problem, any input is valuable. It will probably take me a few tries to get this right anyway; all of the above are good points. Thanks!
-- :wq ^X^Cy^K^X^C^C^C^C
> Yeah, I have moved stuff to and from that model a few times. The state itself is in an object, but everything is moved to local variables during the match for performance reasons. I'm also not sure how to expose the subject string in a different variable, as that would require a copy of the subject, which is potentially large. I shall do some tests with shallow() to see if that helps.
Could a "subject" procedure be made for the capture object, allocates the captured string when accessed? Due to UFCS, it would still look like attribute access:
```nim
let s = capture.subject
```
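A sketch of what such an accessor could look like, assuming `Capture` were changed to store a reference to the subject plus the match bounds instead of a pre-built string (the `src` and `len` fields are invented here):

```nim
# hypothetical: build the captured string only when it is asked for
proc subject(c: Capture): string =
  c.src[c.si ..< c.si + c.len]
```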
> Nope, it will just return the default value for the type then. I did some tests with making the code blocks procs, but I can not find a way to keep the compiler happy with captures and closures and GC safety.
What about passing in the parser state and capture object as parameters, and marking the function as both inline and giving it a regular calling convention?
```nim
proc match(s: string) =
  proc inner(parser: Parser, capture: Capture): bool {.inline, nimcall.} =
    discard
```

You might be able to get rid of the `inline` pragma, since compilers are usually pretty good about inlining small functions anyway.
> What about passing in the parser state and capture object as parameters, and marking the function as both inline and giving it a regular calling convention?
Seems like a sane approach, I'll give it another try. Thanks for pointing that out!
-- :wq ^X^Cy^K^X^C^C^C^C
I suggest using the return value of the code block to pass a parsed value to the upper rule/capture. Add a compiler flag to prevent code blocks from executing when they fail.
This is essential for anyone to construct a custom AST without using hacks to catch the proper tree context. Without this, code blocks are useless when trying to construct an AST.
Use a tree structure to cache captures (instead of the list you are using), so you can easily remove entire failed branches and detect the tree context. Execute code blocks when the full match is completed and ok.
A better alternative to the compiler flag could be: pass an object to the code block where you can set two callback functions:
- validate: executed on every match, like now, and where you can use `fail()` and `validate()`.
- commit: executed on commit, when the full match is done. You can set the result value to pass a parsed object to the upper rule, like a custom object, instead of using `push()` with limited strings.

And, add an option to the parser to disable backtracking for languages that don't require it. With this option, you can execute both callbacks in the same pass.
> I suggest using the return value of the code block to pass a parsed value to the upper rule/capture.
Do you mean when captures are nested? Because apart from that there is no apparent upper/lower or parent/child relationship between rules at this time.
> Add a compiler flag to prevent code blocks from executing when they fail.
But what if we want to use a code block to do the validation and pass/fail from its result?
> This is essential for anyone to construct a custom AST without using hacks to catch the proper tree context. Without this, code blocks are useless when trying to construct an AST.
Let's call it "suboptimal", not useless - see the examples with perfectly functional parsers.
> Use a tree structure to cache captures (instead of the list you are using), so you can easily remove entire failed branches and detect the tree context.
But what would dictate the tree structure? The input is just a linear stream, where do the relationships come from?
> Execute code blocks when the full match is completed and ok.
See above: code blocks have multiple uses, one of those is using Nim code to do manual parsing and let the parser fail or succeed from there.
I feel you are looking for a way to let NPeg have a sense of AST in its parser internally, but at this time the parsing is just sequential: seq-of-chars go in, seq-of-captures comes out. Structure is added by the user of NPeg by building AST in captures. Note that NPeg did have this kind of functionality in the past (see below), but I decided to deprecate this - it proved not flexible enough because the order of nodes in the stream often does not match the order of the nodes as they should end up in the AST tree, thus the user ends up with a generic tree where things are dangling in the wrong place.
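For reference, the kind of user-side AST building meant here typically keeps an explicit stack that the code blocks push to and fold as rules complete (a minimal sketch, not from this thread; all names are invented):

```nim
import npeg

type Node = ref object
  op: string        # "+" for interior nodes, "" for leaves
  val: string       # digit string for leaves
  kids: seq[Node]

var stack: seq[Node]

let p = peg "exp":
  number <- +Digit:
    stack.add Node(val: $0)
  addend <- '+' * number:
    # fold the two topmost nodes into a '+' node
    let (r, l) = (stack.pop, stack.pop)
    stack.add Node(op: "+", kids: @[l, r])
  exp <- number * *addend * !1

doAssert p.match("1+2+3").ok  # stack now holds one root: (1+2)+3
```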
Documentation of deprecated AST (Abstract Syntax Tree) captures
Note: AST captures is an experimental feature, the implementation or API might change in the future.
NPeg has a simple mechanism for storing captures in a tree data structure, allowing building of abstract syntax trees straight from the parser.
The basic AST node has the following layout:
```nim
ASTNode* = ref object
  id*: string          # user assigned AST node ID
  val*: string         # string capture
  kids*: seq[ASTNode]  # child nodes
```
To parse a subject and capture strings into a tree of `ASTNode`s, use the `A(id, p)` operator:

- The `A(id, p)` operator creates a new `ASTNode` with the given identifier
- The first string capture (`>`) inside pattern `p` will be assigned to the `val` field of the AST node
- All nested `ASTNode`s in pattern `p` will be added to the node's `kids` seq.
The following snippet shows an example of creating an AST from arithmetic expressions while properly handling operator precedence:
```nim
type Kind* = enum kInt, kAdd, kSub, kMul, kDiv

let s = peg "line":
  line <- exp * !1
  number <- A(kInt, >+Digit)
  add <- A(kAdd, term * '+' * exp)
  sub <- A(kSub, term * '-' * exp)
  mul <- A(kMul, factor * '*' * term)
  divi <- A(kDiv, factor * '/' * term)
  exp <- add | sub | term
  term <- mul | divi | factor
  factor <- number | '(' * exp * ')'

let r = s.match("1+2*(3+4+9)+5*6")
let ast = r.capturesAST()
echo ast
```
This will generate an AST tree with the following layout:
```
      +
     / \
    1   +
       / \
      /   \
     /     *
    *     / \
   / \   5   6
  2   +
     / \
    3   +
       / \
      4   9
```
> A better alternative to the compiler flag could be: pass an object to the code block where you can set two callback functions. 1. validate: executed on every match, like now, and where you can use `fail()` and `validate()`. 2. commit: executed on commit, when the full match is done.
That's not a bad idea, and with a bit of syntactic sugar this could be nicely wrapped up in some blocks instead of explicit callbacks. Something like:
```nim
import npeg

let p = peg "flop":
  flop <- one * two | one * three
  one <- '1':
    match:
      echo "one?"
    commit:
      echo "one!"
  two <- '2':
    echo "two"
  three <- '3':
    echo "three"
```
> And, add an option to the parser to disable backtracking for languages that don't require it. With this option, you can execute both callbacks in the same pass.
> You can set the result value to pass a parsed object to the upper rule, like a custom object, instead of using `push()` with limited strings.
This is still a bit of a mismatch with how NPeg works at this time: there is no concept of an "upper rule" - all rules are sequential and there is no tree representation. In the message above you can see this could work by making the tree explicit in the syntax, but in practice this proves of limited use because the generated AST can only mirror the exact order of the input stream.
I didn't know about the `A` operator, but I don't believe it's a good solution, because it doesn't allow for arbitrary code execution and custom ASTs.
After analyzing the NPeg code I believe I have a more concrete proposal with no impact on the current spec. But you have to fill and check the details.
- First, remove `capStack` and `CapFrame`. You need a stack at the rule level instead, with `ruleStack` and `RuleFrame`. `fixCaptures` and `collectCaptures` are code hacks because you're squishing a square into a circle. The `RuleFrame` stores a sequence of `captures: seq[Capture]` and the context capture `ctx: Capture`. It also has `error: string` and `ok: bool` fields for error handling.
- You need something like `opRuleOpen` and `opRuleClose` to replace those `opCapOpen` and `opCapClose`. And just have an `opCapture` to replace the execution of a captured rule.
- I propose a set of context functions with the signatures below. Deprecate the old ones, but they should still work.

  ```nim
  template fail(msg = "")
  template check(expr: bool, msg = "")
  template capture[T](s: T)
  ```

  - `fail` sets the `RuleFrame.error` and `RuleFrame.ok`.
  - `check` conditionally sets the `RuleFrame.error` and `RuleFrame.ok`.
  - `capture` sets the `RuleFrame.ctx`.
- The `Capture` can accept values of type `T` or `seq[T]`.
- Expose captured parameters with internal casts, e.g. `%x` or something else.
- Expose a new method `ast()` for the parser that returns the top-level capture (`T` or `seq[T]`).

Note that the `ruleStack` is somewhat redundant, since its boundaries map to rule calls, so we could use the program stack instead. However, since there are inlined rules, it's better to have it.
Grabbing the doc example:
```nim
type MyNode* = object
  key*: string
  val*: int

let parser = peg("pairs"):
  pairs <- >pair * *(',' * >pair) * !1:
    # do something with the captured result. Ideally it should be of type seq[MyNode]
  word <- +Alpha
  number <- +Digit
  pair <- >word * '=' * >number:
    let node = MyNode(key: %1, val: parseInt(%2))
    capture(node)
```
The algorithm execution goes like this:
- At `opRuleOpen`, push a `RuleFrame` to `ruleStack`.
- At each `opCapture` in the rule frame, grab the captured result (evaluated from the rule) and add it to the `RuleFrame.captures`, but only if the rule has not failed.
- At `opRuleClose`, execute the code block, where the `%x` vars are mapped to `RuleFrame.captures`. Pop the `RuleFrame` and return `RuleFrame.ctx`. If `ctx` has no value, return the `RuleFrame.captures`. If the rule has no captures, it's considered a PEG leaf; return the respective string value.
Improvements:
- It's compatible with the current API.
- Failed rules will be automatically discarded. The final custom AST is correct.
- Outputs a validated and custom AST model.
- Seamless mapping between PEG grammar and internal structures. No `fixCaptures` and `collectCaptures` code hacks; you just pass the captures as they are. I expect performance improvements, since `fixCaptures` looks like an expensive execution.
Challenges:
- Providing the correct cast type for `%x` parameters may be challenging. Ideally it should return the expected type, but that may not be possible.
- Ideally, a rule of "ordered choices" should return an enum with the expected types, but that may not be possible.
- I propose a first version where casts have to be done by the user.
Thanks for giving this a good think. I'll have to go through this a few times and properly digest it, I'll probably be back with questions.
@shumy-tools One thing I would like to point out is that what one might consider the optimal layout/ruleset for a grammar (in terms such as readability, performance, etc.) may not match up with their AST (also, there are uses for NPeg other than AST building). I don't imagine this to be too common an occurrence, but it is possible.
@zevv Hi, I've been running into issues with limitations around code blocks: specifically because they're executed regardless of whether they're matched or not. A particular example I had trouble with was implementing a URI parser (not just validator), where I had quite a lot of backtracking due to some optional blocks. I don't know if the backtracking could be eliminated by proper use of precedence rules (I've yet to understand those) but figured I'd share what I ended up going with to sort of highlight what I find difficult to do in the current model.
Example of a URI PEG
Note the gross trick this PEG is pulling: it checks whether optional captures were matched by checking the length of `capture`, but as this is only unambiguous with a single optional capture, it has to break the rules down into pairs of two. It then resets fields of the global object to their default value if the optional captures were not matched.
let parser = peg("uri", url: Url):
unreserved <- {'a'..'z', '0'..'9', '-', '.', '_', '~'}
gendelims <- {':', '/', '?', '#', '[', ']', '@'}
subdelims <- {'!', '$', '&', '\'', '(', ')', '*', '+', ',', ';', '='}
reserved <- gendelims | subdelims
percentenc <- "%" * Xdigit[2]
pathchars <- unreserved | percentenc | subdelims
decpiece <- Digit |
("1" * Digit * Digit) |
("2" * {'0'..'4'} * Digit) |
("25" * {'0'..'5'})
ipv4addr <- decpiece * "." * decpiece * "." * decpiece * "." * decpiece
ipv6piece <- {'0'..'9', 'a'..'f'}[1..4]
ipv6addr <- (ipv6piece * ":")[7] * ipv6piece |
(ipv6piece * ":")[0..6] * ":" * (":" * ipv6piece)[1] |
(ipv6piece * ":")[0..5] * ":" * (":" * ipv6piece)[2] |
(ipv6piece * ":")[0..4] * ":" * (":" * ipv6piece)[3] |
(ipv6piece * ":")[0..3] * ":" * (":" * ipv6piece)[4] |
(ipv6piece * ":")[0..2] * ":" * (":" * ipv6piece)[5] |
(ipv6piece * ":")[0..1] * ":" * (":" * ipv6piece)[6] |
":" * (":" * ipv6piece)[7]
scheme <- Alpha * *(Alpha | Digit | '+' | '-' | '.'):
url.scheme = $0
userinfo <- *(pathchars | ":"):
url.userinfo = $0
host <- ipv4addr | "[" * ipv6addr * "]" | *pathchars:
url.host = $0
port <- *Digit:
url.port = parseInt($0)
hostport <- host * ?(":" * >port):
if capture.len != 2:
url.port = 0
authority <- ?(>userinfo * "@") * hostport:
url.authority = $0
if capture.len != 2:
url.userinfo = ""
path <- (?("/") * *(+pathchars * *("/" * *pathchars))):
url.path = $0
relative <- ("//" * >authority * path) | path:
if capture.len != 2:
url.authority = ""
url.host = ""
query <- *(pathchars | ":" | "@" | "/" | "?"):
url.query = $0
fragment <- *(pathchars | ":" | "@" | "/" | "?"):
url.fragment = $0
schemerelative <- ?(>scheme * ":") * relative:
if capture.len != 2: url.scheme = ""
schemerelativequery <- schemerelative * ?("?" * >query):
if capture.len != 2: url.query = ""
uri <- schemerelativequery * ?("#" * >fragment):
if capture.len != 2: url.fragment = ""
I've been thinking a little bit about how to solve this, because the simple "don't execute code blocks for stuff that isn't matched" isn't a great solution: since one purpose of code blocks is to further restrict matches, it would lead to a bunch of implicit ordering rules.
My second idea might work. Essentially, I'd like it if code blocks and captures always returned a type: when there are no captures, this is simply a string (`$0`); when there are captures, this varies from `Option[string]` to `seq[string]` to others. Then, captures are treated as a fixed set of parameters to a code block. This then allows for propagating matched and parsed captures up to the main rule, where they can safely be returned. This would break all existing nPEGs.
```nim
let parser = peg:
  rule <- expression: Type =
    # arbitrary code execution
    return capture
```
This would still allow the above example - mutating a global-ish variable - but it'd be an antipattern, better solved by a sequence of returned types. For an example of what this would look like, I rewrote below the original PEG I had problems with.
Example with code blocks returning types
```nim
import questionable

let parser = peg("uri"):
  unreserved <- {'a'..'z', '0'..'9', '-', '.', '_', '~'}
  gendelims <- {':', '/', '?', '#', '[', ']', '@'}
  subdelims <- {'!', '$', '&', '\'', '(', ')', '*', '+', ',', ';', '='}
  reserved <- gendelims | subdelims
  percentenc <- "%" * Xdigit[2]
  pathchars <- unreserved | percentenc | subdelims
  decpiece <- Digit |
              ("1" * Digit * Digit) |
              ("2" * {'0'..'4'} * Digit) |
              ("25" * {'0'..'5'})
  ipv4addr <- decpiece * "." * decpiece * "." * decpiece * "." * decpiece
  ipv6piece <- {'0'..'9', 'a'..'f'}[1..4]
  ipv6addr <- (ipv6piece * ":")[7] * ipv6piece |
              (ipv6piece * ":")[0..6] * ":" * (":" * ipv6piece)[1] |
              (ipv6piece * ":")[0..5] * ":" * (":" * ipv6piece)[2] |
              (ipv6piece * ":")[0..4] * ":" * (":" * ipv6piece)[3] |
              (ipv6piece * ":")[0..3] * ":" * (":" * ipv6piece)[4] |
              (ipv6piece * ":")[0..2] * ":" * (":" * ipv6piece)[5] |
              (ipv6piece * ":")[0..1] * ":" * (":" * ipv6piece)[6] |
              ":" * (":" * ipv6piece)[7]
  scheme <- Alpha * *(Alpha | Digit | '+' | '-' | '.')
  userinfo <- *(pathchars | ":")
  host <- ipv4addr | "[" * ipv6addr * "]" | *pathchars
  port <- >*Digit: int =
    return parseInt($1)
  authority <- ?(>userinfo * "@") * >host * ?(":" * >port): (string, ?string, string, ?int) =
    return ($0, $1, $2, $3)
  path <- (?("/") * *(+pathchars * *("/" * *pathchars)))
  query <- *(pathchars | ":" | "@" | "/" | "?")
  fragment <- *(pathchars | ":" | "@" | "/" | "?")
  uri <- ?(>scheme * ":") * ?("//" * >authority) * >path * ?("?" * >query) * ?("#" * >fragment): Url =
    assert $1 is ?string and $2 is ?(string, ?string, string, ?int)
    assert $3 is string and $4 is ?string and $5 is ?string
    return Url(scheme: $1 |? "", authority: $2.?0 |? "",
               userinfo: $2.?1 |? "", host: $2.?2, port: $2.?3 |? 0,
               path: $3, query: $4 |? "", fragment: $5 |? "")
```
To go into more detail:

Instead of having a `capture: seq[string]`, captures captured by `>` are provided as a fixed-size tuple, e.g. `capture: (T, U, V)`. This is made possible by the following changes to capture rules:
- Matches within `?`, `|`, or `*` expressions as `Option[T]`
- Matches on a `()` grouping as `string`
- Sequence concatenation occurs as normal.
- While captures return types, they only return them if individually captured, and this doesn't affect matching.
- Matches on an existing rule, e.g. `>rule`, as `T`, where `T` is the type of the rule.
  - However, the rule still consumes what it would normally consume.
  - E.g. you can have a rule consume a string and return an int.
- All other matches as `string` or `char`, as appropriate.
I'm not sure how much I like my second proposal. It's pretty syntactically dense, albeit with mostly standard syntax and significantly aided by questionable. The idea of rules having a type that can then be captured has stuck in my head though; I don't know what the nicest, cleanest way to do it would be. Sorry if this is a bit of a brain dump - I don't have all the semantics of this approach thought through, but wanted to get them down somewhere.
First, I would like to thank you for creating the powerful npeg project. It is really intuitive and easy to understand.
After using it for a while, I ran into this situation, which led me to this issue. Here is my immature idea: referring to the above discussion, I think it is indeed necessary to represent each capture as a `tuple`, or `seq[seq[T]]`.
```nim
import npeg

type
  TokenKind = enum
    Id, Ellipsis, Comma
  Token = ref object
    kind: TokenKind
    str: string
  Context = ref object

proc `==`(t: Token, tk: TokenKind): bool = t.kind == tk

let lexer = peg(input, tokens: seq[Token]):
  id <- +Alpha:
    tokens.add Token(kind: Id, str: $0)
  ellipsis <- "...":
    tokens.add Token(kind: Ellipsis, str: $0)
  comma <- *Blank * >',' * *Blank:
    tokens.add Token(kind: Comma, str: $1)
  args <- id * *(comma * id) * ?(comma * ellipsis)
  input <- '(' * ?args * ')' * !1

let parser = peg(input, Token, s: seq[string]):
  arg <- [Id]:
    s.add ($0).str
  comma <- [Comma]:
    s.add ($0).str
  ellipsis <- [Ellipsis]:
    s.add ($0).str
  args <- arg * *(comma * arg)
  argList <- args * ?(comma * ellipsis)
  input <- argList * !1

var tokens: seq[Token]
doAssert lexer.match("(a, ...)", tokens).ok

var s: seq[string]
doAssert parser.match(tokens, s).ok
```
In the lexer, I do not want to re-analyze the indeterminate-length result of `id * *(comma * id)`. Therefore, I capture each segment in `id` and `comma`. However, because there is still `?(comma * ellipsis)` later, a backtrack will be generated before encountering `...`. This causes two commas to appear between `a` and `...`, which makes the input fail in the parser. If `*(comma * id)` and `?(comma * ellipsis)` could be represented as `seq[string]`, this condition would be avoided.
In addition, in the `lexer`, the length of the string can be changed to represent the match result when encountering the pattern `*(comma * id)`. However, in the `parser`, because the input is of type `Token`, it is impossible to represent the range of the match without using `seq[Token]`.
This might help with accessing the indeterminate-length result, but not with the issues generated by backtracking.