Question about code blocks and backtracking
You write:
"Note that for code block captures, the Nim code gets executed during parsing, even if the match is part of a pattern that fails and is later backtracked."
I have been modeling a semi-complicated Prolog-like language with lots of committed choices (and hence lots of places for backtracking). My global state is a stack which, during code-block execution, is popped and pushed, eventually rolling leaf nodes up into higher grains (I'm trying to model my nodes as variant objects, like the JSON parser does).
My question is multi-fold: 1) given your statement above that code gets executed even in dead-end parsing, is there any API for knowing this, and possibly for allowing code blocks to commit/roll back the global state? 2) do you have any suggestions for a good technique for this? (The best I can think of is a stack of stacks, where the inner stacks represent speculative parses that can be dropped or committed; see the sketch below.) 3) am I overthinking this?
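For question 2, here is the rough shape of the stack-of-stacks idea; everything here is hypothetical and assumes NPeg would give me some kind of begin/commit/rollback hook to call:

type
  Node = object
    # stand-in for my variant-object AST nodes
    name: string

# one inner stack per open speculation level; frames[0] is committed state
var frames: seq[seq[Node]]
frames.add newSeq[Node]()

proc beginSpeculation() =
  # entering an ordered choice: push a fresh speculative stack
  frames.add newSeq[Node]()

proc commit() =
  # the choice succeeded: fold the speculative nodes into the parent frame
  let top = frames.pop()
  frames[^1].add top

proc rollback() =
  # the choice was backtracked: drop everything from the dead-end branch
  discard frames.pop()

proc push(n: Node) =
  # code-block captures always push onto the innermost frame
  frames[^1].add n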
thanks, Daniel
My question is multi-fold: 1) given your statement above that code gets executed even in dead-end parsing, is there any API for knowing this, and possibly for allowing code blocks to commit/roll back the global state?
That is actually a pretty good question, and I have to admit I never gave this much thought. I do see your point, however, and think it might be worth spending some time on a proper solution.
- do you have any suggestions for a good technique for this
Actually, no. When NPeg decides to backtrack, the user simply has no way of knowing.
I think I never ran into this issue because in practice most languages simply are not ambiguous enough to need much backtracking: stuff either matches or it doesn't. But of course NPeg aims to offer a generic solution that fits a lot of needs, so let's see if we can make this work.
What would help me here is a small, minimal example of a grammar and input which shows the "wrong" behaviour for you, so we can see what NPeg should provide to make this work. Is that doable?
N.B. The one open GitHub issue at this moment is about defining and implementing a proper API for block captures, since the currently available interface is a collection of ad-hoc method-like functions, injected variables, and return values. I think the time has come to give this more attention and define it properly, as this is the one issue that prevents me from going "1.0" with NPeg. Your issue is definitely related to this.
Zevv, thanks for the response.
I will work on putting together a small example for you, but I have a 95% done grammar here: https://gist.github.com/reverendpaco/fc64863d15539349e856a5e1883b73df
I was using Tatsu (for Python) before I decided I needed/wanted a language that could produce static binaries and was linkable to C (hence Nim). I also like that NPeg offers Pratt precedence operators, as I hope to parse SQL expressions according to their defined precedence.
In general, the committed choices I used in this grammar came from two sources:
- a desire for cleanliness/modularization in the look of the grammar
- actual ambiguity.
For instance, in my Datalog-to-SQL transpiling language, I hope to map Prolog/Datalog predicates to SQL tables, so everything is based on the functor form:
employee(*). // employee is the functor
But as I hope to allow internal subqueries, a predicate/query can look like any of the following:
employee([first_name,last_name]). // select first_name,last_name from employee
employee(+[length(last_name)]). // add length as new column to all other columns
employee(-[salary]) // remove the salary column from the list of all columns
employee(*). // select everything
employee(**). // select everything but bind all column names to LVars
[email protected](*). // put an alias on the table employee 'a'
// and use the employee table in the 'prod' schema
employee(FirstName,LastName, #... etc). // traditional Datalog positional
So a rule for this in NPeg would look like:
named_predicate <- >?alias * ?ns_wrapper * >functor * "(" * functorArgs * ")"
functorArgs <- projection_like | double_star | star | eraser_like | datalog_like  # double_star must be tried before star: once star matches "*", the ordered choice is committed and "**" can never match
eraser_like <- "-[" * columns * "]"
datalog_like <- columns
columns <- # recursively defined rule of individual columns
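For what it's worth, here is a stripped-down, runnable approximation of that shape (the rule bodies and names here are simplified stand-ins I made up for illustration, not my real grammar; that is in the gist above):

import npeg

let p = peg "named_predicate":
  named_predicate <- >functor * "(" * functorArgs * ")" * "."
  functor <- +Alpha
  functorArgs <- projection_like | adder_like | eraser_like |
                 double_star | star | datalog_like
  projection_like <- "[" * columns * "]"
  adder_like <- "+[" * columns * "]"
  eraser_like <- "-[" * columns * "]"
  double_star <- "**"
  star <- "*"
  datalog_like <- columns
  columns <- column * *("," * column)
  column <- +(Alpha | '_')

doAssert p.match("employee(**).").ok
doAssert p.match("employee(-[salary]).").ok
doAssert p.match("employee(first_name,last_name).").ok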
If you want to see a whole bunch of examples of what I hope my language will do, transpiling from Datalog to SQL, see here: https://docs.google.com/document/d/1qf19JmPbSnX4h-fPgPU6QmL6AvEKVXxUBGtaCbyJenM/edit?usp=sharing
I am only 3 days into learning Nim (what better way than to get in over my head with a parser project?). I am really eager to be able to use it for its performance and interoperability (especially with SQLite, which is all in C).
I suppose, in the spirit of being some random person requesting help from an open-source author, I could offer another thought on how I could accomplish this, though it would be a significant change to how NPeg works today, and it sidesteps the entire idea of global state: if NPeg were to return not a seq[string] in its capture but a richer AST marked up with the rule that generated it, then I could walk the AST and not have to worry about real-time serialization to a concrete model until after parsing.
I am using Tatsu as an example here, but I've seen the Guile PEG parser and some others do it. If I were to have a rule:
some_rule <- >first * second * >third * >fourth * >my_name:fifth
fourth <- "MATCHME!"
fifth <- "HERE" # note: second is matched but not captured, so it does not appear below
then the AST returned would be a dictionary/array combo like:
{some_rule: [ {first: ...} , {third: ...}, {fourth: "MATCHME!"}, {my_name: "HERE"} ] }
much like JSON, but with only strings, arrays, and dictionaries.
I really am a fan of the way Tatsu does this and allows modification/tweaking of the AST as it's generated, including changing singletons to arrays and 'join'-like rules which automatically roll up separator-delimited lists.
Sorry for the longish comment, and I appreciate all the work you and other open-source library authors put in.
I will work on putting together a small example for you, but I have a 95% done grammar here: https://gist.github.com/reverendpaco/fc64863d15539349e856a5e1883b73df
That's a pretty impressive grammar.
I was using Tatsu (for Python) before I decided I needed/wanted a language that could produce static binaries and was linkable to C (hence Nim). I also like that NPeg offers Pratt precedence operators, as I hope to parse SQL expressions according to their defined precedence.
Precedence operators are not widely used yet, I believe, so I'd love to hear how they work for you!
[...]
I am only 3 days into learning Nim (what better way than to get in over my head with a parser project?).
Well, the good thing about Nim is that if you have some experience in another language, you should be pretty much OK just jumping in like that. Tons of advanced features, but you don't need them to get started with the basics.
I suppose, in the spirit of being some random person requesting help from an open-source author, I could offer another thought on how I could accomplish this, though it would be a significant change to how NPeg works today, and it sidesteps the entire idea of global state: if NPeg were to return not a seq[string] in its capture but a richer AST marked up with the rule that generated it, then I could walk the AST and not have to worry about real-time serialization to a concrete model until after parsing.
Well, NPeg actually was able to do that up until a few versions ago: it could capture a basic tree data structure. This is from the README of version 0.20.0: http://zevv.nl/div/ast.html
I one day decided to throw that out, as I had never heard of a single person actually using it, and it was a complicating factor for a large refactoring I was doing at the time. I also found that it was too limited in the types and forms of trees it could build.
If you think that would solve your problem, I could see what it takes to resurrect this feature; now that the refactorings are done, I might be able to fit it in without making things too hairy again.
much like JSON, but with only strings, arrays, and dictionaries.
Ah, JSON. NPeg used to be able to do that as well: parse a tree and build a JSON tree from it. Abandoned for the same reason as the AST, though.
So, maybe it was not a good decision to throw that all out :/
I really am a fan of the way Tatsu does this and allows modification/tweaking of the AST as it's generated, including changing singletons to arrays and 'join'-like rules which automatically roll up separator-delimited lists.
I'm not familiar with Tatsu, so I'll do some reading up on that. Thanks!
Sorry for the longish comment, and I appreciate all the work you and other open-source library authors put in.
You're welcome. I do appreciate this kind of feedback; it's interesting to see real-world use cases instead of only tiny toy parsers.
I also found that it was too limited in the types and forms of trees it could build.
I have been struggling with this as I've learned Nim. The variant object facility provides a semi-union type, but it is not quite ideal for the needs of a parser. I would think it's best to ignore richer types and stick to just the strings that were matched, arrays representing the concatenative order in which rules were composed, and a table keyed by rule name. Stealing from the JSON source code, I think something like this could model it. Your match() proc could return a seq[NPegNode], and someone could write (or you could provide) a DFS walker to inductively generate their concrete types from this parse tree:
import std/tables

type
  NPegNodeType = enum
    dic, leaf, sequence

  NPegNode {.acyclic.} = object
    case kind: NPegNodeType
    of dic: table: OrderedTable[string, NPegNode]
    of leaf: data: string
    of sequence: values: seq[NPegNode]
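For instance, such a walker might look something like this (a rough sketch assuming the NPegNode type above; the printing is just a placeholder for building concrete types):

import std/[tables, strutils]

proc walk(n: NPegNode, depth = 0) =
  # depth-first traversal of the parse tree; a real walker would
  # construct concrete domain objects here instead of printing
  case n.kind
  of leaf:
    echo "  ".repeat(depth) & n.data
  of sequence:
    for child in n.values:
      walk(child, depth + 1)
  of dic:
    for name, child in n.table.pairs:
      echo "  ".repeat(depth) & name & ":"
      walk(child, depth + 1)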
I am also looking at waxeye since it compiles to C and I could use it from Nim.
I have been struggling with this as I've learned Nim.
Haven't we all. Sometimes I get lost and try to do all kinds of things with Nim that only make sense in duck-typed languages, but then you realize it's still a statically typed, compiled language that forces limits upon you, because it wants to know at compile time that things will be OK at run time.
Stealing from the JSON source code, I think something like this could model it. Your match() proc could return a seq[NPegNode], and someone could write (or you could provide) a DFS walker to inductively generate their concrete types from this parse tree:
The very first thing NPeg could do, before AST captures or code block captures, was JSON captures. You could indicate which parts should be captured as a JSON string, number, or bool, and how inner captures should be stored inside JSON arrays or objects
(http://zevv.nl/div/json.html)
That was also deprecated and thrown out, just like the AST :/
Is there still anything I can do to help you out with this?
Closing this, feel free to reopen if it's still relevant.