Would like to discuss the technology and future of this project.
I too spent a lot of time studying MARPA and implementing its algorithms in Swift. I think I came up with some representational improvements, some of which you may have rediscovered, but was never able to make the performance good enough (for reasons I was never able to fully understand). I also have a parse forest representation based on the computed chart, which doesn't require creating a whole new bocage data structure.
I need functionality akin to what MARPA provides, but in Rust, including a scannerless front end. I need really solid documentation, and probably some other things not currently available in gearley. I'd be happy to contribute some of these things here.
(I've also done some research into other generalized parsing technologies that you might be interested in)
@pczarn hoping you are open to discussion.
I've done some extensive and much-needed refactoring of the recognizer after the summer 2024 experimentation in a miniature of the project. After that refactoring, my C language test fails at token number 28 (awesome, some progress compared to an infinite loop or failing at token number 0...). You may see the most recent commits at #17
This project is a huge sink of time, but with enough determination, a revised version could absolutely be available before the end of this year, with a preview MVP coming even earlier. Some of it is already available at https://pczarn.github.io/, where you may see the bocage dumped as text, the external->internal symbol mapping, the precomputed matrices, some other internals, and my use of the Vue framework and its single-file components (SFC) for the presentation of #17
Documentation is nonexistent at this point. Coincidentally, today, before I saw your comment, I've been thinking about creating docs/recognizer/README.md, docs/simple-bocage/Node.md, etc. I've seen this approach in other people's repos (bitvec and probably Kollos/MARPA). What do you think?
The performance of YAEP is surely achievable in a reimplementation of MARPA with good enough representational improvements. Could you achieve anything better than YAEP with any generalized parsing technology? Doubt it. Not without tradeoffs. The best I've ever seen is 100-150 ns per Earley set on older CPUs and somewhat simple grammars.
Do tell me what you may need other than the interface gearley provides. I'm open to future contributions. You're the second person to contact me about the design, so this is probably the time to put out documentation and move forward with the presentation in the Vue framework. I'm currently doing some thinking about the scannerless interface.
Happy to see interest in MARPA's technology and in my project. From time to time I discuss progress and put out some updates on MARPA's colabti.org/Libera IRC chat.
I also have a parse forest representation based on the computed chart, which doesn't require creating a whole new bocage data structure.
I've learned to create the bocage (parse forest) as a separate structure whose nodes are built in parallel with the progress of the parser (recognizer). This idea came after reading Elizabeth Scott's papers. My bocage nodes are created immediately (as soon as possible) and are already topologically sorted on creation. I've found the separation of the bocage to be performant and a good fit for my implementation.
Forgot to mention that my schedule allows only 30 minutes of coding on some days and 2 hours on others, so I am essentially doing this in my spare time, bit by bit.
Further, I prefer not to accept reports or contributions until the new version is released.
This project is a huge sink of time
I hear ya. Parsing is a deep, deep well; I put my Earley parser implementation on ice some time ago (the repo is “archived”). Then I got sucked into investigating Grzegorz Herman's work… 🤦
The wasm visualizer is very cool.
I've been thinking about creating docs/recognizer/README.md, docs/simple-bocage/Node.md etc. I've seen this approach in other people's repos (bitvec and probably Kollos/MARPA). What do you think?
I'm not sure what approach you're referring to; more documentation is always welcome but I'm sure you mean something more specific than just adding more files…(?)
Could you achieve anything better than YAEP with any generalized parsing technology? Doubt it.
I am truly blown away by the speed of YAEP. I thought I was really onto something when I discovered I could store Earley/Leo items in 1.5 64-bit words and put all of them into one contiguous array; I thought for sure that the performance win from locality would be very significant, but I couldn't even match MARPA performance… I'm still baffled as to why. I do think Herman's technology could plausibly significantly beat YAEP, but it's very hard to understand what he's doing. Recognition is pretty straightforward but actually capturing parse information is a bigger challenge… and I don't currently have very strong performance requirements.
Do tell me what you may need other than the interface gearley provides.
Some things we need; you may already provide some of them:
- A scannerless mode, while preserving the current ability to parse non-textual content.
- Really complete documentation
- Ability to control the information associated with each token
Some things I want but don't currently need:
- Ability to associate costs with productions, to control the order in which ambiguous parses are visited
- An interface that lets me push tokens into the parser rather than having the parser ask for the next token
- Support for overlapping tokens
Happy to see interest in MARPA's technology and in my project.
I spent a lot of time looking at MARPA myself and even gave Mr. Kegler a walkthrough of some of my code. We discussed what the next MARPA ought to look like and my recommendation was that it should be very much like gearley (safe, value semantic).
I've found the separation of the bocage to be performant and a good fit for my implementation.
🤷 I won't argue, especially since I don't care that much about performance for my use-cases. But if you are trying to match or beat YAEP, the upshot of what I've said is that the extra work could be avoided. From talking to Mr. Kegler it seems what's gained by building a bocage is that you can discard the information the chart holds about unproductive partial parses and still explore the productive ones. But it always seemed to me that building the bocage involves at least exploring all the Earley items that made up complete parses, while a typical application may only be interested in one of those parses. Further, you have higher peak memory use while both the chart and bocage are in existence. Anyway, that's the rationale, but as I said I don't care much for myself whether there's a bocage (and we've already seen my ability to guess the performance effects of a given design is… questionable).
I'm not sure what approach you're referring to; more documentation is always welcome but I'm sure you mean something more specific than just adding more files…(?)
Ah, I mean that the files, being separate, may include any kind of writeup: interface, algorithm, usage, and possible future improvements. Separate files suggest freestyle documentation rather than pure interface documentation. There's a need to find the right focus to start with. You'd be the reader, so your expectations about the documentation are the most worth hearing out.
Some things we need; you may already provide some of them:
A scannerless mode, while preserving the current ability to parse non-textual content.
No, currently there's no code for that. Yes, I intend to build textual interfaces on top of non-textual interfaces.
Ability to control the information associated with each token
In gearley, a u32 is associated with each token, and you're meant to create your own side table to look up whatever information you may need.
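For illustration, a minimal sketch of such a side table (hypothetical user code, not part of gearley's API; the names here are made up):

/// Hypothetical per-token data the application cares about.
struct TokenInfo {
    text: String,
    line: u32,
    column: u32,
}

/// The side table: the u32 handed to the recognizer is simply an index into it.
#[derive(Default)]
struct SideTable {
    tokens: Vec<TokenInfo>,
}

impl SideTable {
    /// Record a token and get back the u32 to pass along with the symbol.
    fn intern(&mut self, info: TokenInfo) -> u32 {
        self.tokens.push(info);
        (self.tokens.len() - 1) as u32
    }

    /// Recover the token data from the value stored at a leaf of the forest.
    fn get(&self, associated_value: u32) -> &TokenInfo {
        &self.tokens[associated_value as usize]
    }
}

The u32 returned by `intern` is what you'd pass to `scan` alongside the symbol, and later look up again from the leaf's associated value.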
Some things I want but don't currently need:
Ability to associate costs with productions, to control the order in which ambiguous parses are visited
I had that in mind, but left it untested and forgot to pay attention to it. This is very old code:
pub trait Order {
    /// Apply the order to sum node alternatives.
    fn sum<'b>(&mut self, alternatives: &'b [Node]) -> &'b [Node] {
        alternatives
    }

    /// Apply the order to product node factors.
    fn product(&mut self, _factors: &[(Symbol, u32)]) -> Option<usize> {
        None
    }
}
We will need to make some changes to this old interface to make it usable.
You'd be expected to inject it as part of the Bocage, perhaps like so:
struct ReverseOrder;

// Note: this assumes `sum` is changed to take `&mut [Node]`, as part of the
// interface changes mentioned above.
impl Order for ReverseOrder {
    fn sum<'b>(&mut self, alternatives: &'b mut [Node]) -> &'b mut [Node] {
        alternatives.reverse();
        alternatives
    }
}

let bocage = Bocage::with_order(&grammar, ReverseOrder);
An interface that lets me push tokens into the parser rather than having the parser ask for the next token
This is already possible with the low-level part of the current interface. It remains to be done for the higher-level and textual interfaces.
Support for overlapping tokens
This is already available as Recognizer::scan:
pub fn scan(&mut self, symbol: Symbol, value: F::LeafValue)
Reads a token. Creates a leaf bocage node with the given value. After reading one or more tokens, the parse can be advanced.
Currently, you may provide multiple tokens at the same input position to the low-level interface of the Recognizer, like so:
recognizer.begin_earleme();
recognizer.scan(symbol_a, 0);
recognizer.scan(symbol_b, 0);
recognizer.scan(symbol_c, 0);
assert!(recognizer.end_earleme(), "parse failed");
I do think Herman's technology could plausibly significantly beat YAEP
You can win performance with either science or engineering, so you have two variables that determine performance. YAEP chose dynamic programming (as far as I know). I chose to find the most intuitive representation of the science and attack it with software engineering. This is pretty much what the Rust compiler project did with e.g. the borrow checker, too.
I spent a lot of time looking at MARPA myself and even gave Mr. Kegler a walkthrough of some of my code. We discussed what the next MARPA ought to look like and my recommendation was that it should be very much like gearley (safe, value semantic).
Thank you for the kind words about gearley's design. Yes, safety and value semantics give the best results within the Rust language.
From talking to Mr. Kegler it seems what's gained by building a bocage is that you can discard the information the chart holds about unproductive partial parses and still explore the productive ones.
Here, our approaches are very different. In my imagination, I guess discarding unproductive partial parses is much like garbage collection. You drop multiple items scattered around the chart. Not quite sure how it fits into value semantics. This area is open for exploration.
I am discarding entire chunks of the most recent Earley sets. They remain productive in the bocage, but go out of the "scope" of the chart, while everything we need for their evaluation is already in the bocage. This is the code:
fn remove_unreachable_sets(&mut self) {
    // utility: get item's origin set index
    let origin = |item: &Item<F::NodeRef>| item.origin as usize;
    // once our Earley set's done, find out the most recent Earley set
    // which any of our items originated at
    // (or do nothing if empty)
    let max_origin = self.medial.last()
        .iter()
        .map(origin)
        .max()
        .unwrap_or(self.earleme());
    // this is the first Earley set that we may drop
    let new_earleme = max_origin + 1;
    // if the current earleme is at the position we may drop,
    // we have nothing to drop, because we want to move the current
    // earleme into `new_earleme`
    // Also, we do not want to touch our Start of Input (new_earleme != 1)
    if self.earleme() > new_earleme && new_earleme > 1 {
        // truncate our prediction matrix
        self.predicted[new_earleme + 1].clear();
        self.predicted.truncate(new_earleme);
        // this changes the current earleme:
        // `truncate` has special handling for the last Earley set
        self.medial.truncate(new_earleme);
        // for clarity, obvious checks
        debug_assert_eq!(self.medial.len(), new_earleme - 1);
        debug_assert_eq!(self.earleme(), new_earleme - 2);
    }
}
In gearley, a u32 is associated with each token, and you're meant to create your own side table to look up whatever information you may need.
As long as there's a way to get at the side table when I need it without using reference semantics, that sounds fine.
Sorry, I don't have enough context to understand the code you've shown, but it probably doesn't matter that much.
Forgive me if I'm missing something, but in the support for overlapping tokens, what I don't see is a way to make tokens that span multiple input positions. In the general case, your input is a lattice of tokens. The way I understand the efficient version of Earley, where you build one Earley set at a time and never modify it thereafter, the tokens need to be processed in order of their end positions.
YAEP chose dynamic programming (as far as I know). I chose to find the most intuitive representation of the science and attack it with software engineering.
Sorry if I've misunderstood you, but I'm pretty sure that if you've implemented Earley, you've chosen both. As far as I can tell dynamic programming is inherent to the algorithm. That's what allows the same mainstem to contribute to more than one manifestation of the rule's RHS.
Here, our approaches are very different.
Yours and MARPA's?
In my imagination, I guess discarding unproductive partial parses is much like garbage collection. You drop multiple items scattered around the chart.
The way I understand MARPA, the Bocage is a separate data structure that is built from the Chart. So you drop the entire Chart eventually after capturing all the productive partial parses in the Bocage. That's what always seemed pointless to me. It sounds like you are not building a separate data structure(?) in which case I'm not sure what the distinction is between the Bocage and the Chart.
Not quite sure how it fits into value semantics.
Sorry, maybe I wasn't clear; when I say value semantics I mean Mutable Value Semantics. That's the model upheld by Rust so long as you don't subvert it with shared access things like Rc. Rust tries to have references and then tame them, but that's just a complicated way to present MVS.
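A toy illustration of the distinction, in plain Rust (nothing gearley-specific):

use std::cell::RefCell;
use std::rc::Rc;

fn main() {
    // Mutable value semantics: `b` is an independent value; mutating `a` can't affect it.
    let mut a = vec![1, 2, 3];
    let b = a.clone();
    a.push(4);
    assert_eq!(b, vec![1, 2, 3]);

    // Reference semantics, opted into explicitly with Rc/RefCell: both handles
    // observe the same mutation, which is exactly what MVS avoids.
    let shared = Rc::new(RefCell::new(vec![1, 2, 3]));
    let alias = Rc::clone(&shared);
    shared.borrow_mut().push(4);
    assert_eq!(*alias.borrow(), vec![1, 2, 3, 4]);
}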
a way to make tokens that span multiple input positions.
Ah yes, I expected you to mention multiple input positions, though I wasn't sure what they might be useful for. This feature ought to be easy to add with the current code, as long as the user signals the first input position of the token to the recognizer; otherwise the feature will conflict with the code I mentioned earlier, since the relevant part of the chart might simply be deleted. That code will have to be modified as well. Let's say we add functions `begin_scan` and `end_scan`:
/// Marks the beginning of a multi-position token in the current Earley set.
fn begin_scan(symbol: Symbol)
/// Scans the multi-position token in the current Earley set. Call `begin_scan` beforehand.
fn end_scan(symbol: Symbol, num_earley_sets: u32, associated_value: u32)
recognizer.begin_earleme();
recognizer.scan(symbol_a, 0);
recognizer.scan(symbol_b, 0);
recognizer.begin_scan(symbol_c_spanning_two_earlemes);
assert!(recognizer.end_earleme(), "parse failed");
recognizer.begin_earleme();
recognizer.scan(symbol_x, 0);
recognizer.end_scan(symbol_c_spanning_two_earlemes, 2, 0);
assert!(recognizer.end_earleme(), "parse failed");
The way I understand MARPA, the Bocage is a separate data structure that is built from the Chart. So you drop the entire Chart eventually after capturing all the productive partial parses in the Bocage. That's what always seemed pointless to me. It sounds like you are not building a separate data structure(?) in which case I'm not sure what the distinction is between the Bocage and the Chart.
Ah, I understand. The distinction between the Bocage and the Chart is conceptual but not temporal or even really structural. Yes, the Chart and the Bocage are both in existence during the parse. Recall that Earley sets have three different kinds of items:
- predicted partial parse: alpha ::= • beta gamma
- in-progress partial parse: alpha ::= beta • gamma
- complete partial parse: alpha ::= beta gamma •
Internally, each belongs to a different data structure during the parse. Complete partial parses are useless for further advancement of the parse, so they are only briefly included in the Chart. They need NOT be part of any dynamic programming, so they end up in the "Bocage" very soon, as nodes of a topologically sorted directed acyclic graph. In what sense is the "Bocage" separate? It remains a substructure of the Recognizer until the completion of the parse, when it becomes separate from the recognizer as the parse result, together with the handle of the parse's node, ready for evaluation without any additional steps. It needs to be a separate concept to represent the result of the parse.
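To sketch the shape of that separation (an illustrative layout only, not gearley's actual type definitions):

// Illustrative only: the field names follow the discussion above, but the real
// types in gearley differ.
struct Recognizer {
    // predicted items (alpha ::= • beta gamma): one row of bits per Earley set
    predicted: Vec<PredictedRow>,
    // in-progress items (alpha ::= beta • gamma): the part that is still a "chart"
    medial: Vec<MedialItem>,
    // complete items (alpha ::= beta gamma •) become nodes of the bocage
    // immediately, already in topological order
    bocage: Bocage,
}

struct PredictedRow { /* bitset over nonterminal symbols */ }
struct MedialItem { origin: u32, dot: u32, bocage_node: u32 }
struct Bocage { /* topologically sorted DAG of parse forest nodes */ }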
Further, you have higher peak memory use while both the chart and bocage are in existence.
Yes, but in our variants (MARPA and gearley) this is a zero-sum game of storing both complete partial items and useless items/nodes: I have a bigger bocage, MARPA has a bigger chart. The solution for gearley's peak memory use might be to implement a compaction procedure for my bocage, which is up for exploration. You've mentioned modest performance constraints; what about your memory constraints?
Sorry if I've misunderstood you, but I'm pretty sure that if you've implemented Earley, you've chosen both. As far as I can tell dynamic programming is inherent to the algorithm.
Ah, the dynamic programming inherent to the algorithm is there. I meant to say that I don't do many tricks in the recognizer. All the complex tricks are in the preprocessing of the grammar. My grammar preparation runs several algorithms, so I gain easier parsing at the expense of a quite exotic grammar format.
Internally, each belongs to a different data structure during the parse.
Oh, that's an interesting idea. By the end I was memoizing prediction sets, and that was a real speedup for me (although Mr. Kegler said he tried that and it didn't help MARPA); it effectively separated predictions from the rest, but I didn't separate completions. In my implementation, the in-progress parses were still needed at the end to recover the relationship between intermediate dot positions and input position. But if you can store the token IDs at the leaves of your forest that probably isn't necessary for you.
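To make concrete what I mean by memoizing prediction sets, here is a simplified sketch under my own assumptions (not my actual implementation, and the names are made up): the prediction closure is keyed on the bitset of postdot nonterminals of the current Earley set, so identical prediction work is computed once and reused across sets.

use std::collections::HashMap;

// Assumes, purely for illustration, fewer than 64 nonterminals so a set fits in a u64.
type SymbolBitset = u64;

struct PredictionCache {
    // postdot-nonterminal bitset -> closed set of predicted nonterminals
    memo: HashMap<SymbolBitset, SymbolBitset>,
}

impl PredictionCache {
    /// `left_corner[i]` is the precomputed set of nonterminals that nonterminal `i`
    /// (transitively) predicts; the answer for a whole Earley set is the union
    /// over its postdot symbols, computed at most once per distinct key.
    fn predict(&mut self, postdot: SymbolBitset, left_corner: &[SymbolBitset]) -> SymbolBitset {
        *self.memo.entry(postdot).or_insert_with(|| {
            let mut result = 0;
            for (i, &preds) in left_corner.iter().enumerate() {
                if postdot & (1u64 << i) != 0 {
                    result |= preds;
                }
            }
            result
        })
    }
}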
Are you going all the way to CNF? I found a way to eliminate nulling symbols that results in smaller grammars, but I couldn't swear it makes a measurable difference.
Are you going all the way to CNF?
Almost. I don't care that much whether my symbols are terminal or nonterminal.
Care to explain? I didn't care much either but I can't see the relevance of that statement to how much grammar transformation one does or whether you use CNF.
In case you care, this is the fundamental step I used to eliminate nullable symbols.
My RHS is always 1 or 2 symbols - binarized, just like CNF.
Only Start can be nulling, just like CNF.
A terminal and a nonterminal may appear together on the RHS, unlike in CNF. Two terminals may, too.
Unit ~~productions~~ rules are permitted, unlike CNF. I had to forbid infinite cycles of unit derivations.
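For illustration, here is roughly what that binarization looks like - a hypothetical sketch, not gearley's actual code; any RHS longer than two symbols is split using gensyms, and whenever a gensym appears on a RHS it is the first symbol:

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Symbol(u32);

/// Left-binarize `lhs ::= rhs[0] rhs[1] ... rhs[n-1]` into rules whose RHS has
/// at most 2 symbols. `next_gensym` hands out fresh generated nonterminals.
/// e.g. A ::= w x y z becomes: g1 ::= w x; g2 ::= g1 y; A ::= g2 z
fn binarize(
    lhs: Symbol,
    rhs: &[Symbol],
    mut next_gensym: impl FnMut() -> Symbol,
) -> Vec<(Symbol, Vec<Symbol>)> {
    if rhs.len() <= 2 {
        return vec![(lhs, rhs.to_vec())];
    }
    let mut rules = Vec::new();
    // g1 ::= rhs[0] rhs[1]
    let mut prev = next_gensym();
    rules.push((prev, vec![rhs[0], rhs[1]]));
    // g_k ::= g_{k-1} rhs[k+1] for each middle symbol
    for &sym in &rhs[2..rhs.len() - 1] {
        let gensym = next_gensym();
        rules.push((gensym, vec![prev, sym]));
        prev = gensym;
    }
    // finally: lhs ::= g_last rhs[n-1]
    rules.push((lhs, vec![prev, *rhs.last().unwrap()]));
    rules
}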
in-progress parses were still needed at the end to recover the relationship between intermediate dot positions and input position. But if you can store the token IDs at the leaves of your forest that probably isn't necessary for you.
Yes, you may store or figure out the input position in gearley. My leaves are Node::Leaf { terminal: Symbol, associated_value: u32 }
If I may ask, what's the motivation for breaking rules down so there are no more than 2 symbols on the RHS?
- Much easier to eliminate nulling symbols after binarization, just like in Chomsky Normal Form. There are only 4 choices per rule; I called them `enum BinarizedRhsRange { All, Left, Right, None }` (see the sketch after this list).
- I find it easier and more intuitive to handle only rule IDs, not more complex dotted rules. Since I keep RHS positions in different structures, the dotted rule is obvious from the rule ID.
- In-progress items need NOT be deduplicated; that would only be needed for rules of length 3 or more (dotted rules: (2, N)).
- Within the recognizer, I exploit the fact that "gensyms" - symbols generated during binarization - always appear in one place in the grammar, always in the first position of the RHS, so handling them can be made much simpler, reducing the consequences of binarization.
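For completeness, a hypothetical sketch (not gearley's code) of where those four choices per rule come from during nulling-symbol elimination: for a binarized rule A ::= B C, each RHS symbol is either kept or dropped, depending on whether it can derive the empty string.

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum BinarizedRhsRange {
    All,   // keep both:           A ::= B C
    Left,  // drop a nullable C:   A ::= B
    Right, // drop a nullable B:   A ::= C
    None,  // drop both:           A itself becomes nullable
}

fn nulling_variants(left_nullable: bool, right_nullable: bool) -> Vec<BinarizedRhsRange> {
    let mut variants = vec![BinarizedRhsRange::All];
    if right_nullable {
        variants.push(BinarizedRhsRange::Left);
    }
    if left_nullable {
        variants.push(BinarizedRhsRange::Right);
    }
    if left_nullable && right_nullable {
        variants.push(BinarizedRhsRange::None);
    }
    variants
}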
By the way, to clarify, the internals of gearley are mostly my own invention. They draw heavily on MARPA's documentation and on some discussion with Mr. Kegler. I did read MarpaC to figure out bocage traversal, but that algorithm in gearley was later modified and then removed. Therefore I am not sure whether my algorithm can be called MARPA. Further, we have a problem to solve: the time complexity for ambiguous grammars is unproven for my algorithm. I believe it might be worse by a factor of log n or (log n)^2.
Earley/Leo items in 1.5 64-bit words
Interesting, what do you store in there?
what's the motivation for breaking rules down
Ah, I recall it now. The main motivation is constructing a binarized bocage (Shared Packed Parse Forest) to avoid exponential blowup.
Most of the item storage is described here. There's another 32 bits for the index in the chart where the mainstem of the item can be found.
Mind explaining what exponential blowup you're referring to and how binarizing helps avoid it?
With regard to rule IDs vs dotted rules, I have a grammar layout (again in a single array) where a single integer identifies any rule or dotted rule. Or is it the conceptual overhead of dotted rules you're talking about? Isn't there a dot partway through the recognition of a RHS, even a binary one?
Not needing to deduplicate in-progress items is interesting; I don't think I understand why binarizing would make a difference in that (and frankly have forgotten how duplicates even arise in the algorithm—my recognizer deals with them so I know they do).
Within my recognizer, gensyms are just like any other symbol so there's no special handling. I kept a mapping from the transformed grammar positions to the original, which I used to change what a user observes when examining the completed parse forest.
I don't think I understand why binarizing would make a difference in [no duplicated in-progress items]
As long as completion is invoked once per (origin, symbol predicted there) pair, we necessarily get no more than one item per rule that could be advanced there.
exponential blowup
No idea. I'm not that knowledgeable. Took this from Elizabeth Scott and just ran with it:
A shared packed parse forest (SPPF) is a representation designed to reduce the space required to represent multiple derivation trees for an ambiguous sentence. In an SPPF, nodes which have the same tree below them are shared and nodes which correspond to different derivations of the same substring from the same non-terminal are combined by creating a packed node for each family of children. Examples are given in Sections 3 and 4. Nodes can be packed only if their yields correspond to the same portion of the input string. Thus, to make it easier to determine whether two alternates can be packed under a given node, SPPF nodes are labelled with a triple (x, j, i) where aj+1 ...ai is a substring matched by x. To obtain a cubic algorithm we use binarised SPPFs which contain intermediate additional nodes but which are of worst case cubic size. (The SPPF is said to be binarised because the additional nodes ensure that nodes whose children are not packed nodes have out-degree at most two.)
Most of the item storage is described here
So, instead of the bocage node ~~ref~~ id, you have a mainstem. Also, you have the LHS symbol which I don't need. Our storage is of equal size, because my dot is 32 bits.
@dabrahams here's a bug, if you're up to it. The wasm visualizer shows logs as: error "unreachable". I have no idea where this message is coming from. Could you take a look?
I'm trying to debug the C grammar by adding the option to use the C lexer until the scannerless interface is available.
So, instead of the bocage node ref id, you have a mainstem. Also, you have the LHS symbol which I don't need.
It's only the LHS symbol when the rule is a completion, I think, and I'm not sure I need that; it probably just falls out of the code because of the way my grammars are stored (RHS symbols, then LHS). For non-completions it's the transition symbol. The symbol is there so I can keep each Earley set sorted and use binary search to look up various things… like transition items. To be clear, I'm not saying anything about what I did is better; obviously I wasn't able to make it perform well. Just thought it might be of some interest and there might be some useful ideas.
So far the only thing I think might be worth your time to look into at some point is the consequences of a non-binarized grammar representation. Obviously in the worst case I could end up with a binarized grammar too, but I think in most cases my more-minimally-factored grammar could be significantly smaller and thus generate significantly fewer Earley items. Maybe.
About exponential blowup: I read that bit too when I was doing my work, but I didn't see how any representation was going to be smaller in big-O terms than the chart itself (except for the items in the chart that didn't end up as part of any complete parse). The chart already maximally shares subtrees by one measure. Binarizing could result in more sharing if you collapse synthesized LHS symbols when all their RHSes are identical. I think that's more likely to happen in a binarized grammar but that's pure speculation on my part.
Our storage is of equal size, because my dot is 32 bits.
Cool. FWIW, dot position in my case means its index in the grammar, which is just all the symbols of the rules laid end-to-end; 16 bits was thought to be sufficient to represent all grammars. But it always felt like a bit of a squeeze; it's nice that gearley isn't living on the edge that way.
As for your wasm visualizer stuff, I'd be happy to try to help but keep in mind I've never done any wasm work and anything I knew about web development is about 15 years out of date. But if you give me more context, like how to reproduce the problem and get the code, I could take a crack at it.
I just looked, saw the "logs" switch, flipped it, and got what looks OK to my eye (attached).
~~@dabrahams I loaded the most recent code (currently here on master despite work in progress) and made another attempt to debug. So essentially, the wasm call throws an Out Of Memory exception from Lexer::tokenize/Vec::push because we're loading a big C grammar and somehow the WASM can't handle that much memory. Makes me wonder how could one load big grammars in the WASM.~~
To sum up, WASM does not play well with such large grammars. Not sure if WASM in Rust is mature enough to handle larger memory programs.
This went undetected because I was not showing the call stack, just catch (e) { text = e.message }. And an OOM error shows as "unreachable" (abort??)
Never mind, I had an infinite loop on a tokenization error. Good to have it caught.
@dabrahams You have the most recent wasm code on the master branch, and you may try to fix the dysfunctional recognizer.
Choose "c-lexer" & "C with lexer"
We can do this together.
Here is the input to be parsed as a C file:
static reg_errcode_t
byte_regex_compile (pattern, size, syntax, bufp)
     const char *pattern;
     size_t size;
     reg_syntax_t syntax;
     struct re_pattern_buffer *bufp;
{
The recognizer does not expect the { on the last line of this snippet. It expects a continuation of the list of parameter definitions of the function, or (??) another top-level definition. I bet it could be something with completion or prediction. No idea whether gensyms are involved, really. I believe both the grammar and the tokenization are OK for this input to be parsed.
I'll try to help; keep in mind that I don't know your code, though. Meantime, maybe it would be good for you to open an issue for things like this so we can have a separate discussion there? In that issue you could explain what "choose" means in:
Choose "c-lexer" & "C with lexer"
I remembered the exponential explosion issue. It's simple, really: if you de-nullify the grammar by replacing a RHS with every combination of its N nullable symbols removed, you get 2^N new rules for that one RHS (e.g. a RHS with 3 nullable symbols yields 8 variants). But you don't need to binarize rules (which creates O(M) new rules, where M is the length of the RHS) to solve that problem. My approach logically does the same thing as binarized de-nullification; you could almost think of it as replacing strings of consecutive non-nullable symbols with gensyms that expand to the same string, and then doing the binarized de-nullification (but what I do is smaller still), resulting in only O(N) new rules.
Yes, the approaches are different. Thanks for clearing that up. I think we don't need to care that much about performance after binarization. The main culprit is the time and space complexity for ambiguous grammars.
I wonder whether ambiguous grammars could be benchmarked in practice without finding proofs for their performance.
Keep in mind that my explanations may be unclear; that's certainly possible, so do ask for clarification. I am resting after not feeling well.
It may be helpful to ask: what are your goals? Personal project, scientific, other?
I'm trying to improve the readability and functionality of the wasm visualizer. As a start, it's now working under Vue's Vite and PrimeVue.