lrpeg
Suppress output of unnecessary nodes
This is my grammar file:
https://github.com/oovm/lrpeg-test/blob/master/projects/lrpeg/src/ygg.peg
This is the parse output for x=0:
https://github.com/oovm/lrpeg-test/blob/95b032c9ca0fd46c39ea31edf480d41eaecdec1b/projects/lrpeg/tests/assign.yaml#L23-L34
I don't understand why there are two Terminals here, and why the statement nodes I need are wrapped in children.
In my understanding, a Terminal is ε, a string, or a regex, so it should have no children.
You are right, this looks broken. (statement IGNORE)* becomes a node in the tree, which is incorrectly labelled Terminal. I've just pushed a change which labels such nodes List (amongst others). Let me know how this works for you.
I'm just translating a pest grammar:
program = {SOI ~ statement* ~ EOI}
vs
program <- IGNORE (statement IGNORE)* EOI;
// IGNORE means anything that can be skipped
IGNORE <- space* / newline* / comment?
a ~ b => a IGNORE b
a ~ b? => a IGNORE b?
a ~ b* => a IGNORE (b IGNORE)*
a ~ b+ => a ~ b ~ b*
From the results, pest did not generate additional nodes
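The rewrite rules above can be applied mechanically. Here is a minimal sketch, assuming a sequence is just a head rule followed by items with an optional suffix. `Item`, `tail` and `translate` are names made up for this illustration and are not part of pest or lrpeg.

```rust
/// One element of a pest sequence after the head: `b`, `b?`, `b*` or `b+`.
/// Hypothetical type, used only for this illustration.
pub enum Item<'a> {
    One(&'a str),
    Opt(&'a str),
    Star(&'a str),
    Plus(&'a str),
}

/// Render what `~ item` becomes in the PEG grammar.
fn tail(item: &Item) -> String {
    match item {
        // a ~ b  => a IGNORE b
        Item::One(b) => format!("IGNORE {b}"),
        // a ~ b? => a IGNORE b?
        Item::Opt(b) => format!("IGNORE {b}?"),
        // a ~ b* => a IGNORE (b IGNORE)*
        Item::Star(b) => format!("IGNORE ({b} IGNORE)*"),
        // a ~ b+ => a ~ b ~ b*, then the two rules above apply
        Item::Plus(b) => format!("IGNORE {b} IGNORE ({b} IGNORE)*"),
    }
}

/// Translate `head ~ item ~ item ...` by applying the rules literally.
pub fn translate(head: &str, rest: &[Item]) -> String {
    let mut out = head.to_string();
    for item in rest {
        out.push(' ');
        out.push_str(&tail(item));
    }
    out
}
```

Because the rules are applied literally, a longer sequence can end up with two IGNOREs next to each other (for example after a `b*`), which the hand-written program rule above avoids.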
Okay, that makes sense. After I traverse the tree once and flatten it, the result is correct.
lrpeg does generate too many nodes. Lots of them do not have useful information.
Does pest have a way of marking a rule/line as "do not generate nodes for this", or is it clever in some other way?
In my opinion, whether a node is useful depends on the purpose. My classification is:
- Useless: hard to think of a use
  - ε, EOI
- Ignored: formatting needs these semantics
  - comment, space, newline
- Unnamed (weak semantics): macros need these semantics
  - keywords, brackets, operators
- Effective semantics:
  - others
According to this classification, my filter looks like this
pub fn flatten(node: Node) -> Node {
    let mut buffer = vec![];
    for node in node.children {
        flatten_rec(node, &mut buffer)
    }
    Node {
        rule: node.rule,
        start: node.start,
        end: node.end,
        children: buffer,
        alternative: node.alternative,
    }
}

pub fn flatten_rec(node: Node, buffer: &mut Vec<Node>) {
    match node.rule {
        // flatten these nodes: splice their children into the parent
        Rule::Any | Rule::List => {
            for node in node.children {
                flatten_rec(node, buffer)
            }
        }
        // not important
        Rule::EOI => {}
        #[cfg(feature = "no-ignored")]
        Rule::IGNORE => {}
        #[cfg(not(feature = "no-ignored"))]
        Rule::IGNORE if node.start == node.end => {}
        #[cfg(feature = "no-unnamed")]
        Rule::Terminal => {}
        #[cfg(not(feature = "no-unnamed"))]
        Rule::Terminal if node.start == node.end => {}
        _ => buffer.push(flatten(node)),
    }
}
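To see what the filter does on a concrete tree, here is a self-contained sketch. The `Rule` and `Node` definitions are stand-ins invented so the example compiles on its own (lrpeg's generated types may differ), the feature flags and the `alternative` field are left out, and the tree in `example_tree` is only a guess at the shape produced for x=0.

```rust
// Hypothetical stand-ins for the generated types; the real ones may differ.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Rule {
    Program,
    Statement,
    Ignore,
    Terminal,
    List,
    Eoi,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Node {
    pub rule: Rule,
    pub start: usize,
    pub end: usize,
    pub children: Vec<Node>,
}

fn node(rule: Rule, start: usize, end: usize, children: Vec<Node>) -> Node {
    Node { rule, start, end, children }
}

/// Rebuild `node` with wrapper and empty nodes removed from its subtree.
pub fn flatten(node: Node) -> Node {
    let Node { rule, start, end, children } = node;
    let mut buffer = Vec::new();
    for child in children {
        flatten_rec(child, &mut buffer);
    }
    Node { rule, start, end, children: buffer }
}

fn flatten_rec(node: Node, buffer: &mut Vec<Node>) {
    match node.rule {
        // Wrapper node: splice its children into the parent.
        Rule::List => {
            for child in node.children {
                flatten_rec(child, buffer);
            }
        }
        // Never useful.
        Rule::Eoi => {}
        // Empty matches carry no information.
        Rule::Ignore | Rule::Terminal if node.start == node.end => {}
        _ => buffer.push(flatten(node)),
    }
}

/// Assumed tree for the input `x=0` (three bytes, offsets 0..3).
pub fn example_tree() -> Node {
    let statement = node(Rule::Statement, 0, 3, vec![node(Rule::Terminal, 1, 2, vec![])]);
    let repeat = node(
        Rule::List,
        0,
        3,
        vec![node(Rule::List, 0, 3, vec![statement, node(Rule::Ignore, 3, 3, vec![])])],
    );
    node(
        Rule::Program,
        0,
        3,
        vec![node(Rule::Ignore, 0, 0, vec![]), repeat, node(Rule::Eoi, 3, 3, vec![])],
    )
}
```

Calling `flatten(example_tree())` leaves the program node with the statement as its only child. The non-empty Terminal inside the statement is kept, because only zero-width Terminals are dropped in this variant.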
How can the parser generator decide which nodes to create and which to skip?
We could take inspiration from pest and not create nodes for rules whose names start with an underscore.
It sounds like a feasible design, but if the node is hidden, who will hold the label and alternative attached to it?
E.g., what is the result of 1+2 under these rules:
expr <- _expr0;
_expr0 <-
    add:/ <lhs:_expr0> ("+"/"-") <rhs:_expr1>
    / _expr1;
_expr1 <-
    mul:/ <lhs:_expr1> ("*"/"/") <rhs:_expr2>
    / _expr2;
_expr2 <-
    pow:/ <lhs:_expr3> "^" <rhs:_expr2>
    / _expr3;
_expr3 <- num:/num;
So the idea is that if a node is hidden, its parent inherits its (non-hidden) children. So lhs and rhs are in the parse tree, even though _expr0 is not.
That does leave the question of the alternative, though.
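One way to picture this is as a post-processing pass over the tree. This is only a sketch under assumptions, not how lrpeg does or will do it. The `Node` shape, `hoist_hidden`, and the policy of handing a hidden node's label to a sole unlabelled child are all invented here. The example tree for 1+2 also skips the intermediate _expr1/_expr2 levels for brevity.

```rust
// Hypothetical node shape: rule name plus optional label and alternative.
#[derive(Debug, Clone, PartialEq)]
pub struct Node {
    pub rule: String,
    pub label: Option<String>,
    pub alternative: Option<String>,
    pub children: Vec<Node>,
}

fn node(rule: &str, label: Option<&str>, alternative: Option<&str>, children: Vec<Node>) -> Node {
    Node {
        rule: rule.to_string(),
        label: label.map(str::to_string),
        alternative: alternative.map(str::to_string),
        children,
    }
}

/// Replace every hidden node (rule name starts with `_`) by its children.
/// Returns a Vec because a hidden node may expand to zero or many nodes.
pub fn hoist_hidden(node: Node) -> Vec<Node> {
    let Node { rule, label, alternative, children } = node;
    let mut children: Vec<Node> = children.into_iter().flat_map(hoist_hidden).collect();
    if !rule.starts_with('_') {
        return vec![Node { rule, label, alternative, children }];
    }
    // Assumed policy: a hidden node passes its label to a sole unlabelled child.
    if children.len() == 1 && children[0].label.is_none() {
        children[0].label = label;
    }
    // The hidden node's alternative has nowhere to go and is dropped here.
    children
}

/// Simplified tree for `1+2` under the grammar above.
pub fn example_tree() -> Node {
    let num = || node("_expr3", None, Some("num"), vec![node("num", None, None, vec![])]);
    let add = node(
        "_expr0",
        None,
        Some("add"),
        vec![
            node("_expr0", Some("lhs"), None, vec![num()]),
            node("Terminal", None, None, vec![]),
            node("_expr1", Some("rhs"), None, vec![num()]),
        ],
    );
    node("expr", None, None, vec![add])
}
```

With this policy, `hoist_hidden(example_tree())` yields a single expr node whose children are num (labelled lhs), the Terminal, and num (labelled rhs). The add alternative is gone, which is exactly the open question. The label hand-off also stops working as soon as a labelled hidden node expands to more than one child, as it would for (1+2)+3.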