Arpeggio
Symbols with empty match are not preserved
Here is a minimal example.
Given the grammar:
abc = a b c
a = "a"
b = r'b?'
c = "c"
When it parses "abc" the result is as expected,
but when it parses "ac" the node for b simply disappears from the parse tree:
>>> print(parse_tree)
a | c
The result is the same when using b = "b"? instead.
Since b is actually matched, shouldn't it be in the parse tree (like a or c) with node.value == ''?
That was an early design decision: elements that consume no input are removed from the tree. IIRC the motivation was to keep the parse tree minimal and thus lower memory consumption, but it also made parse trees harder to process, because you can't rely on a constant number of child nodes.
I don't think this behavior will change any time soon, since all users of Arpeggio (e.g. textX) depend on it and changing it would be very disruptive.
There is a way to access the nodes of a non-terminal by rule name, which might be a good general solution that doesn't require changing the current behavior. I haven't checked whether that currently works for non-existing nodes, but I guess that returning None for optional matches on access by name would be the way to go.
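To make the trade-off concrete, here is a toy model of the behavior described above. It is not Arpeggio's implementation: `match_sequence`, the `keep_empty` flag and the `(rule_name, text)` tuples are all hypothetical names invented for this sketch, which only handles a flat sequence of regex rules.

```python
import re
from typing import List, Tuple

# Hypothetical stand-in for a parse tree node: (rule_name, matched_text).
Node = Tuple[str, str]


def match_sequence(rules: List[Tuple[str, str]], text: str,
                   keep_empty: bool = False) -> List[Node]:
    """Match regex rules in order against `text`, left to right.

    Matches that consume no input are dropped unless keep_empty is True.
    """
    pos = 0
    children: List[Node] = []
    for rule_name, pattern in rules:
        m = re.compile(pattern).match(text, pos)
        if m is None:
            raise ValueError(f"rule {rule_name!r} failed at position {pos}")
        if m.group() or keep_empty:
            children.append((rule_name, m.group()))
        pos = m.end()
    if pos != len(text):
        raise ValueError(f"unparsed input at position {pos}")
    return children


# Mirrors the grammar abc = a b c with b = r'b?'.
abc = [("a", "a"), ("b", "b?"), ("c", "c")]

full = match_sequence(abc, "abc")                    # three children
short = match_sequence(abc, "ac")                    # b dropped, two children
padded = match_sequence(abc, "ac", keep_empty=True)  # b kept with empty text
print(full)
print(short)
print(padded)
```

With `keep_empty=True` the child count stays at three for both inputs, which is what makes positional access predictable; the default mode trades that away for a smaller tree.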
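Until that is settled, a small helper on the user side can give the "None when absent" semantics. This is only a sketch: `child_by_rule` is a hypothetical helper, not part of Arpeggio, and it assumes nothing beyond children being iterable and carrying a `rule_name` attribute. The `SimpleNamespace` objects are stand-ins for real nodes so the snippet runs on its own.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Optional


def child_by_rule(children: Iterable[Any], rule_name: str) -> Optional[Any]:
    """Return the first child produced by `rule_name`, or None if absent."""
    return next((c for c in children if c.rule_name == rule_name), None)


# Stand-in for the tree printed as "a | c" above.
tree = [SimpleNamespace(rule_name="a", value="a"),
        SimpleNamespace(rule_name="c", value="c")]

b = child_by_rule(tree, "b")  # the empty match left no node behind
c = child_by_rule(tree, "c")
print(b, c.value)
```

Note that this only identifies a node when each rule appears at most once in the sequence.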
@igordejanovic Hi! I tried the approach you suggested.
The empty nodes still don't appear when accessed by rule name.
That causes problems with a grammar like this:
a = b c "/" b c
b = r'b?'
c = r'c?'
>>> parse_tree = parser.parse("c / b")
>>> print(parse_tree)
c | / | b
>>> print(parse_tree.b)
b
>>> print(parse_tree.c)
c
Now I have to check the contents of parse_tree against every possible combination to find out which b is present.
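One possible workaround is to align the surviving children against the expected slot order of the rule. The sketch below is hedged accordingly: `Node`, `align_children` and the slot name `slash` (standing for the "/" literal) are hypothetical names for illustration, and the only thing assumed about real nodes is a `rule_name` attribute. It handles a flat sequence only.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Node:
    """Hypothetical stand-in for a parse tree node."""
    rule_name: str
    value: str


def align_children(children: Sequence[Node],
                   slots: Sequence[str]) -> List[Optional[Node]]:
    """Assign surviving children to grammar slots, left to right.

    A slot whose rule name does not match the next child becomes None.
    """
    aligned: List[Optional[Node]] = []
    i = 0
    for slot in slots:
        if i < len(children) and children[i].rule_name == slot:
            aligned.append(children[i])
            i += 1
        else:
            aligned.append(None)
    if i != len(children):
        raise ValueError("children do not fit the expected slot order")
    return aligned


# Surviving children for the input "c / b"; "slash" is an assumed name for "/".
children = [Node("c", "c"), Node("slash", "/"), Node("b", "b")]
slots = ["b", "c", "slash", "b", "c"]

first_b, first_c, _, second_b, second_c = align_children(children, slots)
print(first_b, first_c.value, second_b.value, second_c)
```

This works here because the "/" separator disambiguates the two halves. For a sequence such as b b with a single surviving child, the alignment always picks the leftmost slot, so it is a guess rather than a recovery of what the parser actually did.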
Would it be possible to have an option to retain all the children?
Probably it would. I'm trying to figure out a general solution.
For example, you would have the same problem if you applied the optional operator (?) to a rule instead of using ? inside the regular expression:
a = b? c "/" b c?
b = r'b'
c = r'c'
There is also the ZeroOrMore operator (*), which can match zero times:
a = b* c "/" b c*
b = r'b'
c = r'c'
All the grammars above match your input and return the same tree.
Let's leave this open as a feature request. It seems that this needs more analysis.