textmapper
textmapper copied to clipboard
bug: Incorrect parent and firstChild set for leftmost and rightmost nonterminals of zero length
Creating a dedicated issue to track the bug identified at https://github.com/inspirer/textmapper/pull/17#issuecomment-433245585
Edit: For added context, the only hand-written files are parser.go and tree.go, both of which are mostly identical copies of the original parser.go and tree.go of the TextMapper project. Which is why this bug is interesting to fix as it pertains to TextMapper itself.
I'll add my latest debug session below, feel free to skim as it's quite long :)
The debug session below is debugging the none command, which uses the mini.tm grammar, and is invoked on the example.ll input file.
Specifically the following commands may be used to reproduce the debug session:
$ go get -u github.com/mewspring/foo/none/cmd/none
$ go get -u github.com/derekparker/delve/cmd/dlv
$ go get -u github.com/aarzilli/gdlv
$ cd $GOPATH/src/github.com/mewspring/foo/none/cmd/none
$ $GOPATH/bin/gdlv debug ../../example.ll
Cheers, Robin
Debug session
Is it correct that next should be overwritten here?
Then we will skip /* empty */
nonterminals that are in between other nonterminals.
Should it really be o >= endoffset
? And not o > endoffset
? Now, we decrease the end, so that InstMetadata will not be part of the call instruction.
OperandBundles and InstMetadata were not assigned to have CallInst as parent. They should have been. Note, end was too small. Most likely as a cause of using if o >= offset { end-- }
instead of if o > offset { end-- }
This is probably wrong, as now OperandBundles and InstMetadata won't have CallInst as parent.
This is where it goes wrong. Really wrong. As the OperandBundles and InstMetadata of the previous CallInst are added to the current CallInst.
The only reason this happens is since OperandBundles and InstMetadata are zero in length, thus offset if not enough to distinguish which parent nonterminal they belong to.
This is the incorrect firstChild. It points to InstMetadata, but should really point to
Incorrect parents have now been set for InstMetadata and OperandBundles.
No parent was set for OperandBundles and InstMetadata. They should have both been assigned the last CallInst as parent (i.e. index 18).
Since InstMetadata (at index 10) is now the firstChild of the CallInst at index 18, and since the InstMetadata does not have a next (i.e. next is 0), the CallInst will not have any other children than the incorrectly added InstMetadata child. Therefore, calling Typ on CallInst will result in NONE type.
InstMetadata (at index 17) of previous instruction being added to have RetTerm as parent. The correct InstMetadata to add to RetTerm is at index 20.
FuncMetadata incorrectly added to have FuncBody as parent.
InstMetadata incorrectly added to have FuncBody as parent (should have been part of TermRet).
Panic with unknown node type NONE
, since CallInst has InstMetadata as first child.
Yeah, this is working as expected with the given data model. The parser produces a bunch of ranges, and the AST builder looks at how they are nested to produce the correct structure. This breaks profoundly with nullable nonterminals. First of all, the parser cannot properly position a nullable nonterminal in source code, and just puts it right before the next token, consider:
input : A d ;
A : b copt ;
and b d
as an input. There are two possibilities for positioning copt: right after b
and right before d
. Textmapper always picks the latter, which leads to including trailing spaces in both copt
and A
. The proper positioning depends on the outer rules, and we don't know which rule is going to be applied.
There are two solutions to this problem:
- To avoid (forbid?) nullable nonterminals at first and last positions in a rule. This is the rule I'm currently following everywhere. Textmapper should flag when a nullable terminal is used in such a way (a TODO for me).
- Propagate more information into the AST builder so that it can adjust offsets for empty nodes. This is a very valid approach but it comes with performance penalties I'd like to avoid (we can/should still experiment and measure but my gut feeling says that avoiding nullable terminals in the first place is a much cheaper and better approach).
The rewritten grammar should look like this:
input : A d ;
A : b c? ;
For now, I'll use this issue to work on flagging occurrences of nullable nonterminals in unwanted locations in event based-parsers.
To avoid (forbid?) nullable nonterminals at first and last positions in a rule. This is the rule I'm currently following everywhere. Textmapper should flag when a nullable terminal is used in such a way (a TODO for me).
I see your point and I was thinking of this too. This is definitely the simplest path to take for TextMapper. It may however put additional constraints on TM grammar writers. Specifically for the grammar of LLVM IR, optional nonterminals as used everywhere. So for this grammar, they would have to be rewritten to use Fooopt
or Foo?
or Bar*
or Bar+?
instead of nullable nonterminals.
GlobalDecl uses this approach:
GlobalDecl -> GlobalDecl
: Name=GlobalIdent '=' ... GlobalAttrs=(',' GlobalAttr)+? FuncAttrs=(',' FuncAttr)+?
;
It would break for the current implementation of FuncDef I believe as FuncMetadata
could belong to either FuncHeader
, FuncBody
or FuncDecl
. With the current implementation it would belong to FuncBody
right? It should belong to FuncDef
.
FuncDef -> FuncDef
: 'define' Header=FuncHeader Metadata=FuncMetadata Body=FuncBody
;
The solution being to inline the definition of FuncMetadata wherever used.
FuncMetadata -> FuncMetadata
: MetadataAttachments=MetadataAttachment*
;
Just to give a sense for how much LLVM IR uses optional nonterminals, below follows the definition of FuncHeader:
FuncHeader -> FuncHeader
: (Linkage | ExternLinkage)? Preemptionopt Visibilityopt DLLStorageClassopt CallingConvopt ReturnAttrs=ReturnAttr* RetType=Type Name=GlobalIdent '(' Params ')' UnnamedAddropt AddrSpaceopt FuncAttrs=FuncAttr* Sectionopt Comdatopt GCopt Prefixopt Prologueopt Personalityopt
;
I mainly give these specific usage examples to have a concrete grammar as a reference when deciding on either approach 1) or 2). I too would like to keep things simple. Still, the intuitive approach as a user of TextMapper would be if nonterminals belong to their respective parent, as defined in the grammar. And this would be approach 2.
Kindly, Robin
I think we can introduce a generation option and have both, with 2 being the default. With the go rewrite, we will get declarative inlining, which should make a lot of things easier.
FuncMetadata inline
: (MetadataAttachments=MetadataAttachment+ -> FuncMetadata)?
;
I tried to manually inline the instruction metadata attachments in the LLVM IR grammar. It works. I get unknown node type NONE
before inline, and successful resolution after inlining (i.e. I'm able to invoke Typ
of CallInst
).
The inlining is done by removing the InstMetadata
definition, and updating each instruction to make use of Metadata=(',' MetadataAttachment)+?
instead.
Old:
XorInst -> XorInst
: 'xor' X=TypeValue ',' Y=Value InstMetadata
;
InstMetadata -> InstMetadata
: MetadataAttachments=(',' MetadataAttachment)+?
;
New:
XorInst -> XorInst
: 'xor' X=TypeValue ',' Y=Value Metadata=(',' MetadataAttachment)+?
;
However, one point to take into consideration is that inlining InstMetadata
increased the size of the generated parsing tables considerably.
Before inlining:
$ make -C asm/ll -B
lalr: 0.223s, text: 0.517s, parser: 2167 states, 184KB
After inlining:
$ make -C asm/ll -B
lalr: 0.261s, text: 0.526s, parser: 2166 states, 225KB
So, the increase in table size from this inlining alone is 22%. Just raising this to take into consideration for the implementation of the declarative inlining in the Go version of TM.
Yes, inlining is not always for free and I agree that we should have an option to patch locations for nullable nonterminals. I'll work on this.
Just to add to the discussion, if approach 2 tracks the parent of each non-terminal, perhaps such a solution could also allow for non-terminals to appear at different locations within a production rule. Once more, approach 1 would favour simplicity (and speed) whereas approach 2 could favour intuitive use.
Essentially, as a user of Textmapper, I'd be fine with such trade-offs. There may of course be some fundamental aspects of how the shift/reduce parser operates that makes it not possible to do this for approach 2, if so I'd be glad to learn about them.
For a concrete example of the above that hit us today, see the following grammar:
FuncDecl -> FuncDecl
: 'declare' Metadata=MetadataAttachment* Header=FuncHeader
| 'define' Header=FuncHeader Metadata=MetadataAttachment* Body=FuncBody
;
Which results in the following Textmapper error:
ll.tm,966: named elements must occur in the same order in all productions
Cheers, Robin