gigatron-rom
gigatron-rom copied to clipboard
lcc: compiler hangs on complex expression
int f(int a)
{
return 16807 * (a % 27773) - 2836 * (a / 27773);
}
Simplification that still hangs the compiler:
int main(int a)
{
return 1 * (a / 2);
}
We seem stuck in a never ending recursion:
ASGNI2(ADDRLP2(29), LOAD(INDIRI2(VREGP(vAC))))
dumpcover(1a005e00) = stmt: ASGNI2(ADDRLP2,LOADI2(INDIRI2(VREGP))) / (action)
(listing 1a005e00)
(rallocing 1a005e00)
(replacing 1a005130 with a reload from 56)
(genreload: INDIRI2(ADDRLP2(27)))
INDIRI2(ADDRLP2(29))
dumpcover(1a006028) = stmt: reg / #
dumpcover(1a006028) = reg: INDIRI2(ADDRLP2) / (action)
(reprune changes 190071a0 from 1a005130 to 1a006028)
(inserting load from vAC to r1)
(targeting eb98b078->x.kids[0]=1a0060e8 to r1)
(listing 1a006028)
(freeing vAC)
(free[0]=ffffffff)
(free[1]=ffffffff)
(allocating vAC to node 1a0059c0)
(free[0]=fffffffe)
(free[1]=ffffffff)
(rallocing 1a006028)
local 30 of size 2 @ offset 58
(spilling vAC to local 58)
(genspill: INDIRI2(ADDRLP2(28)))
(targeting 1a006348->x.kids[0]=1a0062b8 to vAC)
(inserting load from vAC to vAC)
(targeting 1a006468->x.kids[1]=1a0064f8 to vAC)
ASGNI2(ADDRLP2(30), LOAD(INDIRI2(VREGP(vAC))))
dumpcover(1a006468) = stmt: ASGNI2(ADDRLP2,LOADI2(INDIRI2(VREGP))) / (action)
(listing 1a006468)
(rallocing 1a006468)
(replacing 1a0059c0 with a reload from 58)
(genreload: INDIRI2(ADDRLP2(28)))
INDIRI2(ADDRLP2(30))
dumpcover(1a006690) = stmt: reg / #
dumpcover(1a006690) = reg: INDIRI2(ADDRLP2) / (action)
(reprune changes 190071a8 from 1a0059c0 to 1a006690)
(inserting load from vAC to r1)
(targeting eb98b078->x.kids[0]=1a006750 to r1)
(listing 1a006690)
(freeing vAC)
(free[0]=ffffffff)
(free[1]=ffffffff)
(allocating vAC to node 1a006028)
(free[0]=fffffffe)
(free[1]=ffffffff)
(rallocing 1a006690)
local 31 of size 2 @ offset 60
(spilling vAC to local 60)
(genspill: INDIRI2(ADDRLP2(29)))
(targeting 1a0069b0->x.kids[0]=1a006920 to vAC)
(inserting load from vAC to vAC)
(targeting 1a006ad0->x.kids[1]=1a006b60 to vAC)
ASGNI2(ADDRLP2(31), LOAD(INDIRI2(VREGP(vAC))))
dumpcover(1a006ad0) = stmt: ASGNI2(ADDRLP2,LOADI2(INDIRI2(VREGP))) / (action)
This looks funny:
(inserting load from vAC to vAC)
These vAC's are different symbol objects with the same name. The rightmost is created by mkreg(). The left one by genspill(). "Our" inset() treats them as different. (But so does the original LCC code before inset() was added.)
The same hangup issue occurs with the shift operators:
int main(int a)
{
return 1 << (a >> 2);
}
Interestingly, these operators get the same treatment in gt1.c:
case LSH:
case RSH:
// LSH reg, [1,8] are unary operators. LSH reg, reg is a helper call.
if (generic(p->kids[1]->op) == CNST) {
unsigned amt = p->kids[1]->syms[0]->u.c.v.u;
if ((generic(p->op) == LSH && amt <= 4) || amt == 8) {
rtarget(p, 0, wregs[0]);
break;
}
}
// fallthrough
case MUL:
case MOD:
case DIV:
// Helper calls. We can pick the targets to help avoid spills. We'll put the LHS in r0 and the RHS in vAC.
// Note that we first target the LHS to vAC in order to ensure that a copy-through-vAC is inserted.
rtarget(p, 0, wregs[0]);
rtarget(p, 0, wregs[1]);
rtarget(p, 1, wregs[0]);
break;
This is getting a bit outside my comfort zone. @pgavlin, is this related to #73 in some way?
If I replace
int inset(Symbol rs, Symbol r) {
int i;
if (!rs->x.wildcard) {
return rs == r;
}
with
int inset(Symbol rs, Symbol r) {
int i;
if (!rs->x.wildcard) {
return !strcmp(rs->name, r->name);
}
then the weird 'vAC to vAC' messages are gone. But the compiler still enters an unbounded recursion :-(
LCC gets in an infinite loop of spilling and reloading vAC. Initially it spills vAC at some point, which is fine and understandable. Later it reloads, but at that point vAC is allocated, so it wants to spill it again. etc
(The message “inserting load from vAC to vAC” looks like a red herring we can ignore for now.)
LCC gets in an infinite loop of spilling and reloading vAC. Initially it spills vAC at some point, which is fine and understandable. Later it reloads, but at that point vAC is allocated, so it wants to spill it again. etc
Yep. I attempted to address this by extending the RA's support for register sets to include partial sets as well as complete sets and ensuring that all nodes that require one operand in vAC require the other operand in anything besides vAC. My guess is that I missed a spot somewhere.
It would be great to have a generalized solution for the infinite spill/reload cycle, though I'm not sure what that would be. I believe that the approach I described above will work, but it is more ad-hoc than I would like.
It’s a nice brain teaser now that I understand what happens. It indeed begs for a simple generalised solution, but I don't "see" it (yet).
int main(void)
{
int a, b, c;
return a * (b / c);
}
Annotated simplified debug output
(rallocing 4b80d298 = LOAD(INDIRI2(ADDRLP2(a))) ) r1 <- vAC <- argument a
(spilling r1 to local 1) spilling r1, because need again for compiling DIV
ASGNI2(ADDRLP2(1), LOAD(INDIRI2(VREGP(r1)))) local 1 <- vAC <- r1. vAC is free, so no problem
(genreload: LOAD(INDIRI2(ADDRLP2(a)))) what's in r1
INDIRI2(ADDRLP2(1)) addressing local 1
(spilling vAC to local 2) now we spill the result of DIV, for the reload of 'a'
ASGNI2(ADDRLP2(2), INDIRI2(VREGP(vAC))) local 2 <- vAC
(genreload: DIVI2(LOAD(INDIRI2(ADDRLP2(b))), INDIRI2(ADDRLP2(c)))) what's in vAC
INDIRI2(ADDRLP2(2)) addressing local 2
(targeting ed2a6078->x.kids[0]=4b808de0 to r1) the reload of r1 (LHS for MUL) ??? See below
(allocating vAC to node 4b808760)
(rallocing 4b808d20 = INDIRI2(ADDRLP2(2)) ) the reload of vAC (RHS for MUL)
(spilling vAC to local 3) ehhh, no! vAC *should* be available but it aint!
The above annotation is not quite right. Some speculation: the MUL needs two reloads (LHS and RHS), but the compiler tries to emit them in reverse order?
What does the full tree look like post-canonicalization?
It is called only at the beginning:
post-canon: RETI2(MULI2(INDIRI2(ADDRLP2(a)), DIVI2(INDIRI2(ADDRLP2(b)), INDIRI2(ADDRLP2(c)))))
But after that the tree gets modified.
Ah, I always forget that the post-canon trees aren't that helpful for diagnosing RA issues.
Immediately prior to the register allocator, we should see INDIRI2(ADDRLP2(a))
replaced by either LOAD(INDIRI2(ADDRLP2(a)))
or LOAD(LOAD(INDIRI2(ADDRLP2(a))))
, and INDIRI2(ADDRLP2(b))
replaced by a similar tree. These loads should be inserted as a result of this code snippet from earlier:
case MUL:
case MOD:
case DIV:
// Helper calls. We can pick the targets to help avoid spills. We'll put the LHS in r0 and the RHS in vAC.
// Note that we first target the LHS to vAC in order to ensure that a copy-through-vAC is inserted.
rtarget(p, 0, wregs[0]);
rtarget(p, 0, wregs[1]);
rtarget(p, 1, wregs[0]);
break;
Do those LOAD
nodes--which represent vreg->vAC or vAC->vreg copiess--appear to be present?
Yes, I believe that part works as intended. I must admit that my hunt is based on partial understanding. Especially the wildcard stuff is black magic for me, and the book hasn't arrived. So it's more throwing sticks at the problem until it goes away.
BTW, I replaced this in inset() in my local copy:
if (!rs->x.wildcard) {
- return rs == r;
+ //return rs == r;
+ return rs->x.regnode == r->x.regnode;
}
It looks like a genuine bug to me also present in the original LCC. And it's probably completely harmless.
Right now I'm staring now at how spillr() operates. But I should go to sleep.
inset
is all my doing, so that's my bug. Thanks for fixing it :)
Don't worry: the original LCC expression has the same equality operator.
The reloads appear to be created in-place in the tree. That disproofs my reversal theory.
Ahh, now this is starting to make more sense. Here's the output of dumpcover
from the debug logs:
RETI2(MULI2(LOAD(INDIRI2(ADDRLP2(a))), DIVI2(LOAD(INDIRI2(ADDRLP2(b))), INDIRI2(ADDRLP2(c)))))
dumpcover(f080cdf8) = stmt: RETI2(reg) / (action)
dumpcover(f080cd68) = reg: MULI2(reg,reg) / (action)
dumpcover(f080d298) = reg: LOADI2(reg) / (action)
dumpcover(f080ca08) = reg: INDIRI2(ADDRLP2) / (action)
dumpcover(f080ccd8) = reg: DIVI2(reg,reg) / (action)
dumpcover(f080d208) = reg: LOADI2(reg) / (action)
dumpcover(f080cb28) = reg: INDIRI2(ADDRLP2) / (action)
dumpcover(f080cc48) = reg: INDIRI2(ADDRLP2) / (action)
The nodes will execute in the order produced by a left-to-right postorder visit of the tree, so that tree would execute like so:
1. INDIRI2(ADDRLP2(a)) ; load local a into vAC (req'd due to the ldloc helper)
2. LOADI2(reg) ; copy vAC into r1 for the eventual helper call to mul
3. INDIRI2(ADDRLP2(b)) ; load local b into vAC (req'd due to the ldloc helper)
4. LOADI2(reg) ; copy vAC into r1 for the eventual helper call to div
5. INDIRI2(ADDRLP2(c)) ; load local c into vAC (req'd due to the ldloc helper)
6. DIVI2(reg, reg) ; call the div helper, with the result in vAC
7. MULI2(reg, reg) ; call the mul helper, with the result in vAC
Because the result of (2) is still live in r1
at (4), r1
must be spilled. lcc
attacks the spill problem by creating a tree of the form ASGNI2(ADDRLP2(spill temp), LOAD(INDIRI2(VREGP(spillee))))
at the site of the spill, a tree of the form INDIRI2(ADDRLP2(1))
at the site of the reload, and running these through the usual codegen pipeline. I added patterns to recognize the spill code, but the reloads are so generic that I don't think it's possible to match and handle them specially. As a result, the usual targeting kicks in and asks for vAC all over the place, which triggers more spills and reloads ad nauseum.
It is possible that we will need to change the lcc spiller to do something gt1-specific. I'm not sure that the architecture-independent spiller will actually work out.
We may want to try adding a RELOAD
node to the lcc
IR. We could then pattern-match against this node and use the additional information to generate appropriate code that avoids trashing vAC when reloading a virtual register.
I goes wrong when it wants to inject a 3rd spill. The debug output is hard to read for me, even having stared at it for almost 20 hours now. Conceptually, there's no reason to use more than 2 temporary variables. And even if vAC is constantly allocated and freed, there shouldn't arise a lock if all goes in the right order. But somehow it does...
This is the sequence I think should be emitted. There is no need for special treatment when tracking the occupancy of vAC. From what I can see, the compiler is pretty close to emitting this. It just trips over its own feet at some point. Maybe it is releasing vAC just a bit later than really necessary. Something small.
1a. vAC <- local a
1b. temp 1 <- vAC
2a. vAC <- local b
2b. r1 <- vAC
3. vAC <- local c
4. call mul
5. temp 2 <- vAC
6a. vAC <- temp 1
6b. r1 <- vAC
7. vAC <- temp 2
8. call div
From the logs, I think that the RA is creating this sequence when it spills r1:
1. INDIRI2(ADDRLP2(a)) ; load local a into vAC (req'd due to the ldloc helper)
2. LOADI2(reg) ; copy vAC into r1 for the eventual helper call to mul
2a. ASGNI2(ADDRLP2(1), LOAD(INDIRI2(VREGP(r1)))) ; spill r1 to t1
3. INDIRI2(ADDRLP2(b)) ; load local b into vAC (req'd due to the ldloc helper)
4. LOADI2(reg) ; copy vAC into r1 for the eventual helper call to div
5. INDIRI2(ADDRLP2(c)) ; load local c into vAC (req'd due to the ldloc helper)
6. DIVI2(reg, reg) ; call the div helper, with the result in vAC
2b. INDIRI2(ADDRLP2(1)) ; reload t1 into vAC (req'd due to the ldloc helper)
7. MULI2(reg, reg) ; call the mul helper, with the result in vAC
(2b) forces a spill of vAC, so the sequence then changes to this:
1. INDIRI2(ADDRLP2(a)) ; load local a into vAC (req'd due to the ldloc helper)
2. LOADI2(reg) ; copy vAC into r1 for the eventual helper call to mul
2a. ASGNI2(ADDRLP2(1), LOAD(INDIRI2(VREGP(r1)))) ; spill r1 to t1
3. INDIRI2(ADDRLP2(b)) ; load local b into vAC (req'd due to the ldloc helper)
4. LOADI2(reg) ; copy vAC into r1 for the eventual helper call to div
5. INDIRI2(ADDRLP2(c)) ; load local c into vAC (req'd due to the ldloc helper)
6. DIVI2(reg, reg) ; call the div helper, with the result in vAC
6a. ASGNI2(ADDRLP2(2), LOAD(INDIRI2(VREGP(vAC)))) ; spil vAC to t2
2b. INDIRI2(ADDRLP2(1)) ; reload t1 into vAC (req'd due to the ldloc helper)
6b. INDIRI2(ADDRLP2(2)) ; reload t2 into vAC (req'd due to the ldloc helper)
7. MULI2(reg, reg) ; call the mul helper, with the result in vAC
This sequence contains an unresolvable register conflict: (2b) and (6b) both require vAC
at the same point, so the code is not allocatable. I think that what is missing is a copy of the result of (2b) into r1
. Though I am not entirely sure why that isn't being inserted, I suspect that it is because target
is not being called again on (7). I'll have to check the book for a better explanation of the difference between kids
and x.kids
and the behavior of genreload
.
You're onto something. The genreload
seems to mix something up. I now also dump the global 'head' tree at the beginning and end of genreload
. At the critical stage, it transforms the head
from
RETI2(MULI2(INDIRI2(ADDRLP2(1)), DIVI2(LOAD(INDIRI2(ADDRLP2(b))), INDIRI2(ADDRLP2(c)))))
into this:
RETI2(MULI2(INDIRI2(ADDRLP2(1)), INDIRI2(ADDRLP2(2))))
But here we expect a LOAD from vAC to r1 in the first argument to MUL and it isn't there. (Your 2b line).
However, during this transformation there is this line being logged:
(inserting load from vAC to r1)
(targeting ed6d4098->x.kids[0]=8a808de0 to r1)
From this I assumed all the time it was doing the right thing. But we don't see that LOAD node actually appear in the tree. This debug logging is so hard to follow...
Here is reprune
, which is called to refill the contents of kids
with the appropriate entries from x.kids
:
static int reprune(Node *pp, int k, int n, Node p) {
struct node x, *q = *pp;
if (q == NULL || k > n)
return k;
else if (q->x.inst == 0)
return reprune(&q->kids[1],
reprune(&q->kids[0], k, n, p), n, p);
if (k == n) {
debug(fprint(stderr, "(reprune changes %x from %x to %x)\n", pp, *pp, p->x.kids[n]));
*pp = p->x.kids[n];
x = *p;
(IR->x.target)(&x);
}
return k + 1;
}
The call to target
is what is triggering those inserting load/targeting
log entries. The copies are being created, but they are immediately discarded due to the fact that the parent node (in our case, the MUL
) is copied to a local variable which is then passed to target
. Simply removing that copy seems like it is the first step towards fixing this issue, but it is certainly not the last: anytime a new node is inserted, it needs to be passed through rewrite
. However, that may not be as simple as it sounds, as rewrite
is recursive and already-rewritten nodes are not really supposed to be run through rewrite
a second time as per an assertion in that function.
A set of changes that fixes this specific instance is here: https://github.com/kervinck/gigatron-rom/compare/master...pgavlin:pgavlin/spill
I am not terribly confident in these changes, however. I think that the result of target
needs to be checked so both kids
can be labeled and reduced as necessary.
I have some doubts as well. It feels not quite right to protrude Gigatron changes outside gt1.md all the way into lburg...
The lburg
changes are pretty innocuous. I had already made some target-independent changes to make it possible to generate code from callbacks based on a pattern match; the changes in that branch simply enhance the logging to make it clearer which callback is being invoked.
This small set of edits is the fix: https://github.com/kervinck/gigatron-rom/compare/master...pgavlin:pgavlin/spill#diff-26094ddff7f8a9016fd6d0f421cb26d2R830
The fix should be target-independent as well, but I'm not totally confident in that.
For my understanding (my book hasn't arrived yet, and I dearly start to need it):
- what's the difference between 'action' and 'emit2()'? Are these called at a different stage?
- in the patch I spotted it can now generate moves through 'ht'. Does it generate such moves now for this specific case? (Because I think it's never needed as long as vAC is the last argument to such operators).
I tried to test, but when I clone your repo en checkout 'spill', 'make lcc' works but 'make ctest' crashes:
make ctest
Utils/lcc/build/lcc -ILibs -c Libs/Example.c -o Libs/Example.o
Utils/lcc/build/lcc -ILibs -c Libs/Gigatron/BusyWait.c -o Libs/Gigatron/BusyWait.o
Utils/lcc/build/lcc -ILibs -c Libs/Gigatron/ClearScreen.c -o Libs/Gigatron/ClearScreen.o
Utils/lcc/build/lcc -ILibs -c Libs/Gigatron/Newline.c -o Libs/Gigatron/Newline.o
Utils/lcc/build/lcc -ILibs -c Libs/Gigatron/PutChar.c -o Libs/Gigatron/PutChar.o
Utils/lcc/build/lcc -ILibs -c Libs/Gigatron/Random.c -o Libs/Gigatron/Random.o
Utils/lcc/build/lcc -ILibs -c Libs/Gigatron/WaitKey.c -o Libs/Gigatron/WaitKey.o
Utils/lcc/build/lcc -ILibs -c Libs/ctype/isdigit.c -o Libs/ctype/isdigit.o
Utils/lcc/build/lcc -ILibs -c Libs/errno/errno.c -o Libs/errno/errno.o
Utils/lcc/build/lcc -ILibs -c Libs/stdio/fclose.c -o Libs/stdio/fclose.o
Utils/lcc/build/lcc: fatal error in Utils/lcc/build/rcc
make: *** [Libs/stdio/fclose.o] Error 1
I will be offline for a couple of days for surgery. Please feel free to commit to the main repo if you think it's good. You're the best judge of that :-)
The book was supposed to arrive today, but it didn't. I've been doing some testing instead, focusing on the effect of just the proposed reprune() change for now (that seems to be the meat of the fix):
int main(void)
{
int a, b, c;
return a * (b / c);
}
Gives:
asm.defun('_main')
asm.push()
asm.ldwi(0x2)
asm.call('enter')
asm.ldw('sp')
asm.subi(10)
asm.stw('sp')
asm.ldi(2)
asm.call('ldloc')
asm.stw('r1')
asm.stw('ha') <-- spilling r1, but how does it know vAC equals r1?
asm.ldi(8)
asm.call('stloc')
asm.ldi(4)
asm.call('ldloc')
asm.stw('r1')
asm.ldi(6)
asm.call('ldloc')
asm.call('div')
asm.stw('ha')
asm.ldi(10)
asm.call('stloc')
asm.ldi(8)
asm.call('ldloc')
asm.stw('r1')
asm.ldi(10)
asm.call('ldloc')
asm.call('mul')
asm.stw('rv')
asm.label('.L1')
asm.ldi(10)
asm.addw('sp')
asm.stw('sp')
asm.ldwi(0x8000)
asm.call('leave')
asm.ldw('rv')
asm.popret()
The above works, and is roughly what you expect. The asm.stw('r1')
is unneeded, but fine, I can imagine why it puts it there. However, I mistrust the code that follows and spills r1, annotated in the above. Even if it works out here, it looks "funny": I would expect a simple and naive asm.ldw('r1')
in there before the asm.stw('ha')
etc. But this ldw() isn't there. My confusion is "How does it know at that point, when spilling r1, that vAC already contains the value of r1?". It looks like a shortcut I haven't seen it take elsewhere. Hence my mistrust.
Just for laughs, the following from #77 results in a spilling fest.
int main(int c)
{
if (c == -1)
return 1;
}
It generates no less than 3 temporary variables.
asm.defun('_main')
asm.push()
asm.ldwi(0x80)
asm.call('enter')
asm.ldw('sp')
asm.subi(6)
asm.stw('sp')
asm.ldi(10)
asm.call('ldloc')
asm.stw('ha')
asm.ldi(2)
asm.call('stloc') <-- one
asm.ldwi(-1)
asm.stw('ha')
asm.ldi(4)
asm.call('stloc') <-- two
asm.ldi(2)
asm.call('ldloc')
asm.stw('ha')
asm.ldi(6)
asm.call('stloc') <-- three
asm.ldi(4)
asm.call('ldloc')
asm.stw('r7')
asm.ldi(6)
asm.call('ldloc')
asm.subw('r7')
asm.jne('.L2')
asm.ldi(1) <-- return 1
asm.stw('rv')
asm.ldwi('.L1')
asm.jr()
asm.label('.L2')
asm.ldi(0) <-- an overzealous LCC. It could just as well leave 'rv' uninitialised.
asm.stw('rv')
asm.label('.L1')
asm.ldi(6)
asm.addw('sp')
asm.stw('sp')
asm.ldwi(0x200)
asm.call('leave')
asm.ldi(2)
asm.addw('sp')
asm.stw('sp')
asm.ldw('rv')
asm.popret(
One temp is to store the constant -1. The others for getting the operands for subtraction in the right order. Ironically, the order doesn't matter, because we're only testing for equality. One mitigation is to replace SUB with BXOR here. (I already have that prepared in another workspace).
I found this article by Fraser and Hanson on LCC's register spilling: Simple Register Spilling in a Retargetable Compiler http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.6848&rep=rep1&type=pdf
This might be helpful to understand what is going on in gen.c
This is a nice reference article on the code generation: The lcc 4.x Code-Generation Interface https://drhanson.s3.amazonaws.com/storage/documents/interface4.pdf
And here a report on retargeting LCC to a DSP. It might be of interest as they had to overcome register constraints as well. LCC is introduced in chapter 4. Retargeting a C Compiler for a DSP Processor http://www.diva-portal.org/smash/get/diva2:19902/FULLTEXT01