unoptimized for loop iterator after library function call
for (i = 0; i < MAX_BALLS; i++)
{
balls[i].x_velocity = rand16() & 0x3FF;
balls[i].y_velocity = rand16() & 0x0FF;
}
The iterator of the for loop compiles to this:
LDX $31
DEX
STX $31
LDX $31
BNE main+66
It was smart about changing my increment into a decrement, but it doesn't need to get the X register involved. The optimal code would simply be:
DEC $31 BNE
The full program is attached to this forum post (ballsc.zip) https://forums.nesdev.org/viewtopic.php?p=290137#p290137 code is at line 66 in ballsc.c, and ends up at $8233 in the compiled NES binary.
This program has other for loops in it that seem fine, it seems like only this one comes out with suboptimal code.
If you replace rand16() with e.g. a constant 1, does the inefficiency go away?
Maybe it might be the case that the compiler can't prove that rand16() doesn't touch memory and potentially i, so it thinks it needs to store X and reload it later.
Any chance you'd be able to reduce the test case into a minimal code line test case, so that it could e.g. be verifiable on https://godbolt.org/ ? That way it will be easier to see if the issue still persists in the future when newer versions of llvm-mos come up on godbolt.
Maybe it might be the case that the compiler can't prove that
rand16()doesn't touch memory and potentiallyi, so it thinks it needs to store X and reload it later.
Even if it needs to use the value in $31 elsewhere, shouldn't it at least be able to eliminate the 2nd ldx? (unless i was declared volatile, maybe?)
Yeah, probably.. that was just a guess.
If you replace
rand16()with e.g. a constant1, does the inefficiency go away?
Good question, yeah it becomes a simple DEX / BNE in that case.
Maybe it might be the case that the compiler can't prove that
rand16()doesn't touch memory and potentiallyi, so it thinks it needs to store X and reload it later.
neslib's rand.s is in assembly. rand16 uses X, and JSRs to rand8 twice. Both of those subroutines are defined like: .section .text.rand8,"ax",@progbits
Any chance you'd be able to reduce the test case into a minimal code line test case, so that it could e.g. be verifiable on https://godbolt.org/ ? That way it will be easier to see if the issue still persists in the future when newer versions of llvm-mos come up on godbolt.
Sure, I put a reduced version here. Well, rand16 may be a red herring, because it's in there and the iterator test suddenly looks more reasonable (DEC ZP then loads it to Y, but Y value this time is actually used at the top of the loop). https://godbolt.org/z/PTWxKvdbr
While editing it down, I noticed something strange. Code before the loop is affecting the loop, it doesn't seem like it should be. This copy has one of the neslib function calls uncommented, and uncommenting any one of those function calls seems to have this same effect. With this, now it's back to doing the LDX / DEX / STX thing (actual X value gets trashed by rand16). https://godbolt.org/z/rdoaaqheG
I'm also now noticing there's more inefficiency in the end of this same loop. Same code as linked above: https://godbolt.org/z/rdoaaqheG
ldx mos8(.Lmain_zp_stk+2) ; 1-byte Folded Reload
dex
stx mos8(.Lmain_zp_stk+2) ; 1-byte Folded Spill
ldx mos8(.Lmain_zp_stk+2) ; 1-byte Folded Reload
bne .LBB0_1
ldx #0
rts
X was tested by BNE, X must be zero to continue, but the next instruction is a redundant LDX #0. Seems related to this issue maybe, but I could open that as a separate issue, if that would help.
I tried a copy/paste of the loop in question, so it runs another copy of that loop. The first copy does the unoptimized X register stuff, and the second copy uses DEC ZP. This happens after a neslib call, using any single one of them will do it. If there are no neslib calls beforehand, both for loops will use DEC ZP.
If I put another neslib call before the second copy of the loop, that makes the second copy also do the redundant X register stuff.
https://godbolt.org/z/d3n3Ezfn3