llvm-mos unoptimized for loop iterator after library function call

	for (i = 0; i < MAX_BALLS; i++)
	{
		balls[i].x_velocity = rand16() & 0x3FF;
		balls[i].y_velocity = rand16() & 0x0FF;
	}

The iterator of the for loop compiles to this:

  LDX $31                  
  DEX                      
  STX $31                  
  LDX $31                  
  BNE main+66

It was smart about changing my increment into a decrement, but it doesn't need to get the X register involved. The optimal code would simply be:

DEC $31 BNE

The full program is attached to this forum post (ballsc.zip) https://forums.nesdev.org/viewtopic.php?p=290137#p290137 code is at line 66 in ballsc.c, and ends up at $8233 in the compiled NES binary.

This program has other for loops in it that seem fine, it seems like only this one comes out with suboptimal code.

Oct 16 '23 16:10 Memblers

If you replace rand16() with e.g. a constant 1, does the inefficiency go away?

Maybe it might be the case that the compiler can't prove that rand16() doesn't touch memory and potentially i, so it thinks it needs to store X and reload it later.

Any chance you'd be able to reduce the test case into a minimal code line test case, so that it could e.g. be verifiable on https://godbolt.org/ ? That way it will be easier to see if the issue still persists in the future when newer versions of llvm-mos come up on godbolt.

Oct 16 '23 16:10 juj

Maybe it might be the case that the compiler can't prove that rand16() doesn't touch memory and potentially i, so it thinks it needs to store X and reload it later.

Even if it needs to use the value in $31 elsewhere, shouldn't it at least be able to eliminate the 2nd ldx? (unless i was declared volatile, maybe?)

Oct 16 '23 17:10 cogwheel

Yeah, probably.. that was just a guess.

Oct 16 '23 19:10 juj

If you replace rand16() with e.g. a constant 1, does the inefficiency go away?

Good question, yeah it becomes a simple DEX / BNE in that case.

Maybe it might be the case that the compiler can't prove that rand16() doesn't touch memory and potentially i, so it thinks it needs to store X and reload it later.

neslib's rand.s is in assembly. rand16 uses X, and JSRs to rand8 twice. Both of those subroutines are defined like: .section .text.rand8,"ax",@progbits

Any chance you'd be able to reduce the test case into a minimal code line test case, so that it could e.g. be verifiable on https://godbolt.org/ ? That way it will be easier to see if the issue still persists in the future when newer versions of llvm-mos come up on godbolt.

Sure, I put a reduced version here. Well, rand16 may be a red herring, because it's in there and the iterator test suddenly looks more reasonable (DEC ZP then loads it to Y, but Y value this time is actually used at the top of the loop). https://godbolt.org/z/PTWxKvdbr

While editing it down, I noticed something strange. Code before the loop is affecting the loop, it doesn't seem like it should be. This copy has one of the neslib function calls uncommented, and uncommenting any one of those function calls seems to have this same effect. With this, now it's back to doing the LDX / DEX / STX thing (actual X value gets trashed by rand16). https://godbolt.org/z/rdoaaqheG

Oct 16 '23 19:10 Memblers

I'm also now noticing there's more inefficiency in the end of this same loop. Same code as linked above: https://godbolt.org/z/rdoaaqheG

        ldx     mos8(.Lmain_zp_stk+2)           ; 1-byte Folded Reload
        dex
        stx     mos8(.Lmain_zp_stk+2)           ; 1-byte Folded Spill
        ldx     mos8(.Lmain_zp_stk+2)           ; 1-byte Folded Reload
        bne     .LBB0_1
        ldx     #0
        rts

X was tested by BNE, X must be zero to continue, but the next instruction is a redundant LDX #0. Seems related to this issue maybe, but I could open that as a separate issue, if that would help.

Oct 16 '23 20:10 Memblers

I tried a copy/paste of the loop in question, so it runs another copy of that loop. The first copy does the unoptimized X register stuff, and the second copy uses DEC ZP. This happens after a neslib call, using any single one of them will do it. If there are no neslib calls beforehand, both for loops will use DEC ZP.

If I put another neslib call before the second copy of the loop, that makes the second copy also do the redundant X register stuff.

https://godbolt.org/z/d3n3Ezfn3

Oct 17 '23 21:10 Memblers