sm64
Hardware sqrt helper functions
In #32 it was shown that the hardware sqrt operation can replace a software one in nds_renderer.c. The default libnds sqrt function has extra checks that are unnecessary, and it does not support async computation. Libnds' sqrt checks that the hardware divider is not busy before sending the value. The operation only takes ~30 main bus cycles according to my testing, so unless you have two hardware divides back to back, there is no reason to check this value twice. BlockSDS uses a simpler approach: it has async functions for sending values to the hardware math coprocessor, which means we can use the CPU while waiting for the hardware math to complete. In testing, I replaced the sqrt call referenced in #32 with an async send, followed by the (s > 0) check, and only then a wait for the hardware to finish the operation. This saves 10-20 microseconds per frame.
// Normalize the result
int s = (lights[i].nx * lights[i].nx + lights[i].ny * lights[i].ny + lights[i].nz * lights[i].nz) >> 8;
// Send the square root operand to the hardware before comparing (s > 0); this saves 10-20 microseconds.
// devkitPro's libnds does not have helper functions for async hardware math, so this should be
// put into a helper function.
REG_SQRTCNT = SQRT_64;
REG_SQRT_PARAM = (s64)s << 16;
if (s > 0) {
    while (REG_SQRTCNT & SQRT_BUSY);
    s = REG_SQRT_RESULT;
    lights[i].nx = (lights[i].nx << 16) / s;
    lights[i].ny = (lights[i].ny << 16) / s;
    lights[i].nz = (lights[i].nz << 16) / s;
}
If we do not switch to BlockSDS, I propose we at least put these functions in a header file to interoperate better with the hardware. The question I have is where this function should go so that it can be used by more than just nds_renderer.c if it turns out to be useful later on. Should it be a function or a preprocessor define: #define sqrt_asynch(x) ...?
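For the function-versus-macro question, here is a minimal sketch of what such a header might look like using static inline functions, which inline like a macro but keep type checking. The names hw_sqrt64_start and hw_sqrt64_wait are hypothetical, and the non-ARM9 branch only defines stand-in variables so the sketch compiles off-target; it assumes the build defines ARM9 on the DS, as the usual libnds makefiles do.

```c
#include <stdint.h>

#ifdef ARM9
#include <nds.h>  /* REG_SQRTCNT, REG_SQRT_PARAM, REG_SQRT_RESULT, SQRT_64, SQRT_BUSY */
#else
/* Host-side stand-ins (assumption, NOT the real registers): plain variables
 * so the helpers can be compiled and exercised without DS hardware. */
static volatile uint16_t REG_SQRTCNT;
static volatile uint64_t REG_SQRT_PARAM;
static volatile uint32_t REG_SQRT_RESULT;
#define SQRT_64   1
#define SQRT_BUSY (1 << 15)
#endif

/* Hypothetical helper: select 64-bit mode and send the operand.
 * Returns immediately so the CPU can do other work in the meantime. */
static inline void hw_sqrt64_start(uint64_t x)
{
    REG_SQRTCNT = SQRT_64;
    REG_SQRT_PARAM = x;
}

/* Hypothetical helper: spin until the unit is idle, then read the result. */
static inline uint32_t hw_sqrt64_wait(void)
{
    while (REG_SQRTCNT & SQRT_BUSY) {
        /* busy-wait */
    }
    return REG_SQRT_RESULT;
}
```

With helpers like these, the normalization above would call hw_sqrt64_start((uint64_t)s << 16) before the (s > 0) check and s = hw_sqrt64_wait() inside it. On the host the stand-in variables do not compute anything, so this only demonstrates the call structure.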
I think it's fine to do it inline in this case, and maybe consider a function if it becomes necessary. Though to be honest, the performance difference is so small that I don't think it really matters.
Related to this, I wrote a function that could use the hardware square root to compute a floating-point square root. It may be worth overriding the sqrtf used by <math.h>. Where would be a good place to put the implementation?
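f32 fsqrt(f32 x) {
    union { f32 f; u32 i; } xu;
    xu.f = x;
    // Grab the biased exponent (left in place at bits 23..30)
    s32 exponent = (xu.i & (0xff << 23));
    if (exponent == 0) return 0.0; // zero and denormals return 0
    exponent = exponent - (127 << 23); // remove the bias
    exponent = exponent >> 1; // halve the exponent; right shift on a negative number depends on the compiler
    // Mantissa with the implicit leading 1, scaled so that its square root has 24 bits
    u64 mantissa = xu.i & ((1 << 23) - 1);
    mantissa = (mantissa + (1 << 23)) << 23;
    // Odd exponent: the shifted-out bit lands in bit 22, so fold the factor of 2 into the mantissa
    if ((exponent & (1 << 22)) > 0) {
        mantissa = mantissa << 1;
    }
    u32 new_mantissa = (u32) sqrt(mantissa); // modify this line to use hardware sqrt
    xu.i = ((exponent + (127 << 23)) & (0xff << 23)) | (new_mantissa & ((1 << 23) - 1));
    return xu.f;
}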
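As a rough illustration of how the marked line could take an integer square root instead of the double-precision sqrt, here is a host-testable variant. The names fsqrt_sketch and isqrt64 are hypothetical; isqrt64 is a software stand-in for the 64-bit hardware sqrt and would be replaced on the DS by the register sequence shown earlier. The sketch works on the unbiased exponent as a plain integer, which sidesteps the right shift of a negative value. Like the function above, it ignores the sign bit, truncates instead of rounding, and does not handle infinities or NaNs.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for the 64-bit hardware sqrt (assumption): floor(sqrt(v)),
 * computed with the classic bit-by-bit method. */
static uint32_t isqrt64(uint64_t v)
{
    uint64_t res = 0;
    uint64_t bit = (uint64_t)1 << 62;   /* highest power of four */
    while (bit > v) bit >>= 2;
    while (bit != 0) {
        if (v >= res + bit) {
            v -= res + bit;
            res = (res >> 1) + bit;
        } else {
            res >>= 1;
        }
        bit >>= 2;
    }
    return (uint32_t)res;
}

float fsqrt_sketch(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);               /* read the float's bit pattern */

    int32_t e = (int32_t)((bits >> 23) & 0xff);   /* biased exponent */
    if (e == 0) return 0.0f;                      /* zero and denormals return 0 */
    e -= 127;                                     /* unbiased exponent */

    /* 1.mantissa scaled into [2^46, 2^47) so the root lands in [2^23, 2^24) */
    uint64_t m = ((uint64_t)((bits & 0x7fffff) | 0x800000)) << 23;
    if (e & 1) {                                  /* odd exponent: move a factor of 2 */
        m <<= 1;                                  /* into the mantissa ...            */
        e -= 1;                                   /* ... so the exponent is even      */
    }

    uint32_t r = isqrt64(m);                      /* hardware sqrt would go here */
    uint32_t out = ((uint32_t)(e / 2 + 127) << 23) | (r & 0x7fffff);
    memcpy(&x, &out, sizeof x);
    return x;
}
```

Perfect squares that are exactly representable, such as 4.0f or 0.25f, come back exact; other inputs may differ from sqrtf in the last bit because the integer root is truncated.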