Kernel reuse
It's unclear to me whether CUDA kernels can ever be reused.
It seems the CUDA code is compiled every time, even when I call the same code (with different data) every display frame.
They are reused if your computation results in the same assembly code. You can see this by increasing the debug level a bit and checking whether there are "cache hit" messages. One thing to avoid is literal constants that change from iteration to iteration, since they lead to different PTX code being generated.
If you give a small example of your problematic code, it will be easier to give feedback btw.
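To illustrate why a changing literal defeats the cache, here is a toy model in plain C++. It is only a sketch under assumptions: `ToyKernelCache`, `source_with_literal`, `source_with_parameter`, and `count_compiles` are hypothetical names, the "source" strings are made up, and none of this is Enoki's actual implementation. It only models a cache keyed by the generated program text.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Toy stand-in for a JIT kernel cache keyed by generated source text.
// Hypothetical model only; this is not how Enoki is implemented internally.
struct ToyKernelCache {
    std::unordered_map<std::string, int> compiled; // source -> fake kernel id
    int compile_count = 0;

    // Returns true on a cache hit, false if a (pretend) compile was needed.
    bool lookup_or_compile(const std::string &source) {
        if (compiled.find(source) != compiled.end())
            return true;
        compiled.emplace(source, ++compile_count);
        return false;
    }
};

// Variant A: the shift is baked into the program text as a literal,
// so every new shift value yields a different cache key.
std::string source_with_literal(float shift) {
    return "v = noise(1.2*x + " + std::to_string(shift) + ", 1.2*y, 1.2*z)";
}

// Variant B: the shift is read from a kernel input at run time,
// so the program text (and the cache key) never changes.
std::string source_with_parameter() {
    return "v = noise(1.2*x + param[0], 1.2*y, 1.2*z)";
}

// Simulates `frames` display frames and reports how many compiles happened.
int count_compiles(bool use_literal, int frames) {
    assert(frames >= 0);
    ToyKernelCache cache;
    float shift = 0.0f;
    for (int i = 0; i < frames; ++i) {
        shift += 0.003f; // changes every frame, as in the SHIFTED build
        cache.lookup_or_compile(use_literal ? source_with_literal(shift)
                                            : source_with_parameter());
    }
    return cache.compile_count;
}
```

In this model, `count_compiles(true, 10)` reports 10 compiles while `count_compiles(false, 10)` reports 1: the value that changes per frame has to travel as data rather than as part of the program text for the cache key to stay stable.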
I was wrong about the caching: cuda_eval() does take less time when called repeatedly for the same code and data. It takes about 170 ms when the shift value changes, and about 17 ms when it does not.
typedef enoki::CUDAArray<int32_t> IV; // Int vector
typedef enoki::CUDAArray<float> FV; // Flt vector
and the top-level func:
void osino_computefield(void)
{
#ifdef SHIFTED
    static float shift=0.0f;
    shift += 0.003f;
#endif
    if (!field) field = new FV;
    const int mag = BLKMAG;
    const int sz = (1<<mag);
    const int msk = sz-1;
    const int cnt = sz * sz * sz;
    const float lo = 0.5f * (sz-1);
    const IV ix = enoki::arange<IV>( cnt );
    const IV zc = ix & msk;
    const IV yc = enoki::sr<mag>(ix) & msk;
    const IV xc = enoki::sr<mag+mag>(ix) & msk;
    const float s0 = 1.000f / lo;
    const float s1 = 1.003f / lo;
    const float s2 = 1.005f / lo;
    FV x = ( FV(xc) - lo ) * s0;
    FV y = ( FV(yc) - lo ) * s1;
    FV z = ( FV(zc) - lo ) * s2;
    const FV lsq_unwarped = x*x + y*y + z*z; // 0 .. 0.25
    const FV depth = 0.25f - lsq_unwarped;
    const FV warpstrength = 0.39f + enoki::max(0, depth) * 8.2f;
    const FV wx = osino_3d(11+y, 23-z, 17+x) * warpstrength;
    const FV wy = osino_3d(19-z, 13+x, 11-y) * warpstrength;
    const FV wz = osino_3d(31+x, 41-z, 61+y) * warpstrength;
    x += wx;
    y += wy;
    z += wz;
    const FV lsq = x*x + y*y + z*z;
    const FV len = enoki::sqrt(lsq);
    const FV d = 2.0f - 4.0f * len;
#ifdef SHIFTED
    const FV v = osino_3d_4o(1.2f*x+shift,1.2f*y,1.2f*z);
#else
    const FV v = osino_3d_4o(1.2f*x,1.2f*y,1.2f*z);
#endif
    *field = enoki::clamp(v + d, -1, 1);
    TT_BEGIN("cuda_eval");
    enoki::cuda_eval(); // may return before the GPU finished executing the kernel.
    TT_END ("cuda_eval");
}
Which makes me wonder: is there a way to change some parameters of that code without triggering what I assume is a recompile?
I use it to generate Simple Noise fields, and it would be nice to generate the field with different offsets without triggering a costly operation.
Also note: the manual says that cuda_eval() may return early (asynchronously), but my profiling shows that the bulk of the time is spent in there.
cuda_eval(): launching kernel (n=2097152, in=0, out=18, ops=4274)
I see in jit.cu that the call to cuLaunchKernel() passes null for the kernelParameters argument.
So how are calls to the kernels parameterized?
https://github.com/mitsuba-renderer/enoki/blob/master/src/cuda/jit.cu#L1372