
Kernel reuse

Open stolk opened this issue 4 years ago • 4 comments

It's unclear to me whether CUDA kernels can ever be reused.

It seems the CUDA code is compiled every time, even if I call the same code (with different data) every display frame.

stolk avatar Jun 08 '20 06:06 stolk

They are reused if your computation results in the same assembly code. You can see this by increasing the debug level a bit and checking whether there are "cache hit" messages. One thing to avoid is literal constants that change from iteration to iteration, since they lead to different PTX code being generated.

wjakob avatar Jun 08 '20 08:06 wjakob

If you give a small example of your problematic code, it will be easier to give feedback btw.

wjakob avatar Jun 08 '20 08:06 wjakob

I was wrong about there being no caching: cuda_eval() does take less time when called repeatedly for the same code and data. It takes 170 ms when the shift value changes and 17 ms when it does not.

typedef enoki::CUDAArray<int32_t>  IV;  // Int vector
typedef enoki::CUDAArray<float>    FV;  // Flt vector

and the top-level func:

void osino_computefield(void)
{
#ifdef SHIFTED
        static float shift=0.0f;
        shift += 0.003f;
#endif
        if (!field) field = new FV;
        const int mag = BLKMAG;
        const int sz = (1<<mag);
        const int msk = sz-1;
        const int cnt = sz * sz * sz;
        const float lo = 0.5f * (sz-1);
        const IV ix = enoki::arange<IV>( cnt );
        const IV zc = ix & msk;
        const IV yc = enoki::sr<mag>(ix) & msk;
        const IV xc = enoki::sr<mag+mag>(ix) & msk;
        const float s0 = 1.000f / lo;
        const float s1 = 1.003f / lo;
        const float s2 = 1.005f / lo;
        FV x = ( FV(xc) - lo ) * s0;
        FV y = ( FV(yc) - lo ) * s1;
        FV z = ( FV(zc) - lo ) * s2;

        const FV lsq_unwarped = x*x + y*y + z*z; // 0 .. 0.25
        const FV depth = 0.25f - lsq_unwarped;
        const FV warpstrength = 0.39f + enoki::max(0, depth) * 8.2f;

        const FV wx = osino_3d(11+y, 23-z, 17+x) * warpstrength;
        const FV wy = osino_3d(19-z, 13+x, 11-y) * warpstrength;
        const FV wz = osino_3d(31+x, 41-z, 61+y) * warpstrength;

        x += wx;
        y += wy;
        z += wz;

        const FV lsq = x*x + y*y + z*z;
        const FV len = enoki::sqrt(lsq);
        const FV d = 2.0f - 4.0f * len;
#ifdef SHIFTED
        const FV v = osino_3d_4o(1.2f*x+shift,1.2f*y,1.2f*z);
#else
        const FV v = osino_3d_4o(1.2f*x,1.2f*y,1.2f*z);
#endif
        *field = enoki::clamp(v + d, -1, 1);
        TT_BEGIN("cuda_eval");
        enoki::cuda_eval(); // may return before the GPU finished executing the kernel.
        TT_END  ("cuda_eval");
}

Which makes me wonder... is there a way to change some parameters of that code without triggering what I assume is a recompile?

I use it to generate Simple Noise fields, and it would be nice if I could generate the field with different offsets without triggering a costly operation.

Also note: the manual says that cuda_eval() may return early (asynchronously), but my profiling shows that the bulk of the cycles are spent in there.

cuda_eval(): launching kernel (n=2097152, in=0, out=18, ops=4274)

stolk avatar Jun 08 '20 15:06 stolk

I see in jit.cu that the call to cuLaunchKernel() passes null for the kernelParams argument. So how are calls to the kernels parameterized?

https://github.com/mitsuba-renderer/enoki/blob/master/src/cuda/jit.cu#L1372

stolk avatar Jun 09 '20 01:06 stolk