PyCUDA: option to precompile cuRAND and GPUArray functions

Hi,

I want to remove the requirement to have MSVC and NVCC compilers available in the runtime environment so I can distribute a program I'm writing in pycuda. I've managed to compile my custom kernels into .fatbin files and import them using module_from_buffer.

However, it looks like some other PyCUDA functions still generate and compile CUDA kernels at runtime. Specifically, I'm having trouble with the cuRAND integration (pycuda.curandom) as well as the GPUArray.fill(x) function. Presumably many more of the GPUArray helper functions have the same problem.

Is there a way to package the kernels used by these functions into .fatbin files and rely on those files rather than on runtime compilation? Alternatively, what code changes to PyCUDA would be required to support this?

Thanks!

Aug 27 '21 14:08 carsonswope

I suspect the most fruitful approach would be to modify the kernel caching layer to support this. Maybe allow setting a mode in which all used kernels are "recorded". (This would have to be enabled at context creation time; otherwise some kernels may already be loaded and would get missed.) The recording would then generate the appropriate fat binary files that can be shipped with an application, likely stored as a cache keyed on the provided source code (the keys would have to be free of collisions). IMO, this would allow for minimal interface changes on the application side while avoiding a hard dependency on the compiler at runtime.
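A minimal sketch of what such a "recording" store could look like, assuming a content-addressed layout on disk. The names `kernel_key` and `KernelStore` are hypothetical and are not part of PyCUDA; hooking this into PyCUDA's actual caching layer is left out, and the binary is treated as opaque bytes.

```python
import hashlib
import tempfile
from pathlib import Path


def kernel_key(source, arch, options=()):
    """Derive a key from the kernel source, target architecture and flags."""
    h = hashlib.sha256()
    for part in (source, arch, *options):
        data = part.encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") differ.
        h.update(len(data).to_bytes(8, "little"))
        h.update(data)
    return h.hexdigest()


class KernelStore:
    """Hypothetical on-disk store: one precompiled binary per key."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, key):
        return self.root / f"{key}.fatbin"

    def record(self, source, arch, binary, options=()):
        """Save a compiled binary (done on the build machine)."""
        key = kernel_key(source, arch, options)
        self._path(key).write_bytes(binary)
        return key

    def lookup(self, source, arch, options=()):
        """Return the stored binary, or None if it was never recorded."""
        path = self._path(kernel_key(source, arch, options))
        return path.read_bytes() if path.exists() else None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = KernelStore(tmp)
        src = 'extern "C" __global__ void fill(float *x, float v) {}'
        store.record(src, "sm_75", b"placeholder-binary")
        print(store.lookup(src, "sm_75") is not None)
        print(store.lookup(src, "sm_86") is not None)
```

On the deployment side, a lookup that returns bytes could be handed to module_from_buffer, while a miss would fall back to runtime compilation or raise an error if no compiler is available.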

You could also revive this NVRTC patch set and base your work on that; then you'd only need to save PTX. (Though, to be fair, the generated PTX might/will still vary by architecture.)

Aug 27 '21 18:08 inducer

Thanks for the quick response! I will look into setting up a kernel caching layer. Ideally it would also work the same way for custom kernels compiled with SourceModule. The only issue I see with keying the cache off the provided source code is the use of include statements and other preprocessor directives. I'm including the C++ library GLM as well as a few of my own utils.hpp files in my kernels, so if anything in those dependencies changes, the cache wouldn't be properly invalidated unless it is keyed off the full preprocessor output. That, however, would require access to the compiler, which is what we are trying to avoid. It seems like an edge case, but I'm wondering if you have any ideas about that.
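One compiler-free approximation, sketched below, is to fold the text of every quoted header reachable from the kernel source into the key. The function names and the `include_dirs` parameter are hypothetical. The sketch only follows `#include "..."` lines, resolves them against the given directories (not relative to the including file), and ignores angle-bracket includes, macros and conditional compilation, so it is not equivalent to hashing the real preprocessor output.

```python
import hashlib
import re
from pathlib import Path

# Matches lines such as:  #include "utils.hpp"
_INCLUDE_RE = re.compile(r'^\s*#\s*include\s*"([^"]+)"', re.MULTILINE)


def collect_includes(source, include_dirs, seen=None):
    """Return {resolved_path: text} for each quoted header reachable from source."""
    seen = {} if seen is None else seen
    for name in _INCLUDE_RE.findall(source):
        for directory in include_dirs:
            candidate = (Path(directory) / name).resolve()
            if candidate.is_file():
                if candidate not in seen:
                    text = candidate.read_text(encoding="utf-8")
                    seen[candidate] = text
                    # Follow nested includes; `seen` guards against cycles.
                    collect_includes(text, include_dirs, seen)
                break  # first matching directory wins
    return seen


def source_fingerprint(source, include_dirs=()):
    """Hash the kernel source together with the headers it pulls in."""
    h = hashlib.sha256(source.encode("utf-8"))
    headers = collect_includes(source, include_dirs)
    for path in sorted(headers):
        # Separate entries so header boundaries affect the digest.
        h.update(b"\0" + path.name.encode("utf-8") + b"\0")
        h.update(headers[path].encode("utf-8"))
    return h.hexdigest()
```

With this, editing utils.hpp changes the fingerprint of every kernel that includes it, even though the kernel source string itself is unchanged.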

[edit: okay, never mind about the cache-busting logic. The important functionality is less a cache and more the ability to record and store binaries for all kernels compiled by a given application.]

Is this something you'd be interested in accepting as a PR?

(NVRTC looks interesting, but still requires users to have nvcc installed, which makes it not ideal for me)

Aug 30 '21 15:08 carsonswope