cuda-api-wrappers
Thin, unified, C++-flavored wrappers for the CUDA APIs
Even the most basic "fancy iterator" from Thrust ~~`thrust::constant_iterator`~~ `thrust::counting_iterator` doesn't fulfill the requirement, making algorithms written using cuda-api-wrappers for launching kernels less flexible. constant_iterator constructor ```cuda __host__ __device__ constant_iterator(constant_iterator...
NVIDIA, in their infinite wisdom, have decided to kind-of-clone the `cuLink*` driver API functions into `nvJitLink*` functions, doing basically the same thing but with LTO support. Not sure why they...
The CUDA samples directory has a sample named cuHook, intended for dynamic loading via LD_PRELOAD, which lets you install pre-hooks and post-hooks for CUDA driver calls. Perhaps we should add...
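To illustrate the pre-hook/post-hook idea mentioned above, here is a minimal sketch in plain C++. It only shows the dispatch order (pre-hook, wrapped call, post-hook); it does not perform LD_PRELOAD interposition and does not touch the CUDA driver. All names in it (`fake_driver_call`, `hook_pair`, `call_with_hooks`, `run_demo`) are hypothetical and are not part of cuHook, CUDA, or cuda-api-wrappers.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a driver API function; returns 0 on success.
// It is not part of CUDA and exists only to keep the sketch self-contained.
int fake_driver_call(int value, std::vector<std::string>& log) {
    log.push_back("call(" + std::to_string(value) + ")");
    return 0;
}

// Optional callbacks to run before and after a wrapped call (assumed design).
struct hook_pair {
    std::function<void()> pre;
    std::function<void(int /* status */)> post;
};

// Invoke the pre-hook (if set), then the wrapped function, then the
// post-hook (if set) with the status the wrapped function returned.
template <typename F, typename... Args>
int call_with_hooks(const hook_pair& hooks, F&& wrapped, Args&&... args) {
    if (hooks.pre) { hooks.pre(); }
    int status = std::forward<F>(wrapped)(std::forward<Args>(args)...);
    if (hooks.post) { hooks.post(status); }
    return status;
}

// Demonstration: records the order in which the hooks and the call run.
std::vector<std::string> run_demo() {
    std::vector<std::string> log;
    hook_pair hooks;
    hooks.pre = [&log] { log.push_back("pre"); };
    hooks.post = [&log](int status) {
        log.push_back("post(" + std::to_string(status) + ")");
    };
    call_with_hooks(hooks, fake_driver_call, 42, log);
    return log;
}
```

Calling `run_demo()` returns the recorded sequence of events. A wrapper-level feature along these lines would presumably key the hooks by driver function rather than pass them explicitly, but that design is an assumption and not something the issue specifies.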
At the moment, we don't properly support empty execution graph nodes on our graph_support branch. But those do exist and can be inserted.
At the moment, we use a custom command to generate fatbins from compiled objects (e.g. [here](https://github.com/eyalroz/cuda-api-wrappers/blob/master/examples/CMakeLists.txt#L112)). CMake 3.27 has introduced a built-in mechanism for generating them, described [here](https://cmake.org/cmake/help/v3.27/prop_tgt/CUDA_FATBIN_COMPILATION.html#prop_tgt:CUDA_FATBIN_COMPILATION). Let's switch...
In the `streamOrderedAllocationIPC` example program, we sometimes(/always) get an error when importing a mempool exported for IPC (from a "shareable handle"). Unfortunately, it's an "unknown error", which the `cuMemPoolImportFromShareableHandle()` function...
When I run vectorAddMMAP on a machine with multiple Ampere GPUs which have P2P access, I get: ``` terminate called after throwing an instance of 'cuda::runtime_error' what(): Failed setting the access...
With the Hopper architecture, NVIDIA has introduced "clusters" of blocks which can use each other's shared memory. The clustering can be set either using a `__cluster_dims__(1,2,3)` qualifier in the kernel's...
CUDA execution graph templates can be created in one of two ways: explicit construction, or capture of operations enqueued via a stream. I'm currently working on the explicit construction API...
See [this discussion](https://stackoverflow.com/questions/75498717/whats-the-replacement-for-cumodulegetsurfref-and-cumodulegettexref/75500158#75500158). We should make it possible to use texture/surface/tensor objects, and also avoid the old APIs with CUDA 12 and later.