cuda-kat
CUDA kernel author's tools
`atomic::compare_and_swap()` is not covered by the unit tests right now. We need to test it. Also, consider the insights here: https://stackoverflow.com/questions/62091548/atomiccas-for-bool-implementatin
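For context on the narrow-type case: CUDA's `atomicCAS()` has no one-byte overload, so a compare-and-swap on a `bool` or other single-byte value is typically emulated with a CAS on the wider word containing it. The following is only a host-side C++ sketch of that idea, using `std::atomic` as a stand-in for device memory. The name `byte_compare_and_swap` and its signature are hypothetical and not part of cuda-kat.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper (not cuda-kat API): compare-and-swap a single byte
// inside a 32-bit word, using only a word-sized CAS.
// Returns the byte's previous value, following the convention of CUDA's atomicCAS().
inline std::uint8_t byte_compare_and_swap(
    std::atomic<std::uint32_t>& word,
    unsigned byte_index,          // 0 = least significant byte
    std::uint8_t expected,
    std::uint8_t desired)
{
    const unsigned shift = 8u * byte_index;
    const std::uint32_t mask = std::uint32_t{0xFF} << shift;
    std::uint32_t old_word = word.load();
    while (true) {
        const std::uint8_t old_byte =
            static_cast<std::uint8_t>((old_word & mask) >> shift);
        // The target byte does not match: nothing to swap.
        if (old_byte != expected) { return old_byte; }
        const std::uint32_t new_word =
            (old_word & ~mask) | (std::uint32_t{desired} << shift);
        // If another byte of the word changed meanwhile, the CAS fails,
        // old_word is refreshed with the current value, and we retry.
        if (word.compare_exchange_weak(old_word, new_word)) { return old_byte; }
    }
}
```

A unit test for the real device-side function would presumably want to cover both branches shown here: the mismatch path, which leaves the word untouched, and the retry path, which is triggered when neighbouring bytes are modified concurrently.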
My unit tests for `kat::tuple`, `kat::span` and `kat::array` don't include any case for passing these structures as arguments to a kernel! That needs to be rectified...
It would be useful to have a more C++'ish mechanism to obtain random numbers.
Copying or moving data has (at least) the following variants:
1. Element size in bytes
1.1 size is a power of 2
1.2 size is a natively-supported power of 2...
I wonder if we should consider a version of `append_to_global_memory()` where each thread may have its data elsewhere (at an address); and perhaps also a version where each thread has...
The `have_a_single_lane_compute` primitive currently returns a value. However, this value is valid only for the single computing lane, and the caller doesn't even know which lane that is. That...
I mis-designed the reduction and scan functions to take a neutral value as a template parameter; this works for integral types, not for floating-point types (until C++20). For the time...
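To illustrate the problem and one possible way around it: before C++20, a declaration such as `template <typename T, T Neutral>` is ill-formed when `T` is `float` or `double`. A minimal sketch of an alternative, assuming the neutral value is supplied by the operation type rather than by a non-type template parameter, follows. All names here (`plus_op`, `times_op`, `reduce`) are hypothetical, and the sequential loop is only a host-side stand-in for an actual device-side reduction.

```cpp
#include <cstddef>

// Hypothetical operation types (not cuda-kat API): each one carries its own
// neutral element as a static function, which works for floating-point types
// even before C++20.
template <typename T>
struct plus_op {
    static constexpr T neutral() { return T(0); }
    constexpr T operator()(T a, T b) const { return a + b; }
};

template <typename T>
struct times_op {
    static constexpr T neutral() { return T(1); }
    constexpr T operator()(T a, T b) const { return a * b; }
};

// Sequential stand-in for a reduction: the accumulator starts from the
// operation's neutral value instead of from a template argument.
template <typename Op, typename T>
T reduce(const T* data, std::size_t n, Op op = Op{})
{
    T acc = Op::neutral();
    for (std::size_t i = 0; i < n; ++i) { acc = op(acc, data[i]); }
    return acc;
}
```

With this shape, a call such as `reduce<plus_op<float>>(data, n)` names only the operation, and the element type is deduced from the pointer argument.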
When compiling for compute capability 3.0 cards, we get failures due to some instructions not being supported. Let's work around this.
Need to correct a typo in `time.cuh`: replace `__nanosleep(unsigned int ns);` with `__nanosleep(num_cycles);`. Detected when compiling for CC 7.x.
The four functions `__isGlobal()`, `__isLocal()`, `__isShared()` and `__isConstant()` are available from CUDA (as of 10.2, it seems), so we don't need to define them ourselves under `on_device/ptx/`. Probably.