cub
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl
I have `N` input buffers that I want to copy to `N` output buffers. I could sequentially call `cudaMemcpyAsync` `N` times, but in most cases it would be faster to...
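The baseline the issue contrasts against can be sketched as a host-side loop of `cudaMemcpyAsync` calls on one stream (a minimal sketch requiring the CUDA runtime; the function and parameter names are illustrative, not CUB API):

```cpp
#include <cuda_runtime.h>

// Naive baseline: one cudaMemcpyAsync per buffer. Each call carries
// per-call launch overhead, so for many small buffers a single fused
// batched copy is typically faster.
void copy_n_buffers(void* const* d_out, const void* const* d_in,
                    const size_t* bytes, int n, cudaStream_t stream)
{
  for (int i = 0; i < n; ++i)
  {
    // Device-to-device copy, enqueued asynchronously on `stream`.
    cudaMemcpyAsync(d_out[i], d_in[i], bytes[i],
                    cudaMemcpyDeviceToDevice, stream);
  }
}
```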
There are a number of tests that we currently aren't building because they are `ifdef`'d out -- @senior-zero discovered a few uncovered cases in `test_block_scan`, and @canonizer just reported another...
While addressing the following [issue](https://github.com/NVIDIA/cub/pull/514) I realized that we haven't implemented tests with `__CUDA_NO_HALF_CONVERSIONS__`. The original motivation is described [here](https://github.com/NVIDIA/cub/issues/394).
I am trying `cub::BlockRadixSort` with PyTorch; it is getting good performance, but I find it hard to use. For example, if I want to sort 1023 elements, then I...
Dear Maintainers, thank you for the awesome library, I really like it :) I have a strange launch failure when using `cub::BlockReduce` together with CUDA Dynamic Parallelism (CDP). When...
Per #503, it should be possible to introduce in-place versions of these algorithms.
Currently, scan is organized as follows:

```cpp
int max_dim_x;
if (CubDebug(error = cudaDeviceGetAttribute(&max_dim_x, cudaDevAttrMaxGridDimX, device_ordinal))) break;

// Run grids in epochs (in case number of tiles exceeds max x-dimension)
int ...
```
Currently, `cub::DeviceSegmentedRadixSort` launches `num_segments` blocks and each block works on one segment. This approach does not have good performance when the number of segments is small: https://github.com/pytorch/pytorch/issues/63456. For small number...
- Current documentation includes benchmarks between out-dated versions of CUB and Thrust, such as v1.7.1 (DeviceReduce: https://nvlabs.github.io/cub/structcub_1_1_device_reduce.html)
- Current Thrust uses CUB internally for some algorithms (e.g. DeviceReduce)
- It would...