cub

[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl

Results 91 cub issues

I have `N` input buffers that I want to copy to `N` output buffers. I could sequentially call `cudaMemcpyAsync` `N` times, but in most cases it would be faster to...

type: enhancement
P1: should have
helps: rapids

There are a number of tests that we currently aren't building because they are `ifdef`'d out -- @senior-zero discovered a few uncovered cases in `test_block_scan`, and @canonizer just reported another...

type: bug: functional
P0: must have
area: tests

While addressing the following [issue](https://github.com/NVIDIA/cub/pull/514) I realized that we haven't implemented tests with `__CUDA_NO_HALF_CONVERSIONS__`. The original motivation is described [here](https://github.com/NVIDIA/cub/issues/394).

type: enhancement
P1: should have
area: tests

I am trying `cub::BlockRadixSort` with PyTorch. It performs well, but I find it hard to use. For example, if I want to sort 1023 elements, then I...

type: enhancement
P2: nice to have
helps: pytorch
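
The issue text above is truncated, so the exact difficulty is not stated. Assuming it is the fixed tile size (a `cub::BlockRadixSort` specialization sorts exactly `BLOCK_THREADS * ITEMS_PER_THREAD` keys), one common workaround is to pad the input with a sentinel that sorts last. The following is a host-only sketch of that padding step. `pad_to_tile` is a hypothetical helper, not a CUB function, and `std::sort` is only suggested as a stand-in for the block-wide sort when checking the result.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper (not part of CUB): pad `keys` up to the next multiple
// of the tile size so every thread owns exactly `items_per_thread` keys.
// The padding value is the largest representable key, so after an ascending
// sort all padding ends up behind the real keys.
template <typename KeyT>
std::vector<KeyT> pad_to_tile(const std::vector<KeyT>& keys,
                              std::size_t block_threads,
                              std::size_t items_per_thread)
{
    const std::size_t tile = block_threads * items_per_thread;
    assert(tile > 0 && "tile size must be positive");

    // Round the key count up to a whole number of tiles.
    const std::size_t padded_size = ((keys.size() + tile - 1) / tile) * tile;

    std::vector<KeyT> padded(keys);
    padded.resize(padded_size, std::numeric_limits<KeyT>::max());
    return padded;
}
```

With 256 threads and 4 items per thread, 1023 keys are padded to 1024. After sorting the padded buffer in ascending order, the first 1023 entries are the sorted input, provided no real key is ordered after the sentinel.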

Dear Maintainers, thank you for the awesome library, I really like it :) I have a strange launch failure when using `cub::BlockReduce BlockReduce` together with CUDA Dynamic Parallelism (CDP). When...

info needed
triage

Per #503, it should be possible to introduce in-place versions of these algorithms.

type: enhancement
P3: backlog

Currently, scan is organized as follows:
```cpp
int max_dim_x;
if (CubDebug(error = cudaDeviceGetAttribute(&max_dim_x, cudaDevAttrMaxGridDimX, device_ordinal))) break;
// Run grids in epochs (in case number of tiles exceeds max x-dimension
int...
```

type: enhancement
P2: nice to have
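
The epoch scheme mentioned in the scan snippet above can be illustrated on the host. This is a minimal sketch, not CUB's dispatch code: `plan_epochs` is a hypothetical helper, and `max_dim_x` is assumed to be the value returned by `cudaDeviceGetAttribute` for `cudaDevAttrMaxGridDimX`, as in the quoted code. Each returned pair would correspond to one kernel launch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper (not part of CUB): split `num_tiles` into consecutive
// epochs of at most `max_dim_x` tiles each. Every pair holds
// (first tile index of the epoch, number of tiles launched in the epoch).
std::vector<std::pair<std::int64_t, std::int64_t>>
plan_epochs(std::int64_t num_tiles, std::int64_t max_dim_x)
{
    assert(max_dim_x > 0 && "grid x-dimension limit must be positive");

    std::vector<std::pair<std::int64_t, std::int64_t>> epochs;
    for (std::int64_t start = 0; start < num_tiles; start += max_dim_x)
    {
        // The last epoch may be shorter than the limit.
        epochs.emplace_back(start, std::min(max_dim_x, num_tiles - start));
    }
    return epochs;
}
```

For example, 10 tiles with a limit of 4 yield three epochs covering tiles 0 to 3, 4 to 7, and 8 to 9.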

only: cmake
blocked
P2: nice to have
only: gpuci

Currently, `cub::DeviceSegmentedRadixSort` launches `num_segments` blocks and each block works on one segment. This approach does not have good performance when the number of segments is small: https://github.com/pytorch/pytorch/issues/63456. For small number...

type: enhancement
area: performance
P3: backlog

- Current documentation includes benchmarks between outdated versions of cub and thrust, such as v1.7.1 (DeviceReduce https://nvlabs.github.io/cub/structcub_1_1_device_reduce.html)
- Current Thrust uses cub internally for some algorithms (e.g. DeviceReduce)
- It would...

type: enhancement
P1: should have
area: performance
area: docs