cub

[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl

Results 91 cub issues

I've found a corner case in the `cub::DeviceSegmentedSort` that can be optimized. If input data contains a lot of unit segments, the performance isn't optimal since we have to read...

type: enhancement
P2: nice to have
area: performance
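
The issue preview above is truncated, so the exact optimization it proposes is not visible here. As a rough illustration of why unit segments are a special case, the following host-only C++ sketch (a hypothetical reference function, not CUB code and not the `cub::DeviceSegmentedSort` implementation) copies the input once and sorts only segments with two or more items. The function name, the offsets layout, and the returned counter are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sequential reference, NOT part of CUB.
// `offsets` holds num_segments + 1 entries; segment s spans
// [offsets[s], offsets[s + 1]). Returns how many segments were sorted.
inline std::size_t segmented_sort_skip_unit(const std::vector<int>& keys_in,
                                            const std::vector<std::size_t>& offsets,
                                            std::vector<int>& keys_out)
{
    assert(!offsets.empty() && offsets.back() == keys_in.size());

    // Segments of length 0 or 1 are already sorted, so a plain copy is enough.
    keys_out = keys_in;

    std::size_t sorted_segments = 0;
    for (std::size_t s = 0; s + 1 < offsets.size(); ++s) {
        const std::size_t begin = offsets[s];
        const std::size_t end = offsets[s + 1];
        assert(begin <= end);

        if (end - begin < 2) {
            continue;  // unit (or empty) segment: no sorting work needed
        }

        std::sort(keys_out.begin() + static_cast<std::ptrdiff_t>(begin),
                  keys_out.begin() + static_cast<std::ptrdiff_t>(end));
        ++sorted_segments;
    }
    return sorted_segments;
}
```

With many unit segments, nearly all of the work in this sketch is the initial copy, which is the kind of imbalance the issue appears to be about.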

During the development of the [new segmented sort](https://github.com/NVIDIA/cub/pull/357), I extracted an `AgentSegmentedRadixSort` class. It's mostly based on the existing `DeviceSegmentedRadixSort` implementation. The only differences are: 1) `while (current_bit < end_bit)`...

type: enhancement
P2: nice to have
area: performance

After `DeviceFree`, an allocation can in theory be reused on the same stream on which it was previously allocated. Stream ordering ensures that all operations before `DeviceFree` are completed before accessing the reused...

type: bug: functional
P1: should have
repro: unverified

This issue is a follow-up to https://github.com/NVIDIA/cub/issues/369. The documentation of `cub::GridBarrier` is unclear about the grid size limitation, which could be throttled by the SM count, block...

type: enhancement
P2: nice to have
area: docs

The `AgentScan` structure doesn't distinguish between inclusive and exclusive variants at the load stage. Therefore, the current version of `cub::DeviceScan::ExclusiveScan` requires `num_items` items in the `input` data. According to the...

type: enhancement
P2: nice to have
helps: rapids
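
The preview above is cut off, but the underlying arithmetic can be shown independently: an exclusive scan writes `num_items` outputs while depending only on the first `num_items - 1` inputs, because output `i` combines the initial value with inputs `0` through `i - 1`. The sketch below is a hypothetical sequential reference for the sum case, not CUB's `AgentScan` and not a description of how `cub::DeviceScan::ExclusiveScan` behaves; the function name and signature are assumptions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sequential reference, NOT part of CUB.
// Writes output[0 .. num_items - 1] but reads only input[0 .. num_items - 2].
inline void exclusive_sum_reference(const int* input,
                                    int* output,
                                    std::size_t num_items,
                                    int init)
{
    int running = init;
    for (std::size_t i = 0; i < num_items; ++i) {
        output[i] = running;       // sum of init and all inputs before index i
        if (i + 1 < num_items) {
            running += input[i];   // the last input element is never needed
        }
    }
}
```

A caller could therefore pass an input buffer one element shorter than the output buffer to this reference function, which is the property the load stage would need to honor for the exclusive variant.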

In "/cub/grid/grid_barrier.cuh" I noticed some incorrect semantics being used. A flag "d_sync" is used to communicate between threadblocks and enforce syncing. In lines 98 and 120, LOAD_CG operations are used...

type: bug: functional
P1: should have

In the following code snippet: https://github.com/NVIDIA/cub/blob/2200c6af27710264023314f1598c3ed1f46560cb/cub/util_allocator.cuh#L617-L625 `recached` can be false because `d_ptr` is not in `live_blocks`, that is, when the earlier `live_blocks.find()` fails. - If it is...

type: bug: functional
P1: should have
triage
repro: unverified

Classes from `util_type.cuh` are not listed in https://nvlabs.github.io/cub/classes.html even though they have Doxygen comments. One example is `DoubleBuffer`: https://github.com/NVIDIA/cub/blob/main/cub/util_type.cuh#L791

type: enhancement
P1: should have
area: docs

`BlockExchange` provides a `WARP_TIME_SLICING` template parameter that reduces the shared memory footprint. Most of the algorithms in `BlockExchange` have specializations for different `WARP_TIME_SLICING` values, but this isn't the case for...

type: bug: functional
P2: nice to have

The following code fails when invoking `cub::DeviceHistogram::HistogramEven`. **NOTE**: It fails **ONLY** for some values of `n` and `dim` in the code below. @danpovey

```bash
(py38) fangjun:~/open-source/k2/build_debug$ ./bin/cu_cub_test
Invoking DeviceHistogramInitKernel()
Invoking...
```

type: bug: functional
P1: should have
repro: verified