cub
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl
I have `N` input buffers that I want to copy to `N` output buffers. I could sequentially call `cudaMemcpyAsync` `N` times, but in most cases it would be faster to...
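The baseline the issue contrasts against can be sketched as a host-side loop of `cudaMemcpyAsync` calls on one stream (a minimal sketch requiring the CUDA runtime; the function and parameter names are illustrative, not CUB API):

```cpp
#include <cuda_runtime.h>

// Naive baseline: one cudaMemcpyAsync per buffer. Each call carries
// per-call launch overhead, so for many small buffers a single fused
// batched copy is typically faster.
void copy_n_buffers(void* const* d_out, const void* const* d_in,
                    const size_t* bytes, int n, cudaStream_t stream)
{
  for (int i = 0; i < n; ++i)
  {
    // Device-to-device copy, enqueued asynchronously on `stream`.
    cudaMemcpyAsync(d_out[i], d_in[i], bytes[i],
                    cudaMemcpyDeviceToDevice, stream);
  }
}
```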
There are a number of tests that we currently aren't building because they are `ifdef`'d out -- @senior-zero discovered a few uncovered cases in `test_block_scan`, and @canonizer just reported another...
While addressing the following [issue](https://github.com/NVIDIA/cub/pull/514) I realized that we haven't implemented tests with `__CUDA_NO_HALF_CONVERSIONS__`. The original motivation is described [here](https://github.com/NVIDIA/cub/issues/394).
I am trying `cub::BlockRadixSort` with PyTorch; it is getting good performance, but I find it hard to use. For example, if I want to sort 1023 elements, then I...
Dear Maintainers, thank you for the awesome library, I really like it :) I have a strange launch failure when using `cub::BlockReduce` together with CUDA Dynamic Parallelism (CDP). When...
Per #503, it should be possible to introduce in-place versions of these algorithms.
Currently, scan is organized as follows:

```cpp
int max_dim_x;
if (CubDebug(error = cudaDeviceGetAttribute(&max_dim_x, cudaDevAttrMaxGridDimX, device_ordinal))) break;

// Run grids in epochs (in case number of tiles exceeds max x-dimension)
int ...
```
Currently, `cub::DeviceSegmentedRadixSort` launches `num_segments` blocks and each block works on one segment. This approach does not have good performance when the number of segments is small: https://github.com/pytorch/pytorch/issues/63456. For small number...
- Current documentation includes benchmarks between out-dated versions of CUB and Thrust, such as v1.7.1 (DeviceReduce: https://nvlabs.github.io/cub/structcub_1_1_device_reduce.html)
- Current Thrust uses CUB internally for some algorithms (e.g. DeviceReduce)
- It would...