cub

[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl

Results 91 cub issues

Hi, I spent quite some time chasing down a bug which ended up being a misunderstanding of the behavior of block scans when an initial val param is given. It...

type: enhancement
P1: should have
area: docs

Not sure if this is a bug or a feature, but it surely is not the behavior the docs suggest. I tried this with cub 1.8.0 and CUDA 10.1 and...

type: enhancement
P1: should have
area: docs

Hi, I am observing a problem with the cub library for sum reduction with standard datatypes. ## Summary A sum reduction using `cub::DeviceReduce::Sum` for integers causes pytorch code to crash...

type: bug: functional
P1: should have
triage

cub's block reductions, used in kernels compiled with nvrtc (via Jitify), fail to compile for specific block sizes. See the error produced below, where template type deduction seems to be...

type: bug: functional
P1: should have
triage
helps: quda

Currently, cub includes a number of system headers causing errors such as > C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\include\vadefs.h(143): error: A function without execution space annotations (__host__/__device__/__global__) is considered a...

type: enhancement
P2: nice to have
area: ci

When I used an iterator as input to device-reduce, the reduction kernel was limited by the number of registers. The iterator does a few math operations on data in global memory...

type: enhancement
P2: nice to have

When using `WarpScanShfl` from `warp_scan_shfl.cuh` inside a `while()` loop and in conjunction with a sub-warp `LOGICAL_WARP_THREADS` argument, i.e. `LOGICAL_WARP_THREADS=2^n` with `n

type: bug: functional
info needed
P3: backlog
repro: missing

For all warp-based cub APIs, e.g. WarpScan, the examples given in the documentation do not use `__syncwarp` to sync threads within a warp. However, it seems that on Volta, threads...

type: enhancement
P1: should have
area: docs

FYI, for 8-bit data HistogramRange in 1.8.0 has about half the performance it had in 1.3.2B, but everything else is about twice as fast. V100 on CUDA 9.1 with an...

type: enhancement
area: performance
unverified
P3: backlog

```cpp
...
int input = 1;
int init = 10;
int inclusive_output, exclusive_output;
using WarpScan = cub::WarpScan;
__shared__ typename WarpScan::TempStorage storage[warps_no];
WarpScan(storage[warp_id])
    .Scan(input, inclusive_output, exclusive_output, init, cub::Sum());
...
```
Should...

type: enhancement
P1: should have
area: docs
area: tests