cub
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl
Hi, I spent quite some time chasing down a bug that turned out to be a misunderstanding of how block scans behave when an initial value parameter is given. It...
Not sure if this is a bug or a feature, but it surely is not the behavior the docs suggest. I tried this with cub 1.8.0 and CUDA 10.1 and...
Hi, I am observing a problem with the cub library for sum reduction with standard datatypes. ## Summary A sum reduction using `cub::DeviceReduce::Sum` for integers causes pytorch code to crash...
Kernels that use cub's block reductions and are compiled with nvrtc (via Jitify) fail to compile for specific block sizes. See the error produced below, where template type deduction seems to be...
Currently, cub includes a number of system headers causing errors such as > C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\include\vadefs.h(143): error: A function without execution space annotations (__host__/__device__/__global__) is considered a...
When I used an iterator as the input to device-reduce, the reduction kernel was limited by the number of registers. The iterator does a few math operations on data in global memory...
When using `WarpScanShfl` from `warp_scan_shfl.cuh` inside a `while()` loop and in conjunction with a sub-warp `LOGICAL_WARP_THREADS` argument, i.e. `LOGICAL_WARP_THREADS=2^n` with `n
For all warp-based cub APIs, such as WarpScan, the examples given in the documentation do not use __syncwarp to synchronize threads within a warp. However, it seems that on Volta, threads...
FYI, HistogramRange in 1.8.0 runs at about half the performance it had in 1.3.2B for 8-bit data, but everything else is about twice as fast. V100 on CUDA 9.1 with an...
```cpp
...
int input = 1;
int init = 10;
int inclusive_output, exclusive_output;
using WarpScan = cub::WarpScan;
__shared__ typename WarpScan::TempStorage storage[warps_no];
WarpScan(storage[warp_id])
    .Scan(input, inclusive_output, exclusive_output, init, cub::Sum());
...
```
Should...
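As a point of comparison for the snippet above, here is a minimal host-side sketch of what an inclusive and an exclusive sum with an initial value produce under the C++17 standard library's definitions. It does not call cub and makes no claim about what `cub::WarpScan` actually returns, since that is what the issue is asking about. The helper names `reference_exclusive_sum` and `reference_inclusive_sum` are hypothetical and were introduced only for this illustration.

```cpp
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of cub): exclusive sum seeded with `init`,
// per std::exclusive_scan: out[i] = init + in[0] + ... + in[i-1].
std::vector<int> reference_exclusive_sum(const std::vector<int>& in, int init) {
    std::vector<int> out(in.size());
    std::exclusive_scan(in.begin(), in.end(), out.begin(), init);
    return out;
}

// Hypothetical helper (not part of cub): inclusive sum seeded with `init`,
// per std::inclusive_scan: out[i] = init + in[0] + ... + in[i].
std::vector<int> reference_inclusive_sum(const std::vector<int>& in, int init) {
    std::vector<int> out(in.size());
    std::inclusive_scan(in.begin(), in.end(), out.begin(), std::plus<int>(), init);
    return out;
}
```

With four lanes each contributing `input = 1` and `init = 10`, these reference definitions give `10, 11, 12, 13` for the exclusive sum and `11, 12, 13, 14` for the inclusive sum. Comparing a kernel's per-lane outputs against such a host reference is one way to pin down where a library's treatment of the initial value differs from what one expects.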