cub issues

[Documentation] Behavior of block scans with initial value param

Hi, I spent quite some time chasing down a bug which ended up being a misunderstanding of the behavior of block scans when an initial val param is given. It...

pgera

type: enhancement

P1: should have

area: docs

[Bug?] WarpReduce: Unexpected results with logical warp size < 32

2

Not sure if this is a bug or a feature, but it surely is not the behavior the docs suggest. I tried this with cub 1.8.0 and CUDA 10.1 and...

RaulPPelaez

type: enhancement

P1: should have

area: docs

Problem with cub::DeviceReduce::Sum and integer addition

5

Hi, I am observing a problem with the cub library for sum reduction with standard datatypes. ## Summary A sum reduction using `cub::DeviceReduce::Sum` for integers causes pytorch code to crash...

classner

type: bug: functional

P1: should have

triage

cub block reductions fail to compile correctly with nvrtc for certain block sizes

5

cub's block reductions, when used in kernels compiled with nvrtc (via Jitify), fail to compile for specific block sizes. See the error produced below, where template type deduction seems to be...

maddyscientist

type: bug: functional

P1: should have

triage

helps: quda

Make cub NVRTC-compatible

2

Currently, cub includes a number of system headers causing errors such as > C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\include\vadefs.h(143): error: A function without execution space annotations (__host__/__device__/__global__) is considered a...

l0calh05t

type: enhancement

P2: nice to have

area: ci

Allow custom tuning policies to be passed into device algorithms.

1

When I used an iterator as the input to a device-reduce, the reduction kernel was limited by the number of registers. The iterator does a few math operations on data in global memory...

sh1ng

type: enhancement

P2: nice to have

__shfl_sync instructions may have wrong member mask

2

When using `WarpScanShfl` from `warp_scan_shfl.cuh` inside a `while()` loop and in conjunction with a sub-warp `LOGICAL_WARP_THREADS` argument, i.e. `LOGICAL_WARP_THREADS=2^n` with `n

jglaser

type: bug: functional

info needed

P3: backlog

repro: missing

Explicitly document synchronization requirements in Warp-level APIs

1

For all warp-based cub APIs, say WarpScan, the examples given in the documentation do not use __syncwarp to sync threads within a warp. However, it seems that in Volta, threads...

desert0616

type: enhancement

P1: should have

area: docs

Performance regression in 8-bit HistogramRange between 1.3.2B and 1.8.0

1

FYI, HistogramRange has about half the performance for 8-bit data in 1.8.0 that it had in 1.3.2B, but everything else is about twice as fast. V100 on CUDA 9.1 with an...

dumerrill

type: enhancement

area: performance

unverified

P3: backlog

Document and test WarpScan::Scan's treatment of `init` argument

1

```cpp
...
int input = 1;
int init = 10;
int inclusive_output, exclusive_output;
using WarpScan = cub::WarpScan;
__shared__ typename WarpScan::TempStorage storage[warps_no];
WarpScan(storage[warp_id])
    .Scan(input, inclusive_output, exclusive_output, init, cub::Sum());
...
```

Should...

jszuppe

type: enhancement

P1: should have

area: docs

area: tests
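
For readers comparing possible readings of an `init` argument in a scan, here is a minimal host-side sketch using the C++17 standard algorithms. It does not call cub and makes no claim about what `WarpScan::Scan` actually returns; the helper names (`inclusive_no_init`, `inclusive_with_init`, `exclusive_with_init`) are hypothetical and exist only for this illustration. It reuses the values from the excerpt above (`input = 1` per element, `init = 10`).

```cpp
#include <functional>
#include <numeric>
#include <vector>

// Host-side reference only: models scan semantics with std:: algorithms.
// None of these helpers are part of cub; the names are illustrative.

// Inclusive sum with no seed: out[i] = in[0] + ... + in[i].
std::vector<int> inclusive_no_init(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    std::inclusive_scan(in.begin(), in.end(), out.begin());
    return out;
}

// Inclusive sum seeded with init: out[i] = init + in[0] + ... + in[i].
std::vector<int> inclusive_with_init(const std::vector<int>& in, int init) {
    std::vector<int> out(in.size());
    std::inclusive_scan(in.begin(), in.end(), out.begin(), std::plus<>(), init);
    return out;
}

// Exclusive sum seeded with init: out[0] = init,
// out[i] = init + in[0] + ... + in[i-1].
std::vector<int> exclusive_with_init(const std::vector<int>& in, int init) {
    std::vector<int> out(in.size());
    std::exclusive_scan(in.begin(), in.end(), out.begin(), init);
    return out;
}
```

With four elements all equal to 1 and `init = 10`, the unseeded inclusive scan yields 1, 2, 3, 4, the seeded inclusive scan yields 11, 12, 13, 14, and the seeded exclusive scan yields 10, 11, 12, 13. Which of the two inclusive results a given library documents is exactly the kind of detail the issue asks to have written down and tested.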
