cub
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl
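As a taste of what these cooperative primitives look like, here is a minimal block-wide reduction using `cub::BlockReduce`; the kernel name and the choice of 128 threads per block are illustrative.

```cuda
#include <cub/cub.cuh>

// Each block of 128 threads reduces its 128 inputs to a single per-block sum.
__global__ void block_sum(const int *in, int *out)
{
    using BlockReduce = cub::BlockReduce<int, 128>;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    int thread_value = in[blockIdx.x * 128 + threadIdx.x];
    int block_total  = BlockReduce(temp_storage).Sum(thread_value);

    // Only thread 0 holds the valid block-wide aggregate.
    if (threadIdx.x == 0)
    {
        out[blockIdx.x] = block_total;
    }
}
```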
To give the user some clue about what's happening if the program gets compiled on a node with no GPU, or if it gets compiled with a different compute capability than the one...
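A minimal host-side sketch of the kind of check this refers to (not the actual CUB diagnostic): report when no CUDA device is visible, and print the device's compute capability so a mismatch with the compiled architectures is easier to spot. The function name is hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative only; not CUB's actual error reporting. Prints a hint when no
// GPU is visible, or shows the device's compute capability so the user can
// compare it against the -arch the binary was built for.
bool report_device_status()
{
    int device_count = 0;
    cudaError_t status = cudaGetDeviceCount(&device_count);

    if (status != cudaSuccess || device_count == 0)
    {
        std::fprintf(stderr,
                     "No CUDA device detected (%s); device code cannot run.\n",
                     cudaGetErrorString(status));
        return false;
    }

    int major = 0, minor = 0;
    cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, 0);
    cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, 0);
    std::fprintf(stderr,
                 "Device 0 has compute capability %d.%d; make sure the binary "
                 "was compiled for this architecture.\n",
                 major, minor);
    return true;
}
```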
Before porting to CUB, the Thrust implementation of merge sort didn't have a `*copy` version. When introducing the `Copy` overload, I followed the generic CUB scheme of selecting the output iterator value...
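For context, the scheme referred to here picks the value type from the output iterator and falls back to the input iterator's value type when the output iterator reports `void` (as discard-style output iterators do). A sketch of one common form of that selection, with `InputIteratorT` / `OutputIteratorT` as placeholder template parameter names:

```cuda
#include <iterator>
#include <type_traits>

// Prefer the output iterator's value_type; fall back to the input iterator's
// value_type when the output reports void. Names here are placeholders.
template <typename InputIteratorT, typename OutputIteratorT>
struct selected_value
{
    using output_t = typename std::iterator_traits<OutputIteratorT>::value_type;
    using input_t  = typename std::iterator_traits<InputIteratorT>::value_type;

    using type = typename std::conditional<std::is_same<output_t, void>::value,
                                           input_t,
                                           output_t>::type;
};
```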
This PR addresses the following [issue](https://github.com/NVIDIA/cccl/issues/902) by replacing `__launch_bounds__` usages with `CUB_DETAIL_LAUNCH_BOUNDS`. `CUB_DETAIL_LAUNCH_BOUNDS` leads to `__launch_bounds__` usage only when RDC is **not** specified. Builds without RDC are not affected by...
Specifying `__launch_bounds__` in the presence of RDC has proven to be troublesome and unreliable. We have to abstract it out so that launch bounds are not specified when RDC is...
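A minimal sketch of how such a macro can be wired up, assuming nvcc's `__CUDACC_RDC__` predefine is available to detect RDC builds; the actual definition introduced by the PR may differ.

```cuda
// Sketch only: expand to __launch_bounds__ unless the translation unit is
// being compiled with relocatable device code (-rdc=true), in which case the
// annotation is dropped. nvcc defines __CUDACC_RDC__ for RDC builds.
#if defined(__CUDACC_RDC__)
#  define CUB_DETAIL_LAUNCH_BOUNDS(...)
#else
#  define CUB_DETAIL_LAUNCH_BOUNDS(...) __launch_bounds__(__VA_ARGS__)
#endif

// Usage: the launch-bounds annotation only takes effect in non-RDC builds.
template <int BLOCK_THREADS>
__global__ void CUB_DETAIL_LAUNCH_BOUNDS(BLOCK_THREADS) example_kernel(int *data)
{
    data[threadIdx.x] += 1;
}
```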
Currently, `BlockRadixRankMatchEarlyCounts` doesn't work in some specific cases (1...
Currently, we have a set of block radix rank facilities: `BlockRadixRank`, `BlockRadixRankMatch`, and `BlockRadixRankMatchEarlyCounts`. There's also an `enum BlockScanAlgorithm` that describes the differences between these algorithms. Unlike the...
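For comparison, `BlockScan` already exposes its algorithm choice through the `BlockScanAlgorithm` enum template parameter, whereas the radix rank variants above are separate classes. A small example of the existing `BlockScan` usage:

```cuda
#include <cub/cub.cuh>

// BlockScan selects its algorithm through the BlockScanAlgorithm enum
// template parameter; the radix rank facilities are separate classes instead.
__global__ void prefix_sum_kernel(int *data)
{
    constexpr int BLOCK_THREADS = 128;
    using BlockScan = cub::BlockScan<int, BLOCK_THREADS, cub::BLOCK_SCAN_WARP_SCANS>;

    __shared__ typename BlockScan::TempStorage temp_storage;

    int thread_value = data[threadIdx.x];
    BlockScan(temp_storage).ExclusiveSum(thread_value, thread_value);
    data[threadIdx.x] = thread_value;
}
```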
At https://github.com/nvidia/cub/blob/main/cub/block/block_reduce.cuh#L135, the image shown is the one for block_scan.
## Current Situation

As discussed in https://github.com/NVIDIA/cub/issues/545, CUB needs to query the current device's compute capability in order to know which tuning policy to use for launching the kernel. Currently,...
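A sketch of the kind of cached query this might involve (not CUB's actual mechanism): look the compute capability up once per device via `cudaDeviceGetAttribute` and reuse the result on later calls. The function name, encoding, and device limit are assumptions.

```cuda
#include <cuda_runtime.h>

// Illustrative cached compute-capability lookup; thread safety is omitted.
cudaError_t get_cached_sm_version(int device, int &sm_version)
{
    static int cache[64]; // 0 means "not yet queried"; arbitrary size limit

    if (device < 0 || device >= 64)
    {
        return cudaErrorInvalidDevice;
    }

    if (cache[device] == 0)
    {
        int major = 0, minor = 0;
        cudaError_t error =
            cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device);
        if (error != cudaSuccess) return error;
        error = cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, device);
        if (error != cudaSuccess) return error;

        cache[device] = major * 100 + minor * 10; // e.g. 800 for sm_80
    }

    sm_version = cache[device];
    return cudaSuccess;
}
```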
I'd like to investigate implementing a reduction for associative but non-commutative operations. Related to https://github.com/NVIDIA/thrust/issues/1434. This kind of algorithm comes in handy when establishing the global context in parsing...
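A small host-side illustration (an assumed example, not taken from the issue) of why operand order matters for such operators: the combine step used for parenthesis matching is associative but not commutative, so a parallel reduction must preserve the original order of the segments.

```cuda
#include <algorithm>
#include <string>

// Per-segment parsing state: the net depth change (delta) and the minimum
// depth reached over all prefixes of the segment.
struct ParenState
{
    int min_depth;
    int delta;
};

// Associative but NOT commutative: rhs is interpreted as the text that comes
// after lhs, so its depths are shifted by lhs.delta.
ParenState combine(ParenState lhs, ParenState rhs)
{
    return {std::min(lhs.min_depth, lhs.delta + rhs.min_depth),
            lhs.delta + rhs.delta};
}

ParenState state_of(char c)
{
    if (c == '(') return {0, 1};   // depth never dips below 0 inside "("
    if (c == ')') return {-1, -1}; // depth dips to -1 inside ")"
    return {0, 0};
}

// A string is balanced iff the reduced state has delta == 0 and
// min_depth >= 0; permuting the operands, as a commutative reduction may do,
// would change the answer.
bool balanced(const std::string &s)
{
    ParenState total{0, 0};
    for (char c : s)
    {
        total = combine(total, state_of(c));
    }
    return total.delta == 0 && total.min_depth >= 0;
}
```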