
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl

Results 91 cub issues

Fixes issue NVIDIA/cccl#854

type: bug: functional
info needed
P1: should have
repro: missing

n.b. this change to test code introduces a lot of failures in the CUB unit tests. I'm not submitting those fixes -- I've done a few but not all of...

P2: nice to have
triage

1. There is no difference in performance and compilation time for the reduce with simple operators. On complex operators (256 sqrt calls), the compilation time is up to 2.4 times...

The cub::BFE and cub::BFI wrappers are useful, because nvcc likes to place bitfield structs in local memory rather than registers, making it far more efficient to work with 64-bit integers...

type: enhancement
P1: should have
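
To make the bitfield point in the `cub::BFE` / `cub::BFI` entry above concrete, here is a minimal host-side sketch of extracting and inserting bit fields in a 64-bit integer with shifts and masks. The names `bfe64`, `bfi64`, and `roundtrip_example` are hypothetical helpers written for illustration; they are not CUB's API and do not reproduce CUB's signatures. The sketch assumes C++14 or later and `bit_start < 64`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of CUB): build a mask of the low num_bits bits.
constexpr std::uint64_t low_mask(unsigned num_bits)
{
    return (num_bits >= 64) ? ~std::uint64_t{0}
                            : ((std::uint64_t{1} << num_bits) - 1);
}

// Hypothetical helper: extract num_bits bits of source starting at bit_start.
// Assumes bit_start < 64.
constexpr std::uint64_t bfe64(std::uint64_t source, unsigned bit_start, unsigned num_bits)
{
    return (source >> bit_start) & low_mask(num_bits);
}

// Hypothetical helper: return target with the low num_bits bits of value
// written into the field that starts at bit_start. Assumes bit_start < 64.
constexpr std::uint64_t bfi64(std::uint64_t target, std::uint64_t value,
                              unsigned bit_start, unsigned num_bits)
{
    const std::uint64_t mask = low_mask(num_bits);
    return (target & ~(mask << bit_start)) | ((value & mask) << bit_start);
}

// Pack a 6-bit field at bit 10 and read it back.
inline std::uint64_t roundtrip_example()
{
    const std::uint64_t packed = bfi64(0, 0x2A, 10, 6);
    assert(bfe64(packed, 10, 6) == 0x2A);
    return packed;
}
```

Keeping the packed value in a single integer and going through helpers like these avoids declaring a bitfield struct at all, which is the situation the entry describes.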

This is an exposure for Ampere's asynchronous copy mechanism, based on [CUTLASS's implementation](https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/arch/memory_sm80.h). These primitives are useful for people writing their own kernels, BUT we can also potentially use them...

This failure doesn't reproduce with GCC. Disabling for now.
```
[17:53:21]:wash@voyager:/home/wash/development/nvidia/cuda_linux_p4/sw/gpgpu/thrust:0:$ ci/local/build.bash -i gpuci/cccl:cuda11.3.1-devel-ubuntu20.04-icclatest cub.cpp17.test.device_histogram
cuda11.3.1-devel-ubuntu20.04-icclatest: Pulling from gpuci/cccl
Digest: sha256:e20e996de6f79a75754789746ad0e3535ddc82b20706fde67db489f56ca5cefc
Status: Image is up to date for gpuci/cccl:cuda11.3.1-devel-ubuntu20.04-icclatest
docker.io/gpuci/cccl:cuda11.3.1-devel-ubuntu20.04-icclatest...
```

P1: should have
type: bug: compiler

There's probably some odd floating point nonsense happening here. Doesn't reproduce with GCC. Disabling for now.
```
[19:55:32]:wash@voyager:/home/wash/development/nvidia/cuda_linux_p4/sw/gpgpu/thrust:0:$ ci/local/build.bash -i gpuci/cccl:cuda11.3.1-devel-ubuntu20.04-icclatest cub.cpp17.test.device_radix_sort.minimal
cuda11.3.1-devel-ubuntu20.04-icclatest: Pulling from gpuci/cccl
Digest: sha256:e20e996de6f79a75754789746ad0e3535ddc82b20706fde67db489f56ca5cefc
Status: Image...
```

P1: should have
type: bug: compiler

Sorry if I'm missing the obvious point, but: - The [documentation for BlockRadixSort](https://nvlabs.github.io/cub/classcub_1_1_block_radix_sort.html#pub-methods) divides the sorting methods into purely "blocked arrangements" and "blocked arrangement -> striped arrangement". - Another quote:...

question

There are actually two questions, but I believe they are related. * How to block-sort only the first N elements? This problem usually occurs when processing the last block. Currently I'm using dummy...

question
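
For the partial-block question above, one possible workaround is to pad the unused slots of a fixed-size tile with a sentinel that sorts last, so that a full-tile sort leaves the valid items in the leading positions. The snippet is truncated at "dummy...", so whether the author pads this way is an assumption. The sketch below is plain host C++ that only illustrates the padding idea; `sort_partial_tile` is a hypothetical name and the code does not call CUB.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical helper: sort the first num_valid items of input inside a
// fixed-size tile. Unused slots are filled with the largest int, so after
// an ascending sort the valid items occupy positions [0, num_valid).
template <std::size_t TileSize>
std::array<int, TileSize> sort_partial_tile(const int* input, std::size_t num_valid)
{
    assert(num_valid <= TileSize);

    std::array<int, TileSize> tile;
    tile.fill(std::numeric_limits<int>::max());         // padding sorts last
    std::copy(input, input + num_valid, tile.begin());  // load the valid items
    std::sort(tile.begin(), tile.end());                // full-tile sort
    return tile;
}
```

The caller reads back only the first `num_valid` entries. Valid items equal to the sentinel cannot be told apart from padding by value, but since they are equal they still sort correctly, so the leading `num_valid` positions remain a correct result.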