Georgii Evtushenko issues

Results: 54 issues of Georgii Evtushenko

Device-scope segmented reduce temp storage allocation issue

Device-scope segmented reduce contains the following short-circuit: ```cpp if (num_segments

type: bug: functional

P1: should have

Allow RDC builds without CDP support

The following code leads to excessive memory footprint when compiled with `-rdc=true`: ```cpp #include #include template class cub::DispatchSegmentedSort; int main() { size_t free_byte{}, total_byte{}; double used{}; cudaMemGetInfo(&free_byte, &total_byte); used =...

type: enhancement

P1: should have

Restrict in-place execution

There's a blind spot in the Thrust/CUB in-place execution guarantees that I believe should be addressed. Thrust/CUB allow iterators to point to the same memory, while there's no restriction on...

type: enhancement

P1: should have

area: docs

Guarantee stability in segmented radix sort

The guarantees of `cub::DeviceSegmentedRadixSort` are stated indirectly, which confuses our users. Basically, we are saying that `cub::DeviceSegmentedRadixSort` shares its implementation with `cub::DeviceRadixSort` instead of providing a list of guarantees...
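To make the requested guarantee concrete, here is a minimal host-side sketch of what stability means for a segmented key-value sort: within each segment, items with equal keys keep their input order. This is a sequential reference built on `std::stable_sort`, not CUB code; `KeyValue`, `segmented_stable_sort`, and the `offsets` layout (segment `s` spans `[offsets[s], offsets[s + 1])`) are names and conventions assumed for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical key-value pair: sorted by `first`, `second` is the payload.
using KeyValue = std::pair<int, char>;

// Sorts every segment [offsets[s], offsets[s + 1]) by key only.
// std::stable_sort keeps items with equal keys in their input order.
void segmented_stable_sort(std::vector<KeyValue>& items,
                           const std::vector<std::size_t>& offsets) {
  for (std::size_t s = 0; s + 1 < offsets.size(); ++s) {
    assert(offsets[s] <= offsets[s + 1] && offsets[s + 1] <= items.size());
    const auto first = items.begin() + static_cast<std::ptrdiff_t>(offsets[s]);
    const auto last = items.begin() + static_cast<std::ptrdiff_t>(offsets[s + 1]);
    std::stable_sort(first, last, [](const KeyValue& a, const KeyValue& b) {
      return a.first < b.first;  // compare keys only, never the payload
    });
  }
}
```

With two segments `{(2,a), (1,b), (2,c), (1,d)}` and `{(3,e), (3,f), (0,g)}`, a stable result is `{(1,b), (1,d), (2,a), (2,c)}` and `{(0,g), (3,e), (3,f)}`: the payloads of equal keys never swap.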

type: enhancement

P1: should have

area: docs

Test with __CUDA_NO_HALF_CONVERSIONS__

While addressing the following [issue](https://github.com/NVIDIA/cub/pull/514) I realized that we haven't implemented tests with `__CUDA_NO_HALF_CONVERSIONS__`. The original motivation is described [here](https://github.com/NVIDIA/cub/issues/394).

type: enhancement

P1: should have

area: tests

Eliminate max_dim_x checks

Currently, scan is organized as follows: ```cpp int max_dim_x; if (CubDebug(error = cudaDeviceGetAttribute(&max_dim_x, cudaDevAttrMaxGridDimX, device_ordinal))) break; // Run grids in epochs (in case number of tiles exceeds max x-dimension int...
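The "epochs" idea in the comment above can be sketched on the host without any CUDA calls: when the number of tiles exceeds the maximum grid x-dimension, the work is split into several launches of at most `max_dim_x` tiles each. This is only an illustration of that arithmetic, not the actual scan dispatch code; `Epoch` and `split_into_epochs` are hypothetical names, and in real code `max_dim_x` would come from the `cudaDeviceGetAttribute` query shown in the snippet.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One launch ("epoch") covers tiles [first_tile, first_tile + num_tiles).
struct Epoch {
  std::int64_t first_tile;
  std::int64_t num_tiles;
};

// Splits num_tiles into launches of at most max_dim_x tiles each.
// 64-bit arithmetic avoids overflow when max_dim_x is close to INT_MAX.
std::vector<Epoch> split_into_epochs(std::int64_t num_tiles,
                                     std::int64_t max_dim_x) {
  assert(num_tiles >= 0 && max_dim_x > 0);
  std::vector<Epoch> epochs;
  for (std::int64_t start = 0; start < num_tiles; start += max_dim_x) {
    epochs.push_back({start, std::min(max_dim_x, num_tiles - start)});
  }
  return epochs;
}
```

For example, 10 tiles with a limit of 4 yield three epochs covering tiles 0-3, 4-7, and 8-9. When the tile count never exceeds the limit, the loop runs exactly once, which is the case where the check is redundant.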

type: enhancement

P2: nice to have

Fuse segments partitioning with sorting of small segments

I've found a corner case in `cub::DeviceSegmentedSort` that can be optimized. If input data contains a lot of unit segments, the performance isn't optimal since we have to read...

type: enhancement

P2: nice to have

area: performance

Optimize DeviceSegmentedRadixSort

During the development of the [new segmented sort](https://github.com/NVIDIA/cub/pull/357), I extracted an `AgentSegmentedRadixSort` class. It's mostly based on the existing `DeviceSegmentedRadixSort` implementation. The only differences are: 1) `while (current_bit < end_bit)`...

type: enhancement

P2: nice to have

area: performance

Ignore last element in exclusive scan

The `AgentScan` structure doesn't distinguish between inclusive and exclusive variants at the load stage. Therefore, the current version of `cub::DeviceScan::ExclusiveScan` requires `num_items` items in the `input` data. According to the...
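A sequential reference makes the point of the title visible: in an exclusive scan, `output[i]` depends only on `input[0]` through `input[i - 1]`, so the last input element never contributes to any output. The sketch below is a plain host-side loop, not the `AgentScan` implementation; `exclusive_sum` and its signature are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential exclusive sum: output[i] = init + input[0] + ... + input[i - 1].
// Only the first num_items - 1 input elements are ever read.
std::vector<int> exclusive_sum(const int* input, std::size_t num_items,
                               int init) {
  // With at most one item nothing is read, so input may be null.
  assert(input != nullptr || num_items <= 1);
  std::vector<int> output(num_items);
  int running = init;
  for (std::size_t i = 0; i < num_items; ++i) {
    output[i] = running;
    if (i + 1 < num_items) {
      running += input[i];  // skipped for the last item: its value is unused
    }
  }
  return output;
}
```

Calling `exclusive_sum` with `num_items = 4` on a buffer that holds only three values `{1, 2, 3}` is therefore safe in this sketch and produces `{0, 1, 3, 6}`.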

type: enhancement

P2: nice to have

helps: rapids

WARP_TIME_SLICING isn't supported in ScatterToStripedGuarded and ScatterToStripedFlagged

`BlockExchange` provides a template parameter, `WARP_TIME_SLICING`, which reduces the shared memory footprint. Most of the algorithms in `BlockExchange` have specializations for different `WARP_TIME_SLICING` values. But it isn't the case for...

type: bug: functional

P2: nice to have