Georgii Evtushenko

Results 54 issues of Georgii Evtushenko

Device-scope segmented reduce contains the following short-circuit: ```cpp if (num_segments

type: bug: functional
P1: should have

The following code leads to excessive memory footprint when compiled with `-rdc=true`: ```cpp #include #include template class cub::DispatchSegmentedSort; int main() { size_t free_byte{}, total_byte{}; double used{}; cudaMemGetInfo(&free_byte, &total_byte); used =...

type: enhancement
P1: should have

There's a blind spot in the Thrust/CUB in-place execution guarantees that I believe should be addressed. Thrust/CUB allow iterators to point to the same memory, while there's no restriction on...

type: enhancement
P1: should have
area: docs

`cub::DeviceSegmentedRadixSort` guarantees are given in an indirect way, which confuses our users. Basically, we are saying that `cub::DeviceSegmentedRadixSort` shares its implementation with `cub::DeviceRadixSort` instead of providing a list of guarantees...

type: enhancement
P1: should have
area: docs

While addressing the following [issue](https://github.com/NVIDIA/cub/pull/514), I realized that we haven't implemented tests with `__CUDA_NO_HALF_CONVERSIONS__`. The original motivation is described [here](https://github.com/NVIDIA/cub/issues/394).

type: enhancement
P1: should have
area: tests

Currently, scan is organized as follows: ```cpp int max_dim_x; if (CubDebug(error = cudaDeviceGetAttribute(&max_dim_x, cudaDevAttrMaxGridDimX, device_ordinal))) break; // Run grids in epochs (in case number of tiles exceeds max x-dimension int...

type: enhancement
P2: nice to have
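
The "epochs" mentioned in the scan snippet above can be illustrated on the host alone. What follows is a minimal sketch, not CUB's code: `Epoch` and `plan_epochs` are hypothetical names introduced here, and no kernel is launched. It only shows how a tile count larger than the maximum grid x-dimension can be split into several launches.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper type (not part of CUB): describes one kernel launch.
struct Epoch {
  std::int64_t start_tile; // index of the first tile covered by this launch
  int grid_size;           // number of thread blocks in this launch
};

// Splits num_tiles tiles into launches of at most max_dim_x blocks each.
// A 64-bit loop counter avoids overflow when num_tiles is close to INT_MAX.
std::vector<Epoch> plan_epochs(int num_tiles, int max_dim_x) {
  assert(num_tiles >= 0 && max_dim_x > 0);
  std::vector<Epoch> epochs;
  for (std::int64_t start = 0; start < num_tiles; start += max_dim_x) {
    const std::int64_t remaining = num_tiles - start;
    const int grid_size =
        static_cast<int>(std::min<std::int64_t>(max_dim_x, remaining));
    epochs.push_back({start, grid_size});
  }
  return epochs;
}
```

For example, `plan_epochs(10, 4)` yields three launches covering tiles 0-3, 4-7, and 8-9. In real code, `max_dim_x` would come from the `cudaDeviceGetAttribute` query shown in the snippet.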

I've found a corner case in `cub::DeviceSegmentedSort` that can be optimized. If input data contains a lot of unit segments, the performance isn't optimal since we have to read...

type: enhancement
P2: nice to have
area: performance

During the development of the [new segmented sort](https://github.com/NVIDIA/cub/pull/357), I extracted an `AgentSegmentedRadixSort` class. It's mostly based on the existing `DeviceSegmentedRadixSort` implementation. The only differences are: 1) `while (current_bit < end_bit)`...

type: enhancement
P2: nice to have
area: performance

The `AgentScan` structure doesn't distinguish between inclusive and exclusive variants at the load stage. Therefore, the current version of `cub::DeviceScan::ExclusiveScan` requires `num_items` items in the `input` data. According to the...

type: enhancement
P2: nice to have
helps: rapids
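
The `ExclusiveScan` snippet above is truncated, so the following is only an illustration of the general definition of an exclusive scan, not of CUB's implementation or of the issue's proposal. By definition, output `i` combines the initial value with inputs `0 .. i - 1`, so producing `num_items` outputs never reads the last input element. `exclusive_sum_reference` is a hypothetical plain C++ reference written for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference implementation (plain C++, not CUB).
// Produces num_items outputs while reading only input[0 .. num_items - 2].
std::vector<int> exclusive_sum_reference(const std::vector<int>& input,
                                         std::size_t num_items, int init) {
  // Only num_items - 1 input elements are required.
  assert(num_items == 0 || input.size() + 1 >= num_items);
  std::vector<int> output(num_items);
  int running = init;
  for (std::size_t i = 0; i < num_items; ++i) {
    output[i] = running;  // sum of init and all inputs before position i
    if (i + 1 < num_items) {
      running += input[i];  // the last input is never read
    }
  }
  return output;
}
```

Calling it with three inputs `{1, 2, 3}`, `num_items = 4`, and `init = 0` produces the four outputs `{0, 1, 3, 6}`.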

`BlockExchange` provides a `WARP_TIME_SLICING` template parameter, which reduces the shared memory footprint. Most of the algorithms in `BlockExchange` have specializations for different `WARP_TIME_SLICING` values. But it isn't the case for...

type: bug: functional
P2: nice to have