Elias Stehle

Results 23 issues of Elias Stehle

Simply instantiating the kernel template with different offset types unfortunately degrades performance by as much as 1.66-fold. Alternative approaches need further investigation, such as streaming, a bit-packed tile state,...

## Description Closes https://github.com/NVIDIA/cccl/issues/2458 `ScanByKey` used to have the tile state comprising (1) the accumulated value and (2) the `OffsetT`. The `OffsetT` part is used by `ReduceByKey` to figure the...
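To make the two-part tile state described above concrete, here is a minimal host-side C++ sketch. The names `TileStatePair` and `tile_state_bytes` are hypothetical and are not CUB's actual types; the sketch only shows how the footprint of a (value, offset) pair changes with the width of the offset type. It makes no claim about what the PR changes or about the cause of any slowdown.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a tile state that pairs an accumulated value
// with an offset. Illustrative only; not CUB's real tile-state type.
template <typename ValueT, typename OffsetT>
struct TileStatePair
{
  ValueT value;   // (1) the accumulated value
  OffsetT offset; // (2) the offset component
};

// Size in bytes of the hypothetical pair, including any alignment padding.
template <typename ValueT, typename OffsetT>
constexpr std::size_t tile_state_bytes()
{
  return sizeof(TileStatePair<ValueT, OffsetT>);
}

// Widening the offset from 32 to 64 bits enlarges the pair.
static_assert(tile_state_bytes<std::int32_t, std::int64_t>()
                > tile_state_bytes<std::int32_t, std::int32_t>(),
              "a wider offset type enlarges the tile state");
```

On common 64-bit platforms the pair with a 32-bit value and a 32-bit offset occupies 8 bytes, while the same value with a 64-bit offset typically occupies 16 bytes because of alignment padding.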

Simply instantiating the `scan_by_key` kernel template with a 64-bit offset type turns out to incur a noticeable performance penalty of roughly 35% (a 1.5-fold slowdown). Hence, we want to...