
Add support for large `num_items` to `DeviceScan::*ByKey`

Open elstehle opened this issue 5 months ago • 0 comments

Simply instantiating the `scan_by_key` kernel template with a 64-bit offset type incurs a noticeable performance penalty of about -35% (roughly a 1.5-fold slowdown). Hence, we also want to consider alternative approaches, such as the streaming approach or the bit-packed tile state.
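To make the streaming idea concrete, here is a minimal host-only sketch, not the CUB implementation. It assumes the streaming approach means splitting the input into chunks that each fit a 32-bit offset and carrying the last key and its running aggregate from one chunk into the next. The names `Carry`, `scan_chunk`, and `streaming_scan_by_key` are hypothetical, and `scan_chunk` merely stands in for a single kernel launch that keeps using the 32-bit offset type.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical carry state handed from one chunk to the next (not a CUB type).
struct Carry {
  bool valid = false;       // false until the first item has been processed
  int key = 0;              // key of the last item seen so far
  long long aggregate = 0;  // running sum of the segment that key belongs to
};

// Stand-in for one kernel launch with a 32-bit offset type: inclusive
// sum-scan by key over `count` items, seeded with the previous chunk's carry.
inline void scan_chunk(const int* keys, const long long* in, long long* out,
                       std::int32_t count, Carry& carry) {
  for (std::int32_t i = 0; i < count; ++i) {
    // A segment continues if the key matches the last key seen, even when
    // that key was the final item of the previous chunk.
    const bool continues = carry.valid && carry.key == keys[i];
    carry.aggregate = continues ? carry.aggregate + in[i] : in[i];
    carry.key = keys[i];
    carry.valid = true;
    out[i] = carry.aggregate;
  }
}

// Assumed streaming driver: the total item count is tracked in 64 bits,
// while every individual chunk is indexed with 32-bit offsets only.
inline std::vector<long long> streaming_scan_by_key(
    const std::vector<int>& keys, const std::vector<long long>& values,
    std::int32_t chunk_size) {
  assert(chunk_size > 0 && keys.size() == values.size());
  const auto num_items = static_cast<std::int64_t>(keys.size());
  std::vector<long long> out(values.size());
  Carry carry;
  for (std::int64_t base = 0; base < num_items; base += chunk_size) {
    const auto count = static_cast<std::int32_t>(
        std::min<std::int64_t>(chunk_size, num_items - base));
    scan_chunk(keys.data() + base, values.data() + base, out.data() + base,
               count, carry);
  }
  return out;
}
```

In this sketch the chunk size only affects how the work is partitioned: scanning with a tiny chunk size gives the same result as scanning everything in one chunk, which is the property a device-side streaming variant would also need to preserve across kernel launches.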

Overview of the performance differences for various offset types compared to the currently used i32 offset type:

| | Diff u32 vs i32, any num items | Diff u32 vs i32, 2^28 num items | Diff i64 vs i32, any num items | Diff i64 vs i32, 2^28 num items | Diff u64 vs i32, any num items | Diff u64 vs i32, 2^28 num items |
|---|---|---|---|---|---|---|
| min | 89.29% | 89.29% | 97.09% | 99.94% | 96.34% | 100.00% |
| max | 106.35% | 102.38% | 135.51% | 131.50% | 134.84% | 131.56% |
| avg | 100.06% | 99.79% | 107.28% | 108.38% | 107.31% | 108.42% |

elstehle, Sep 25 '24 16:09