Add support for large `num_items` to `DeviceScan::*ByKey`
Simply instantiating the `scan_by_key` kernel template with a 64-bit offset type turns out to incur a noticeable performance penalty of about -35% (or a 1.5-fold slowdown). Hence, we also want to consider alternative approaches, such as the streaming approach or the bit-packed tile state.
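To illustrate the streaming idea mentioned above, here is a minimal host-only C++ sketch, not CUB's implementation. It splits the input into chunks that are each small enough to index with a 32-bit offset and carries the last key and running aggregate from one chunk into the next. The names `scan_chunk_by_key` and `streaming_inclusive_sum_by_key` and the `max_chunk` parameter are hypothetical and exist only for this example; in a real implementation each chunk would presumably be handled by a kernel launch rather than a host loop.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: inclusive sum-by-key over a single chunk, indexed with a
// 32-bit offset. The carry_* arguments describe the last element processed by
// the previous chunk, so a key segment can continue across chunk boundaries.
void scan_chunk_by_key(const int* keys, const int* in, int* out, std::int32_t n,
                       bool& carry_valid, int& carry_key, int& carry_value)
{
  for (std::int32_t i = 0; i < n; ++i)
  {
    // Continue the running sum only if this key matches the previous key.
    const bool same_segment = carry_valid && keys[i] == carry_key;
    carry_value = same_segment ? carry_value + in[i] : in[i];
    carry_key   = keys[i];
    carry_valid = true;
    out[i]      = carry_value;
  }
}

// Hypothetical driver: walks the full input with a 64-bit-capable offset
// (std::size_t), while each chunk only ever sees 32-bit offsets.
std::vector<int> streaming_inclusive_sum_by_key(const std::vector<int>& keys,
                                                const std::vector<int>& values,
                                                std::int32_t max_chunk)
{
  assert(keys.size() == values.size());
  assert(max_chunk > 0);

  std::vector<int> out(values.size());
  bool carry_valid = false;
  int carry_key    = 0;
  int carry_value  = 0;

  const std::size_t chunk = static_cast<std::size_t>(max_chunk);
  for (std::size_t offset = 0; offset < keys.size(); offset += chunk)
  {
    const std::size_t remaining = keys.size() - offset;
    const auto n = static_cast<std::int32_t>(std::min(chunk, remaining));
    scan_chunk_by_key(keys.data() + offset, values.data() + offset,
                      out.data() + offset, n,
                      carry_valid, carry_key, carry_value);
  }
  return out;
}
```

The result should be independent of the chunk size, since the carried key and aggregate make the chunk boundaries invisible to the scan. The trade-off in this sketch is one extra piece of state per chunk boundary in exchange for keeping 32-bit offsets inside each chunk.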
Overview of performance differences for various offset types compared to the currently used `i32` offset type.
Statistic | Diff u32 vs i32, any num_items | Diff u32 vs i32, 2^28 num_items | Diff i64 vs i32, any num_items | Diff i64 vs i32, 2^28 num_items | Diff u64 vs i32, any num_items | Diff u64 vs i32, 2^28 num_items |
---|---|---|---|---|---|---|
min | 89.29% | 89.29% | 97.09% | 99.94% | 96.34% | 100.00% |
max | 106.35% | 102.38% | 135.51% | 131.50% | 134.84% | 131.56% |
avg | 100.06% | 99.79% | 107.28% | 108.38% | 107.31% | 108.42% |