Try to mitigate performance degradation when moving from 32- to 64-bit offset types when using bit-packed tile states in decoupled look-back
In https://github.com/NVIDIA/cccl/issues/2055, we experimented with using bit-packed tile states for algorithms that need to carry the offset type through the decoupled look-back.
While the overall performance for 64-bit offset types improved when using bit-packed tile states compared to using regular tile states, performance of 64-bit offset types still lags a good bit behind that of 32-bit offset types.
We want to investigate where the remaining performance degradation comes from. One possibility to mitigate that performance degradation is to use two different offset types within the relevant algorithms: (1) one that is used for indexing items within a tile and (2) one that is used for indexing within global memory.
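To make the two-offset-type idea concrete, here is a minimal host-side sketch. It is not CUB's implementation: `tile_offset_t`, `global_offset_t`, `tile_size`, `count_in_tile`, and `tile_output_offsets` are hypothetical names chosen for illustration, and a sequential loop stands in for the decoupled look-back. The point is only that per-tile counts and indices are bounded by the tile size and so fit in 32 bits, while the running prefix carried across tiles needs the wide type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical type names for illustration; not part of CUB's API.
using tile_offset_t   = std::uint32_t; // (1) indexes items within a single tile
using global_offset_t = std::uint64_t; // (2) indexes items in global memory

// Illustrative tile size (assumption); any value that fits tile_offset_t works.
constexpr tile_offset_t tile_size = 1024;

// Counts the items of one tile that satisfy pred.
// The result can never exceed tile_size, so the narrow type is sufficient.
template <typename Pred>
tile_offset_t count_in_tile(const int* tile_begin, tile_offset_t num_items_in_tile, Pred pred)
{
  assert(num_items_in_tile <= tile_size);
  tile_offset_t count = 0;
  for (tile_offset_t i = 0; i < num_items_in_tile; ++i)
  {
    count += pred(tile_begin[i]) ? 1u : 0u;
  }
  return count;
}

// Sequential stand-in for the decoupled look-back: computes, for every tile,
// the exclusive prefix of selected-item counts. Only this cross-tile prefix
// and the tile base address use the wide type.
template <typename Pred>
std::vector<global_offset_t> tile_output_offsets(const std::vector<int>& input, Pred pred)
{
  const global_offset_t num_items = input.size();
  const global_offset_t num_tiles = (num_items + tile_size - 1) / tile_size;

  std::vector<global_offset_t> offsets(static_cast<std::size_t>(num_tiles));
  global_offset_t running = 0;
  for (global_offset_t tile = 0; tile < num_tiles; ++tile)
  {
    const global_offset_t tile_base = tile * tile_size; // wide: global index
    const global_offset_t remaining = num_items - tile_base;
    const tile_offset_t items_in_tile =
      static_cast<tile_offset_t>(remaining < tile_size ? remaining : tile_size);

    offsets[static_cast<std::size_t>(tile)] = running;
    // Narrow per-tile count is widened only when added to the global prefix.
    running += count_in_tile(input.data() + tile_base, items_in_tile, pred);
  }
  return offsets;
}
```

Whether this split recovers the gap to 32-bit offsets is what remains to be measured; the value carried through the look-back is still 64 bits wide in this sketch, so any benefit would have to come from the tile-local arithmetic.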