[FEA]: Improve and cleanup `ThreadLoad`
Is this a duplicate?
- [x] I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct
Area
CUB
Is your feature request related to a problem? Please describe.
`ThreadLoad` is a fundamental utility in CUB. However, it lacks some key features, especially on recent GPU architectures.
Describe the solution you'd like
Feature list:
- Cache eviction policies on recent GPU architectures (SM70+) are missing
- Cache prefetch, e.g. `.L2::256B`, is not handled
- Non-trivially copyable data types are handled incorrectly
- We don't support 32-bit platforms anymore. The following macros are no longer needed: `_CUB_ASM_PTR_`, `__CUB_LP64__`, `_CUB_ASM_PTR_SIZE_`
- Some template specializations of `ThreadLoad` may not be used, e.g. `ushort4`, adding overhead to the compilation
- Add assertions for `nullptr` accesses and validate the correct memory space, i.e. `global`
- Expose in the `cub::` namespace
- Add them to public documentation
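
To make two of the items above concrete (rejecting non-trivially copyable types and asserting on `nullptr` / wrong memory space), here is a minimal sketch. It is not CUB code: `checked_load_sketch` and `SKETCH_HOST_DEVICE` are hypothetical names used only for illustration, and the byte-wise copy stands in for whatever load instruction the real implementation would issue. The sketch is written to compile as plain C++; the CUDA-specific parts are guarded and are an assumption about how the checks could look on the device.

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>

// Hypothetical annotation macro: expands to nothing in plain C++,
// and to __host__ __device__ when compiled by a CUDA compiler.
#ifdef __CUDACC__
#  define SKETCH_HOST_DEVICE __host__ __device__
#else
#  define SKETCH_HOST_DEVICE
#endif

// Hypothetical helper, NOT part of CUB: loads a value through a raw byte copy
// after validating the type and the pointer.
template <typename T>
SKETCH_HOST_DEVICE T checked_load_sketch(const T* ptr)
{
  // A byte-wise (or word-wise) load is only well-defined for trivially
  // copyable types, so other types are rejected at compile time.
  static_assert(std::is_trivially_copyable<T>::value,
                "checked_load_sketch requires a trivially copyable type");
  // Limitation of this sketch: the destination object is default-constructed.
  static_assert(std::is_default_constructible<T>::value,
                "checked_load_sketch requires a default-constructible type");

  // Catch null accesses in debug builds.
  assert(ptr != nullptr);

#ifdef __CUDA_ARCH__
  // Device-only check (assumption about the intended validation):
  // __isGlobal() reports whether a generic pointer refers to global memory.
  assert(__isGlobal(ptr));
#endif

  T value;
  std::memcpy(&value, ptr, sizeof(T)); // stands in for the actual load
  return value;
}
```

Calling `checked_load_sketch(&x)` on an `int x = 42` returns 42, while instantiating it with a type such as `std::string` fails to compile because of the first `static_assert`.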
Describe alternatives you've considered
No response
Additional context
No response