[FEA]: Improve and cleanup `ThreadLoad`

Open fbusato opened this issue 4 months ago • 0 comments

Is this a duplicate?

  • [x] I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct

Area

CUB

Is your feature request related to a problem? Please describe.

ThreadLoad is a fundamental utility in CUB. However, it lacks some key features, especially for recent GPU architectures.

Describe the solution you'd like

Feature list:

  • Cache eviction policies on recent GPU architectures (SM70+) are missing
  • Cache prefetch, e.g. .L2::256B, is not handled
  • Non-trivially copyable data types are handled incorrectly
  • 32-bit platforms are no longer supported, so the following macros are no longer needed: `_CUB_ASM_PTR_`, `__CUB_LP64__`, `_CUB_ASM_PTR_SIZE_`
  • Some template specializations of ThreadLoad, e.g. for ushort4, may never be used and only add compilation overhead
  • Add assertions for nullptr accesses and validate the correct memory space, i.e. global
  • Expose in the cub:: namespace
  • Add them to public documentation
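
As a rough illustration of two items above (rejecting non-trivially copyable types and asserting on nullptr), here is a minimal host-side C++ sketch. `checked_load` and `Pixel` are hypothetical names for this example and are not part of CUB. A real device implementation would issue the load through PTX with the requested cache modifier and would also need to validate the memory space, which this sketch does not attempt.

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>

// Hypothetical helper (NOT a CUB API): a host-side analogue of a checked load.
// Assumption: T is default constructible, so a local T can receive the bytes.
template <typename T>
T checked_load(const T* ptr)
{
  // A bytewise load is only well defined for trivially copyable types,
  // so reject everything else at compile time.
  static_assert(std::is_trivially_copyable<T>::value,
                "checked_load requires a trivially copyable type");

  // Debug-time check for null pointer accesses.
  assert(ptr != nullptr && "checked_load: null pointer");

  T value;
  std::memcpy(&value, ptr, sizeof(T)); // copy the object representation
  return value;
}

// Hypothetical four-component type, similar in layout to ushort4.
struct Pixel
{
  unsigned short r, g, b, a;
};
```

With this sketch, `checked_load(&pixel)` compiles for `Pixel`, while instantiating it with a type such as `std::string` fails at compile time instead of silently copying raw bytes.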

Describe alternatives you've considered

No response

Additional context

No response

fbusato, Sep 30 '24 19:09