
[FEA]: Migrate cuCollections data structures to CCCL

Open PointKernel opened this issue 8 months ago • 6 comments

Is this a duplicate?

  • [x] I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct

Area

CUDA Experimental (cudax)

Is your feature request related to a problem? Please describe.

https://github.com/NVIDIA/cuCollections provides concurrent data structures for GPUs, including hash tables, Bloom filters, and HyperLogLog. These components would be a better fit within the CCCL library.

Describe the solution you'd like

Migrating the entire cuCollections repository in one step would be too bulky. Instead, we could split the migration into independent parts:

  • Utilities like hash functions and fast_int
  • Bloom filter
  • HyperLogLog
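For context on the Bloom filter piece, here is a minimal host-side sketch of the idea (illustrative only, not the cuco implementation): k hash probes into an m-bit array, with possible false positives but no false negatives.

```cpp
#include <bitset>
#include <cstddef>
#include <functional>
#include <string>

// Illustrative Bloom filter: k hash probes into an m-bit array.
// A query may report a false positive, never a false negative.
struct bloom_filter {
  static constexpr std::size_t m = 1024;  // number of bits
  static constexpr std::size_t k = 3;     // number of hash probes
  std::bitset<m> bits;

  // Derive k probe positions by mixing a base hash with the probe index.
  static std::size_t probe(const std::string& key, std::size_t i) {
    std::size_t h = std::hash<std::string>{}(key);
    return (h ^ (0x9e3779b97f4a7c15ULL * (i + 1))) % m;
  }

  void insert(const std::string& key) {
    for (std::size_t i = 0; i < k; ++i) bits.set(probe(key, i));
  }

  bool may_contain(const std::string& key) const {
    for (std::size_t i = 0; i < k; ++i)
      if (!bits.test(probe(key, i))) return false;
    return true;
  }
};
```

The GPU version would replace the bitset with device memory and the set/test calls with atomic bit operations, but the probing scheme is the same shape.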

For hash tables, cuCollections primarily supports four fixed-size variants that aim to be as standard-conforming as possible:

  • static_set
  • static_map
  • static_multiset
  • static_multimap

All these static hash tables share a common open addressing base class. Additionally, cuCollections includes a dynamic_map, which extends static_map by allowing resizing when a certain occupancy threshold is reached. Since this data structure has only one known user, we can defer discussions on its migration until later.
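To illustrate the shared open-addressing core: all four containers probe a fixed-capacity slot array and differ mainly in payload and duplicate handling. A simplified host-side sketch of linear probing with a sentinel empty key (not the cuco code, just the concept):

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Simplified open-addressing set: linear probing over a fixed-capacity
// slot array, with a sentinel value marking empty slots (as the cuco
// static_* containers do). The GPU version claims slots via atomic CAS.
class static_set_sketch {
  static constexpr std::int32_t empty_sentinel = -1;
  std::vector<std::int32_t> slots_;

public:
  explicit static_set_sketch(std::size_t capacity)
      : slots_(capacity, empty_sentinel) {}

  bool insert(std::int32_t key) {
    std::size_t start = static_cast<std::size_t>(key) % slots_.size();
    for (std::size_t n = 0; n < slots_.size(); ++n) {
      std::size_t idx = (start + n) % slots_.size();
      if (slots_[idx] == key) return false;   // already present
      if (slots_[idx] == empty_sentinel) {    // claim the empty slot
        slots_[idx] = key;                    // (atomic CAS on the GPU)
        return true;
      }
    }
    return false;  // table full
  }

  bool contains(std::int32_t key) const {
    std::size_t start = static_cast<std::size_t>(key) % slots_.size();
    for (std::size_t n = 0; n < slots_.size(); ++n) {
      std::size_t idx = (start + n) % slots_.size();
      if (slots_[idx] == key) return true;
      if (slots_[idx] == empty_sentinel) return false;
    }
    return false;
  }
};
```

static_map would store a key/value pair per slot, and the multiset/multimap variants would keep probing past a match instead of rejecting duplicates.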

Proposed Action Items

  • [ ] Migrate hash functions to CCCL: murmurhash, xxhash, (identity hash ?)
  • [ ] Migrate other utilities like fast_int, pair, extent, storage to CCCL
  • [ ] Migrate Bloom filter to CCCL
  • [ ] Migrate HyperLogLog to CCCL
  • [ ] Migrate hash tables
    • [ ] Open addressing base class & static_set
    • [ ] static_map
    • [ ] static_multiset
    • [ ] static_multimap
    • [ ] Evaluate the need to migrate dynamic_map

Additional context

  • Namespace: For now, all cuCollections components should be placed under the experimental cudax namespace.
  • C++ Standard Compatibility: cuCollections is currently a C++17 library. If compatibility with older standards like C++14 or newer ones like C++20 is desired, additional work will be required.
  • Clang Compatibility: It is unclear whether the current cuCollections implementation is fully compatible with Clang.

PointKernel avatar Apr 01 '25 21:04 PointKernel

From my point of view we should first consider where cuCollections should live.

Given the general plan to unify CCCL more, I would suggest moving it into libcu++. That is especially true for more or less foundational features like hashes. I would need to look into the actual code to see whether we can just drop pair and use cuda::std::pair.

Regarding the containers, the question is whether we want to reuse memory resources from cudax or just drop them in as is. If we just want to use them unmodified, I would also say <cuda/> would be the right place.

miscco avatar Apr 02 '25 06:04 miscco

@miscco Thanks for the comments! I'm open to whatever the CCCL team thinks is best. I'm just providing additional context to help facilitate the decision-making process.

For reference, cuco data structures aim to be as standard-conforming as possible while also offering better flexibility for performance tuning and GPU optimization. As a result, they are not direct drop-in replacements for their std:: equivalents.

Regarding some of the points you raised:

  • Hash Functions: Hashing is a fundamental feature, and cuco provides 32-bit, 64-bit, and 128-bit hashers along with their variants, such as xxhash_32 and murmurhash3_x86_32. In contrast, STL hash functions are limited to 64-bit.
  • Pair Structure: I'm not sure about the current state of cuda::std::pair, but historically, the key difference between cuco::pair and thrust::pair was that cuco::pair is always aligned. This alignment is necessary for atomic operations during hash table insertion and queries.
  • Allocator: The current cuco allocator is a simple wrapper around cudaMalloc and cudaFree. Using cudax memory resource would likely be a more suitable approach.
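To illustrate the alignment point: a pair whose size equals its alignment can be read and written as a single machine word, so two 32-bit fields can be updated with one atomic compare-and-swap during insertion. A hedged sketch of the idea (cuco::pair's actual definition may differ):

```cpp
#include <atomic>
#include <cstdint>
#include <type_traits>

// A pair over-aligned to twice its larger member, so a 2x32-bit
// key/value fits in one 64-bit word. This enables single-CAS
// insertion into a hash table slot; std::pair makes no such
// alignment guarantee.
template <class K, class V>
struct alignas(2 * (sizeof(K) > sizeof(V) ? sizeof(K) : sizeof(V)))
aligned_pair {
  K first;
  V second;
};

using slot = aligned_pair<std::int32_t, std::int32_t>;
static_assert(sizeof(slot) == 8 && alignof(slot) == 8);
static_assert(std::is_trivially_copyable_v<slot>);
// std::atomic<slot> can therefore use lock-free 64-bit operations
// on typical 64-bit targets.
```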

PointKernel avatar Apr 02 '25 18:04 PointKernel

A few comments:

  • please note that std::pair/cuda::std::pair is not even trivially copyable. @miscco maybe we can think about a cuda::pair version focused on performance.
  • "fast_int" is also tracked here https://github.com/NVIDIA/cccl/issues/3442
  • Is extent equivalent to std::extent?
  • I agree with Michael about hash functions. I would expect simple APIs that can be broadly accepted and used across the ecosystem.

fbusato avatar Apr 02 '25 18:04 fbusato

Is extent equivalent to std::extent?

No, cuco::extent is a 1D size wrapper for both dynamic and statically sized capacities. e.g. https://godbolt.org/z/v34nsbW4c
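In other words, it plays a role similar to the extent parameter of std::span: a size known either at compile time (costing no storage) or at run time. A hypothetical sketch of the idea (names are illustrative, not the cuco API):

```cpp
#include <cstddef>

// Sentinel meaning "size chosen at run time".
inline constexpr std::size_t dyn_extent = static_cast<std::size_t>(-1);

// Statically sized: the capacity is baked into the type and costs
// no per-object storage.
template <class SizeType, std::size_t N = dyn_extent>
struct extent {
  static constexpr SizeType value() noexcept {
    return static_cast<SizeType>(N);
  }
};

// Dynamically sized specialization: the capacity is a runtime member.
template <class SizeType>
struct extent<SizeType, dyn_extent> {
  SizeType size_;
  constexpr SizeType value() const noexcept { return size_; }
};
```

A container templated on such a wrapper can use one code path for both cases while letting the compiler constant-fold the static one.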

PointKernel avatar Apr 02 '25 19:04 PointKernel

  • please note that std::pair/cuda::std::pair is not even trivially copyable. @miscco maybe we can think about a cuda::pair version focused on performance.

cuda::std::pair is trivially copyable; alignment is an issue.

miscco avatar Apr 03 '25 05:04 miscco

I would suggest we start with our cuco hashers and find a place that feels right for them in CCCL.

We have a whole zoo of hashers here. If we want to put them under the cuda:: namespace, I suggest the following to avoid flooding the namespace with a bunch of names:

namespace cuda {
template <class T, hash_strategy Strategy = hash_strategy::WHATEVER_STL_IS_USING>
struct hash {…};
}

and then let hash_strategy be an enum class type that enumerates the hashers we have in our zoo. This would also make the interface quite close to the std::hash counterpart.
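A minimal sketch of that enum-based dispatch, with placeholder strategy names and a placeholder default (shown here for 32-bit integer keys only):

```cpp
#include <cstdint>

// Hypothetical strategy tags; the real list would enumerate the
// hashers in the cuco zoo.
enum class hash_strategy { identity, murmurhash3, xxhash };

// Primary template, specialized per (type, strategy) combination.
template <class T, hash_strategy Strategy = hash_strategy::murmurhash3>
struct hash;

template <>
struct hash<std::uint32_t, hash_strategy::identity> {
  constexpr std::uint32_t operator()(std::uint32_t k) const noexcept {
    return k;
  }
};

template <>
struct hash<std::uint32_t, hash_strategy::murmurhash3> {
  // MurmurHash3's 32-bit finalizer (fmix32).
  constexpr std::uint32_t operator()(std::uint32_t h) const noexcept {
    h ^= h >> 16;
    h *= 0x85ebca6b;
    h ^= h >> 13;
    h *= 0xc2b2ae35;
    h ^= h >> 16;
    return h;
  }
};
```

Callers then write hash<std::uint32_t>{} for the default or hash<std::uint32_t, hash_strategy::identity>{} to pick a specific one, mirroring how std::hash is spelled.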

This would be the easy route. However, if we want to make the strategy customizable by the user (i.e., passing their own implementation), then we probably need some additional concept magic.
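For the customizable route, the constraint a concept would express can already be sketched in C++17 with the detection idiom: "H is a hasher for Key if h(k) yields something convertible to std::size_t" (an illustrative constraint, not a CCCL API):

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

// Detection idiom standing in for a C++20 concept: true if a const H
// is invocable on a const Key and the result converts to std::size_t.
template <class H, class Key, class = void>
struct is_hasher_for : std::false_type {};

template <class H, class Key>
struct is_hasher_for<
    H, Key,
    std::enable_if_t<std::is_convertible_v<
        decltype(std::declval<H const&>()(std::declval<Key const&>())),
        std::size_t>>> : std::true_type {};

// A user-supplied hasher satisfying the constraint for int keys.
struct my_hasher {
  std::size_t operator()(int k) const noexcept {
    return static_cast<std::size_t>(k) * 2654435761u;  // Knuth multiplier
  }
};

static_assert(is_hasher_for<my_hasher, int>::value);
static_assert(!is_hasher_for<my_hasher, void*>::value);
```

A container could then static_assert (or, in C++20, require) this trait on a user-provided hasher parameter instead of restricting users to the built-in enum.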

What do you think?

sleeepyjack avatar Apr 03 '25 23:04 sleeepyjack