cuCollections
[FEA] Refactor of open address data structures
Is your feature request related to a problem? Please describe.
There is a significant amount of redundancy among the `static_map`/`static_multimap`/`static_reduction_map` classes. This creates a large maintenance overhead and means optimizations made to one data structure do not translate to the others.
Furthermore, there are several configuration options we'd like to enable, like using AoS vs. SoA layout, scalar vs. CG operations, etc.
I'd also like to enable adding `static_set` and `static_multiset` classes that could share the same backend.
Describe the solution you'd like
List of things I'd like to address:
- Use `atomic_ref` when possible (4B/8B key/value types) #183
  - Fall back on `atomic` when necessary (<4B, >8B key/value types)
  - For <4B, we should probably just widen them ourselves and still use `atomic_ref` instead of `atomic`.
- Eliminates redundancy among `static_map`/`static_reduction_map`/`static_multimap`
- Enables AoS vs SoA layout (#103)
- Enables statically sized device views
  - We should use a pattern like `std::span` with `std::dynamic_extent` to support both dynamic and statically sized capacities.
- Enables adding `static_set` and `static_multiset`
- Supports the various insert schemes: packed/back2back/cas+dependent write
- Switch between scalar/CG operations
- Stream support everywhere (https://github.com/NVIDIA/cuCollections/issues/65)
- Consistent use of `bitwise_equal`
- Asynchronous size computation (#102, #237)
- Rehashing (https://github.com/NVIDIA/cuCollections/issues/21)
My current thinking is to create an `open_address_impl` class that provides an abstraction for a logical array of "slots" and exposes operations on those slots. All the core logic and switching for things like AoS/SoA and `atomic_ref`/`atomic` can/should be implemented in this common impl class.
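To illustrate the idea, here is a minimal host-side sketch of how such a shared impl class could switch between AoS and SoA storage via a layout policy. All names (`open_address_impl`, `aos_layout`, `soa_layout`) are hypothetical, not the actual cuCollections API, and atomics/device concerns are omitted:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical AoS layout: one array of pairs.
template <typename Key, typename Value>
struct aos_layout {
  std::vector<std::pair<Key, Value>> slots;
  explicit aos_layout(std::size_t n) : slots(n) {}
  Key& key(std::size_t i) { return slots[i].first; }
  Value& value(std::size_t i) { return slots[i].second; }
};

// Hypothetical SoA layout: separate key and value arrays.
template <typename Key, typename Value>
struct soa_layout {
  std::vector<Key> keys;
  std::vector<Value> values;
  explicit soa_layout(std::size_t n) : keys(n), values(n) {}
  Key& key(std::size_t i) { return keys[i]; }
  Value& value(std::size_t i) { return values[i]; }
};

// Common impl class: a logical array of slots. static_map/static_set/etc.
// would all be built on top of this and share the slot-access logic.
template <typename Key, typename Value,
          template <typename, typename> class Layout>
class open_address_impl {
  Layout<Key, Value> storage_;

 public:
  explicit open_address_impl(std::size_t capacity) : storage_(capacity) {}

  void set_slot(std::size_t i, Key k, Value v) {
    storage_.key(i)   = k;
    storage_.value(i) = v;
  }
  Key key_at(std::size_t i) { return storage_.key(i); }
};
```

Callers pick the layout at compile time, e.g. `open_address_impl<int, int, soa_layout>`, and everything above the layout policy stays identical.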
Something that came up with @PointKernel in the review of `static_multimap`, which we ultimately decided to defer, was enabling user configuration of "scalar" vs. "vector" loading. I think this should be exposed via the `ProbingScheme` type by adding an `items_per_thread` non-type template parameter that controls how many slots are loaded per thread.
I also want to be able to disable using the CG algorithms altogether via the `ProbingScheme`. You can currently set `CGSize == 1`, but that will be inefficient as it still uses the CG code paths instead of the scalar code paths. We can either specialize for `CGSize == 1` or add a special sentinel value for `CGSize` (like `std::numeric_limits<size_t>::max()`) that selects the non-CG code paths.
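The specialization alternative can be sketched with a compile-time branch: when `CGSize == 1`, dispatch to a scalar code path and never touch CG machinery. The names below (`probing_scheme`, `contains_dispatch`) are illustrative, not the actual cuCollections API:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical probing scheme carrying the CG size as a compile-time constant.
template <int CGSize>
struct probing_scheme {
  static constexpr int cg_size = CGSize;
};

// Instead of a sentinel value, simply treat cg_size == 1 as "scalar".
template <typename ProbingScheme>
constexpr bool uses_cooperative_path() {
  return ProbingScheme::cg_size > 1;
}

// Compile-time dispatch: the scalar path is selected without any CG overhead.
template <typename ProbingScheme>
const char* contains_dispatch() {
  if constexpr (uses_cooperative_path<ProbingScheme>()) {
    return "cooperative-groups path";
  } else {
    return "scalar path";  // no CG code is even instantiated here
  }
}
```

Because the branch is `if constexpr`, the unused path is discarded at compile time, so `CGSize == 1` genuinely avoids the CG code rather than running it with a group of one.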
Another improvement we should try: using one CG to insert/retrieve multiple keys instead of a single key to amortize CG overhead.
Status of the latest ideas we have discussed:
- To make it convenient to construct a `static_set` or other data structures with a variety of configuration parameters, we want to use a variadic constructor pattern similar to this example: https://godbolt.org/z/T4oP1nsoW
  - To make this more generic, instead of the `get_or_default<T>` function looking for a concrete type `T`, it could look for the first type that satisfies a given concept `C`, e.g., `get_or_default<C>`.
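A minimal sketch of the `get_or_default<T>` pattern, assuming simple strong-typed option structs (the option names `capacity`/`cg_size` and the `static_set_config` wrapper are hypothetical, not taken from the linked godbolt example):

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Pack exhausted: fall back to the default.
template <typename T>
T get_or_default(T default_value) {
  return default_value;
}

// Scan the pack for the first argument of type T.
template <typename T, typename Head, typename... Tail>
T get_or_default(T default_value, Head head, Tail... tail) {
  if constexpr (std::is_same_v<T, Head>) {
    return head;  // first match wins
  } else {
    return get_or_default<T>(default_value, tail...);
  }
}

// Hypothetical strong-typed configuration options.
struct capacity { std::size_t value; };
struct cg_size  { int value; };

// A constructor can then accept options in any order and any subset.
struct static_set_config {
  std::size_t cap;
  int cg;

  template <typename... Opts>
  explicit static_set_config(Opts... opts)
      : cap(get_or_default<capacity>(capacity{1024}, opts...).value),
        cg(get_or_default<cg_size>(cg_size{8}, opts...).value) {}
};
```

The concept-based generalization would replace the `std::is_same_v` test with a check that `Head` satisfies concept `C`.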
- We want distinct "ProbingScheme" and "Storage" concepts
  - Storage is the thing that manages the slots; a probing scheme tells you which slots to search for a given key.
- Storage
  - A conceptual list of `<key,payload>` elements that need not necessarily be contiguous in memory. For example, `vector<pair<key,value>>` or `vector<key>` and `vector<value>` (i.e., AoS vs SoA) could be valid implementations of `Storage`.
  - Open questions
    - Own the mechanism for atomically updating a given slot
      - Versions of this that return a bool and/or iterator
    - Enable querying if concurrent insert/find operations are possible
    - It should provide some mechanism for loading an immutable "window" of slots. The intention is to provide a standard way of using vector load/store operations.
- ProbingScheme
  - Given a key `k`, a `ProbingSequence` provides a sequence of `N` potentially non-contiguous locations (or values) `[i0, i1, i2, ..., iN, EMPTY_SENTINEL]` where, if `k` exists, it is present in `[i0, iN]`. Optionally, the sequence may be provided as a set of "windows" that partition the space into 1D windows of a fixed size `W`:
```
[ {i0, i1, ..., iW},
  {iW+1, iW+2, ..., i2W},
  {i2W+1, i2W+2, ..., i3W},
  ...
  {inW+1, inW+2, ..., i(n+1)W} // EMPTY_SENTINEL may appear anywhere in the last window
]
```
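A host-side sketch of the window idea, assuming linear probing for simplicity: the probing scheme hands out fixed-size windows of `W` contiguous slot indices, and each window could then be fetched with a single vector load. The `window_probing` name and its interface are illustrative, not the actual cuCollections API:

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical window-based probing scheme over `capacity` slots,
// where capacity is a multiple of W.
template <std::size_t W>
struct window_probing {
  std::size_t capacity;

  // The i-th window of the probe sequence for hash value h:
  // W contiguous indices, wrapping around the table.
  std::array<std::size_t, W> window(std::size_t h, std::size_t i) const {
    std::array<std::size_t, W> out{};
    std::size_t const num_windows = capacity / W;
    std::size_t const start = ((h / W + i) % num_windows) * W;
    for (std::size_t j = 0; j < W; ++j) out[j] = start + j;
    return out;
  }
};
```

Because every window is `W` contiguous indices aligned to a `W` boundary, a device implementation could map one window to one vectorized load (or one CG of `W` threads loading one slot each).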
Additional host interface functions:
- `is_present` (rename `contains`)
- `is_absent`
- `for_each`
- `clear`
- `reserve`
Experimental work in defining these entities:
- https://godbolt.org/z/9bosKoW4v (distinct "ProbingScheme" and "Storage" concepts)
- https://godbolt.org/z/Maxe7j1o4 (probe iterator)
- https://godbolt.org/z/557fjcvvE (equality wrapper)
Some open questions that still need answers:
- Where is the CG size/window size defined?
  - Is it specified by the `Storage`? The `ProbingScheme`?
- Different flavors of cooperative device-side overloads. E.g., `contains`:
  - N threads & 1 key: `N` threads in `group` call `contains` with the exact same `k`. All `N` threads cooperate to find the single `k`. The same result value is returned to all threads.
  - N threads & N keys: `N` threads in `group` call `contains`, each with potentially distinct values for `k` (`k0, k1, ..., kN`). All `N` threads cooperate to find each `ki`. The returned result is potentially unique to each thread `ti`, depending on the existence of `ki`.
> Different flavors of cooperative device-side overloads. E.g., `contains`
We could use different signatures for `contains`:

- `contains(CG g, key_type key, ..)`
- `contains(CG g, KeyIter first, KeyIter last, ..)`

The latter could implement a WCWS overload for when the key/value types can be shuffled around among group ranks. If the input range is wider than the group (notice this also implements the device bulk API idea we had), we would need to write the results to an output range instead of placing them as return values of the function.
I'm not sure this solves the problem. Would the semantics of this function be such that each thread in the group is providing a distinct [first,last) range? Or the same?
I think a contains function that takes an iterator range is orthogonal to the cooperative semantics of the function.
> Would the semantics of this function be such that each thread in the group is providing a distinct [first,last) range? Or the same?
I don't see the benefit of the former variant, as you can simply call the function in a loop. I was referring to the latter variant: A coop group is assigned to a range of keys, implementing a "mini" parallel bulk version of the operation. If the data types allow for warp shuffles, we can make use of WCWS, i.e., coalesced load from the input range then iteratively broadcast each loaded datum to all ranks and subsequently call the cooperative insert/contains.
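The WCWS-style loop described above can be sketched on the host, with plain loops standing in for warp shuffles and a set lookup standing in for the cooperative probe. `G`, `group_contains`, and the loop structure are all illustrative, not device code:

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

constexpr std::size_t G = 4;  // stand-in for the cooperative group size

// Each "rank" r coalesces-loads keys[base + r]; every loaded key is then
// broadcast in turn so all G ranks could cooperate on one probe for it.
std::vector<bool> group_contains(std::unordered_set<int> const& set,
                                 std::vector<int> const& keys) {
  std::vector<bool> results(keys.size());
  for (std::size_t base = 0; base < keys.size(); base += G) {
    // Coalesced load: rank r holds keys[base + r].
    for (std::size_t r = 0; r < G && base + r < keys.size(); ++r) {
      int const k = keys[base + r];  // broadcast k to all ranks (shfl on device)
      // On the device, all G ranks would now probe cooperatively for k;
      // the host stand-in is a single lookup.
      results[base + r] = set.count(k) > 0;
    }
  }
  return results;
}
```

The point of the structure is that every probe is still fully cooperative (all `G` ranks work on one key at a time), while the input loads stay coalesced across the group.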
I want APIs that don't require me to think about launching input_size * CG_size number of threads and wrangle the CG throughout the whole kernel.
I just want to launch input_size threads and then create an ad hoc CG to do a cooperative insert/find.
In other words, I want to keep the simple "one work item per thread" model of work assignment, but benefit from cooperative execution by threads donating themselves to carry out another thread's work item.