cuCollections
[FEA] Refactor of open address data structures
Is your feature request related to a problem? Please describe.
There is a significant amount of redundancy among the `static_map`/`static_multimap`/`static_reduction_map` classes. This creates a large maintenance overhead and means optimizations made to one data structure do not translate to the others.
Furthermore, there are several configuration options we'd like to enable, like using AoS vs. SoA layout, scalar vs. CG operations, etc.
I'd also like to enable adding `static_set` and `static_multiset` classes that could share the same backend.
Describe the solution you'd like
List of things I'd like to address:
- Use `atomic_ref` when possible (4B/8B key/value types) #183
  - Fall back on `atomic` when necessary (<4B, >8B key/value types)
  - For <4B, we should probably just widen them ourselves and still use `atomic_ref` instead of `atomic`.
- Eliminates redundancy among `static_map`/`static_reduction_map`/`static_multimap`
- Enables AoS vs SoA layout (#103)
- Enables statically sized device views
  - We should use a pattern like `std::span` with `std::dynamic_extent` to support both dynamic and statically sized capacities.
- Enables adding `static_set` and `static_multiset`
- Supports the various insert schemes: packed/back2back/cas+dependent write
- Switch between scalar/CG operations
- Stream support everywhere (https://github.com/NVIDIA/cuCollections/issues/65)
- Consistent use of `bitwise_equal`
- Asynchronous size computation (#102, #237)
- Rehashing (https://github.com/NVIDIA/cuCollections/issues/21)
My current thinking is to create an `open_address_impl` class that provides an abstraction for a logical array of "slots" and exposes operations on those slots. All the core logic and switching for things like AoS/SoA and `atomic_ref`/`atomic` can/should be implemented in this common impl class.
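To illustrate the idea, here is a minimal host-side sketch of how such a shared impl class could switch between AoS and SoA storage via a layout policy. All names (`open_address_impl`, `aos_layout`, `soa_layout`) are hypothetical, not the actual cuCollections API, and atomics/device concerns are omitted:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical AoS layout: one array of pairs.
template <typename Key, typename Value>
struct aos_layout {
  std::vector<std::pair<Key, Value>> slots;
  explicit aos_layout(std::size_t n) : slots(n) {}
  Key& key(std::size_t i) { return slots[i].first; }
  Value& value(std::size_t i) { return slots[i].second; }
};

// Hypothetical SoA layout: separate key and value arrays.
template <typename Key, typename Value>
struct soa_layout {
  std::vector<Key> keys;
  std::vector<Value> values;
  explicit soa_layout(std::size_t n) : keys(n), values(n) {}
  Key& key(std::size_t i) { return keys[i]; }
  Value& value(std::size_t i) { return values[i]; }
};

// Common impl class: a logical array of slots. static_map/static_set/etc.
// would all be built on top of this and share the slot-access logic.
template <typename Key, typename Value,
          template <typename, typename> class Layout>
class open_address_impl {
  Layout<Key, Value> storage_;

 public:
  explicit open_address_impl(std::size_t capacity) : storage_(capacity) {}

  void set_slot(std::size_t i, Key k, Value v) {
    storage_.key(i)   = k;
    storage_.value(i) = v;
  }
  Key key_at(std::size_t i) { return storage_.key(i); }
};
```

Callers pick the layout at compile time, e.g. `open_address_impl<int, int, soa_layout>`, and everything above the layout policy stays identical.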
Something that came up with @PointKernel in the review of `static_multimap`, which we ultimately decided to defer, was enabling user configuration of "scalar" vs. "vector" loading. I think this should be exposed via the `ProbingScheme` type by adding an `items_per_thread` non-type template parameter that controls how many slots are loaded per thread.
I also want to be able to disable using the CG algorithms altogether via the `ProbingScheme`. You can currently set `CGSize == 1`, but that will be inefficient as it still uses the CG code paths instead of the scalar code paths. We can either specialize for `CGSize == 1` or add a special sentinel value for `CGSize` (like `std::numeric_limits<size_t>::max()`) that selects the non-CG code paths.
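The specialization alternative can be sketched with a compile-time branch: when `CGSize == 1`, dispatch to a scalar code path and never touch CG machinery. The names below (`probing_scheme`, `contains_dispatch`) are illustrative, not the actual cuCollections API:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical probing scheme carrying the CG size as a compile-time constant.
template <int CGSize>
struct probing_scheme {
  static constexpr int cg_size = CGSize;
};

// Instead of a sentinel value, simply treat cg_size == 1 as "scalar".
template <typename ProbingScheme>
constexpr bool uses_cooperative_path() {
  return ProbingScheme::cg_size > 1;
}

// Compile-time dispatch: the scalar path is selected without any CG overhead.
template <typename ProbingScheme>
const char* contains_dispatch() {
  if constexpr (uses_cooperative_path<ProbingScheme>()) {
    return "cooperative-groups path";
  } else {
    return "scalar path";  // no CG code is even instantiated here
  }
}
```

Because the branch is `if constexpr`, the unused path is discarded at compile time, so `CGSize == 1` genuinely avoids the CG code rather than running it with a group of one.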
Another improvement we should try: using one CG to insert/retrieve multiple keys instead of a single key to amortize CG overhead.
Status of the latest ideas we have discussed:
- To make it convenient to construct a `static_set` or other data structures with a variety of configuration parameters, we want to use a variadic constructor pattern similar to this example: https://godbolt.org/z/T4oP1nsoW
  - To make this more generic, instead of the `get_or_default<T>` function looking for a concrete type `T`, it could look for the first type that satisfies a given concept `C`, e.g., `get_or_default<C>`.
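A minimal sketch of the `get_or_default<T>` pattern, assuming simple strong-typed option structs (the option names `capacity`/`cg_size` and the `static_set_config` wrapper are hypothetical, not taken from the linked godbolt example):

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Pack exhausted: fall back to the default.
template <typename T>
T get_or_default(T default_value) {
  return default_value;
}

// Scan the pack for the first argument of type T.
template <typename T, typename Head, typename... Tail>
T get_or_default(T default_value, Head head, Tail... tail) {
  if constexpr (std::is_same_v<T, Head>) {
    return head;  // first match wins
  } else {
    return get_or_default<T>(default_value, tail...);
  }
}

// Hypothetical strong-typed configuration options.
struct capacity { std::size_t value; };
struct cg_size  { int value; };

// A constructor can then accept options in any order and any subset.
struct static_set_config {
  std::size_t cap;
  int cg;

  template <typename... Opts>
  explicit static_set_config(Opts... opts)
      : cap(get_or_default<capacity>(capacity{1024}, opts...).value),
        cg(get_or_default<cg_size>(cg_size{8}, opts...).value) {}
};
```

The concept-based generalization would replace the `std::is_same_v` test with a check that `Head` satisfies concept `C`.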
- We want distinct "ProbingScheme" and "Storage" concepts
  - Storage is the thing that manages the slots; a probing scheme tells you which slots to search for a given key.
- Storage
  - A conceptual list of `<key,payload>` elements that need not necessarily be contiguous in memory. For example, `vector<pair<key,value>>` or `vector<key>` and `vector<value>` (i.e., AoS vs SoA) could be valid implementations of `Storage`.
  - Open questions
    - Own the mechanism for atomically updating a given slot
      - Versions of this that return a bool and/or iterator
    - Enable querying if concurrent insert/find operations are possible
    - It should provide some mechanism for loading an immutable "window" of slots. The intention is to provide a standard way of using vector load/store operations.
- ProbingScheme
  - Given a key `k`, a `ProbingSequence` provides a sequence of `N` potentially non-contiguous locations (or values) `[i0, i1, i2, ..., iN, EMPTY_SENTINEL]` where, if `k` exists, it is present in `[i0, iN]`. Optionally, the sequence may be provided as a set of "windows" that partition the space into 1D windows of a fixed size `W`:
```
[ {i0, i1, ..., iW},
  {iW+1, iW+2, ..., i2W},
  {i2W+1, i2W+2, ..., i3W},
  ...
  {inW+1, inW+2, ..., i(n+1)W} // EMPTY_SENTINEL may appear anywhere in the last window
]
```
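A host-side sketch of the window idea, assuming linear probing for simplicity: the probing scheme hands out fixed-size windows of `W` contiguous slot indices, and each window could then be fetched with a single vector load. The `window_probing` name and its interface are illustrative, not the actual cuCollections API:

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical window-based probing scheme over `capacity` slots,
// where capacity is a multiple of W.
template <std::size_t W>
struct window_probing {
  std::size_t capacity;

  // The i-th window of the probe sequence for hash value h:
  // W contiguous indices, wrapping around the table.
  std::array<std::size_t, W> window(std::size_t h, std::size_t i) const {
    std::array<std::size_t, W> out{};
    std::size_t const num_windows = capacity / W;
    std::size_t const start = ((h / W + i) % num_windows) * W;
    for (std::size_t j = 0; j < W; ++j) out[j] = start + j;
    return out;
  }
};
```

Because every window is `W` contiguous indices aligned to a `W` boundary, a device implementation could map one window to one vectorized load (or one CG of `W` threads loading one slot each).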
Additional host interface functions:
- `is_present` (rename `contains`)
- `is_absent`
- `for_each`
- `clear`
- `reserve`
Experimental work in defining these entities:
- https://godbolt.org/z/9bosKoW4v (distinct "ProbingScheme" and "Storage" concepts)
- https://godbolt.org/z/Maxe7j1o4 (probe iterator)
- https://godbolt.org/z/557fjcvvE (equality wrapper)
Some open questions that still need answers:
- Where is the CG size/window size defined?
  - Is it specified by the `Storage`? The `ProbingScheme`?
- Different flavors of cooperative device-side overloads. E.g., `contains`:
  - N threads & 1 key: `N` threads in `group` call `contains` with the exact same `k`. All `N` threads cooperate to find the single `k`. The same result value is returned to all threads.
  - N threads & N keys: `N` threads in `group` call `contains`, each with potentially distinct values for `k` (`k0, k1, ..., kN`). All `N` threads cooperate to find each `ki`. The returned result is potentially unique to each thread `ti`, depending on the existence of `ki`.
> Different flavors of cooperative device-side overloads. E.g., `contains`
We could use different signatures for `contains`:

- `contains(CG g, key_type key, ..)`
- `contains(CG g, KeyIter first, KeyIter last, ..)`

The latter could implement a WCWS overload for when the key/value types can be shuffled around among group ranks. If the input range is wider than the group (notice this also implements the device bulk API idea we had), we would need to write the results to an output range instead of placing them as return values of the function.
I'm not sure this solves the problem. Would the semantics of this function be such that each thread in the group is providing a distinct [first,last) range? Or the same?
I think a contains function that takes an iterator range is orthogonal to the cooperative semantics of the function.
> Would the semantics of this function be such that each thread in the group is providing a distinct [first,last) range? Or the same?
I don't see the benefit of the former variant, as you can simply call the function in a loop. I was referring to the latter variant: A coop group is assigned to a range of keys, implementing a "mini" parallel bulk version of the operation. If the data types allow for warp shuffles, we can make use of WCWS, i.e., coalesced load from the input range then iteratively broadcast each loaded datum to all ranks and subsequently call the cooperative insert/contains.
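The WCWS-style loop described above can be sketched on the host, with plain loops standing in for warp shuffles and a set lookup standing in for the cooperative probe. `G`, `group_contains`, and the loop structure are all illustrative, not device code:

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

constexpr std::size_t G = 4;  // stand-in for the cooperative group size

// Each "rank" r coalesces-loads keys[base + r]; every loaded key is then
// broadcast in turn so all G ranks could cooperate on one probe for it.
std::vector<bool> group_contains(std::unordered_set<int> const& set,
                                 std::vector<int> const& keys) {
  std::vector<bool> results(keys.size());
  for (std::size_t base = 0; base < keys.size(); base += G) {
    // Coalesced load: rank r holds keys[base + r].
    for (std::size_t r = 0; r < G && base + r < keys.size(); ++r) {
      int const k = keys[base + r];  // broadcast k to all ranks (shfl on device)
      // On the device, all G ranks would now probe cooperatively for k;
      // the host stand-in is a single lookup.
      results[base + r] = set.count(k) > 0;
    }
  }
  return results;
}
```

The point of the structure is that every probe is still fully cooperative (all `G` ranks work on one key at a time), while the input loads stay coalesced across the group.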
I want APIs that don't require me to think about launching input_size * CG_size number of threads and wrangle the CG throughout the whole kernel.
I just want to launch input_size threads and then create an ad hoc CG to do a cooperative insert/find.
In other words, I want to keep the simple "one work item per thread" model of work assignment, but benefit from cooperative execution by threads donating themselves to carry out another thread's work item.