cuCollections
cuCollections copied to clipboard
[FEA] Hash map design that supports reduce by key
Is your feature request related to a problem? Please describe.
I would like to be able to implement a hash-based reduce-by-key operation. A thrust::reduce_by_key operation requires first doing a sort_by_key. This means 2 passes through the input data and 2n memory overhead. A hash based implementation requires only a single pass and n/0.7 memory overhead (assuming a 70% load factor).
Trying to implement a reduce-by-key with today's static_map implementation is effectively impossible.
There's two ways you could implement a hash-based rbk (assume the reduction is sum).
The first assumes that insert returns an iterator to the slot where the inserted occurred, or the slot that contains an equivalent key:
template <typename Map, typename Key, typename Value>
hash_rbk(Map m, Key key_begin, Key key_end, Value value_begin){
auto found = map.insert(key_begin + tid, value_begin + tid); // assume `insert` returns an iterator to the slot where the insert occurred
// key already exists, accumulate with the existing value
if(found.second){
auto& found_pair = (found.first)
found_pair.second.fetch_add(value_begin + tid);
}
}
The second assumes you can do concurrent insert and find:
template <typename Map, typename Key, typename Value>
hash_rbk(Map m, Key key_begin, Key key_end, Value value_begin){
auto found = map.find(key_begin + tid); // look if the key exists
// key already exists, accumulate with the existing value
if(found != map.end()){
found.second.fetch_add(value_begin + tid);
} else { // key doesn't exist
auto found = map.insert(key_begin + tid, value_begin + tid);
if(not found.second){ // someone else already did the insert, accumulate with their value
found.first.second.fetch_add(value_begin+tid);
}
}
}
However, both of these are impossible because:
- The
insertfunction cannot return an iterator (due to the potential for a race condition caused by the back-to-back CAS algorithm), and instead returns abool.- This results from the fact that we use
pair<atomic<K>, atomic<V>>as the slot storage type and use back to back CAS. If we instead usedatomic<pair<K,V>>we could return an iterator
- This results from the fact that we use
- You cannot do concurrent insert/find operations.
- This results from the fact that we use
pair<atomic<K>, atomic<V>>as the slot storage type and use back to back CAS. If we instead usedatomic<pair<K,V>>we could support concurrent insert and find.
- This results from the fact that we use
Describe the solution you'd like
Explore a hash map implementation that allows implementing a reduce by key.
One option is to use a atomic<pair<K,V>> implementation, but that has it's own drawbacks. See https://docs.google.com/presentation/d/1_UJlQoDc985sj03grMroB2zcCbiiC7-ua_1Eqm4qrRU/edit?usp=sharing
I don't quite understand. At the end of the insert method, there is a guarantee that a slot will contain the key and a matching value (the value could be from one of the many threads writing to the slot). The race condition is only within the insert method, between the key cas and value cas. Therefore, it should be possible to return the iterator wrapped in an optional, with the optional containing the value only if the current thread's value was written in the slot. Or simply be modeled after cudf's hash map which returns a bool along with the iterator.
Or is the limitation from a design standpoint? That cuCo's maps should only return an iterator and not a pair<bool, iterator>.
The problem with returning an iterator in insert with the back-to-back CAS algorithm is best understood through example.
Imagine two threads doing a concurrent insert for the same key. Thread 0 inserts <k, v0>, and Thread 1 inserts <k, v1>. Furthermore, let's say k maps to slot s.
Let's say T0 wins the key CAS, but the value CAS has not completed yet. So s's state is <k, EMPTY>.
At the same time, T1 performs the insert <k, v1>. It sees that s.first == k and therefore the slot is not empty and the key is equal to the same k that is being inserted. Therefore, T1 get's an iterator that points at s.
T1 now holds a pointer to the slot in the state <k, EMPTY>. Let's assume the value CAS from T0 has still not completed. If T1 attempted to do anything with the value in s, like add v1 to the existing value, then it would be operating on an invalid value EMPTY.
The jist of it is, returning an iterator from insert with the back-to-back CAS algorithm allows for other threads to observe slots in a partially written state. This is bad.
You might ask, "Why not have T1's insert poll on s.second and not return until the value is different from EMPTY?".
We tried that and there was a fairly large performance hit. @Nicolas-Iskos may still have the performance graph that shows the impact.
Closed by #515