SlabHash
SlabHash copied to clipboard
Race during insertion of key/values
I noticed that there were races in the code. For example, in line 45 of src/concurrent_map/warp:
uint32_t src_unit_data = (next == SlabHashT::A_INDEX_POINTER) ? *(getPointerFromBucket(src_bucket, laneId)) : *(getPointerFromSlab(next, laneId));
Here a weak load operation is used to read data from the bucket. Later on (line 65/line 83), an atomicCAS is performed to the same location.
Different threads may read and atomicCAS on the same location, creating a race.
@sashkiani @maawad
@AKKamath did you see other races of this type? Thank you for this report!
This seems repeated in many locations. For example src/concurrent_set/cset_warp_operations.cuh has a similar code snippet at lines 41 - 43.
Hi @AKKamath, although the bucket read operation doesn't perform any synchronization, the atomicCAS
operation (where we perform the write) will only succeed if the pointer points to an EMPTY_PAIR_64
(i.e., it will fail if we read an empty value, but before the atomicCAS
operation a different thread modified that memory location). If different threads try to perform a CAS operation, only one thread will succeed.
Could you explain what scenario would produce a race condition?
I'm also curious about how you found these multiple race conditions.
Since the read operation is "weak" it has no consistency guarantees. While I don't see any obvious race conditions, there has been prior work that shows data races in CUDA code breaks assumptions. In the same work, they change a weak store to an atomic operation as it was racing with an atomicCAS.
A simple fix would be changing the weak loads to a "strong" operation by typecasting them to volatile, or using an atomic operation instead, e.g. atomicAdd(getPointerFromBucket(src_bucket, laneId), 0)
Thanks for sharing that paper. I read it before. But as I mentioned, even though the read operation is weak, the operation that matters is the atomic CAS operation.
So by definition (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#conflict-and-data-races), there is a data race between the weak load and atomic. I guess it could be a benign data race in this case.
I continue to be interested in getting to the bottom of this. @maawad is there a data race / is it benign, or is this not a race?