dolma icon indicating copy to clipboard operation
dolma copied to clipboard

Potential race condition

Open chris-ha458 opened this issue 10 months ago • 5 comments

I have been looking into https://github.com/allenai/dolma/blob/main/src/bloom_filter.rs Specifically how it was thread-safe

pub fn contains_hashes(&self, hashes: &Vec<u64>) -> bool {
    for hash in hashes {
        let hash = *hash as usize;
        let index = hash / 32 % self.bits.len();
        let bit = hash % 32;
        if self.bits[index].load(Ordering::Relaxed) & (1 << bit) == 0 {
            return false;
        }
    }
    return true;
}

Here, while the access to individual AtomicU32 values within the array is atomic, the method is checking a condition over multiple AtomicU32 values without a lock on the Vec. As such multiple threads could be accessing individual AtomicU32 values independently of each other in both read or write. If one or more of the values change due to another thread's write operation during the execution of contains_hashes, it might lead to an incorrect result.

Consider the scenario where thread A is executing contains_hashes and finds that all the required bits are set. Before thread A completes execution, thread B updates one of the bits that thread A already checked. Since the check is not atomic across the entire vector, thread A's result becomes stale, and the method might return false when it should return true. This problem is compounded if there are more threads exist changing other items that have already been checked by other threads. If there are a higher degree of true positives(duplicates) the issue is also exacerbated. This is particularly concerning since Bloom filters are not supposed to produce false negatives, the problem becomes worse as we try to increase threads or the underlying data has a lot of duplicates.

In summary, The code exhibits a race condition where the result of the contains_hashes method can be affected by concurrent updates to the bits vector. The individual accesses to AtomicU32 are atomic, but the method's logic requires a consistent view of multiple AtomicU32 values, and this consistency is not guaranteed.

To correct this, a higher-level synchronization mechanism (e.g., a read-write lock on the whole of the Vec) might be required to ensure that the entire check operation in contains_hashes is atomic with respect to updates to the bits vector.

This is the approach taken by another rust crate ofilter. Relevant source

pub struct SyncBloom<T> {
    inner: Arc<RwLock<Bloom<T>>>,
}

This is an approach taken by google's Guava library for Go Their solution is not complete nor comprehensive, with a lot of tradeoffs in the datastructures, types and methods operating upon them and tests to validate that their tradeoffs do not cause too much error. (I do not like this approach but apparently it is good enough for them).

chris-ha458 avatar Aug 25 '23 11:08 chris-ha458