
Potential False Positive Data Races

Open mcengija opened this issue 2 years ago • 12 comments

@byrnedj and I have been running some cachebench tests and we believe we found some data races that are false positives, so we wanted to provide the output we gathered. This was done with ThreadSanitizer enabled while building with Clang.

An example of the first output (findEviction) occurs in the eviction code path: while the mmContainer iterator lock is held, another thread attempts to get the eviction iterator and the race is reported. The tests were run with the default settings using feature_stress/dram_cache.json and consistency/ram-with-deletes.json.

Below are the outputs of the two races that were detected: findEviction_Output.txt and removeLocked_Output.txt.

mcengija avatar Jun 29 '22 04:06 mcengija

I took a look at the outputs. It seems they are both caused by a data race in the unlink() function. Both threads are trying to make an allocation via findEviction: one thread reaches getNext() while another reaches setNext().
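
For context, here is a simplified sketch of the kind of doubly linked list unlink() the reports point at (hypothetical code, not the actual CacheLib DList implementation):

    struct Node {
      Node* next{nullptr};
      Node* prev{nullptr};
    };

    // Thread A evicting one node reads a neighbour's link (getNext) while
    // thread B, evicting an adjacent node, rewrites that same link (setNext).
    // If the lock serializing both unlinks is invisible to TSAN, these plain
    // pointer accesses look like an unsynchronized data race.
    void unlink(Node& node) {
      Node* const prev = node.prev;
      Node* const next = node.next;
      if (prev != nullptr) {
        prev->next = next; // setNext()
      }
      if (next != nullptr) {
        next->prev = prev; // setPrev()
      }
      node.next = nullptr;
      node.prev = nullptr;
    }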

cc @therealgymmy @sathyaphoenix Any thoughts on whether we should fix it?

jiayuebao avatar Jun 29 '22 17:06 jiayuebao

@mcengija Do you mind giving me the command you used to reproduce the error? Also, did you get this error when running other cachebench tests?

jiayuebao avatar Jul 08 '22 00:07 jiayuebao

Hey @jiayuebao, the commands I used were cachebench tests, e.g. ./cachelib/bin/cachebench --json_test_config cachelib/cachebench/test_configs/feature_stress/dram_cache.json and ./cachelib/bin/cachebench --json_test_config cachelib/cachebench/test_configs/consistency/ram-with-deletes.json

There is a slight change to the build-cachelib folder, though. Within ccmake, the changes I made are:

Make sure to have these settings:

    BUILD_TESTS        = OFF
    CMAKE_BUILD_TYPE   = Debug
    CMAKE_CXX_COMPILER = clang++
    CMAKE_CXX_FLAGS    = -fsanitize=thread -latomic
    CMAKE_C_COMPILER   = clang

Make sure to have libtsan and libatomic installed. Otherwise, the build will fail.

@byrnedj and I noticed this in other tests we were doing, so we decided to run these tests on the main branch to confirm the output was consistent.

mcengija avatar Jul 08 '22 22:07 mcengija

@mcengija After some exploration, we believe TSAN is reporting this data race because folly::DistributedMutex is not correctly recognized by TSAN. The reported race would require two threads to access the same eviction iterator at the same time, but we already take a lock around the iterator (https://github.com/facebook/CacheLib/blob/main/cachelib/allocator/MMLru-inl.h#L217), so such a data race shouldn't actually happen. For some reason, I cannot reproduce the issue on my side. Could you try changing the mutex (https://github.com/facebook/CacheLib/blob/main/cachelib/allocator/MMLru.h#L212) to some other type that is supported by TSAN (e.g. folly::SharedMutex) and check whether the data race report disappears?
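
To illustrate the false-positive mechanism, here is a minimal, self-contained sketch (my own example, not CacheLib code) of a correctly locked critical section behind DistributedMutex that TSAN can still flag:

    #include <folly/synchronization/DistributedMutex.h>

    folly::DistributedMutex mutex;
    int sharedState = 0;

    // Called concurrently from several threads. Every access to sharedState
    // happens inside lock_combine, so the accesses are serialized. However,
    // DistributedMutex may run the lambda on a different waiting thread
    // (flat combining) and carries no TSAN annotations, so TSAN can still
    // report the increments as an unsynchronized data race.
    void touchSharedState() {
      mutex.lock_combine([&] { ++sharedState; });
    }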

jiayuebao avatar Jul 13 '22 00:07 jiayuebao

Just confirmed that folly::DistributedMutex isn't annotated for TSAN, so TSAN doesn't know about its lock/unlock events. @mcengija Did you try any other annotated mutex to see whether the issue would disappear?

jiayuebao avatar Jul 18 '22 22:07 jiayuebao

Hey @jiayuebao I haven't had a chance yet. I was planning on attempting the suggestion above with switching to folly::SharedMutex on Wednesday. Finishing a few other requested tests by the team before I can try that switch. I'll let you know by Wednesday late evening.

Ignore the close; I mis-clicked.

mcengija avatar Jul 18 '22 22:07 mcengija

@mcengija Any update?

jiayuebao avatar Jul 22 '22 17:07 jiayuebao

Hey, small update with the build. It was failing, and unfortunately @byrnedj was out sick for the week, so we decided to put it on hold while I worked on two other cachelib issues. @byrnedj and I plan to work on it at the start of next week and hope to have some output for you by Tuesday. Sorry for the delay.

mcengija avatar Jul 22 '22 18:07 mcengija

@jiayuebao quick question for you... What version of TSAN, clang (or gcc) did you use when trying to replicate the results?

In addition, it's difficult to debug another lock type when it doesn't implement lock_combine.

mcengija avatar Jul 22 '22 20:07 mcengija

Hey, small update with the build. It was failing.

Could you share more information about the build failure? And is the failure blocking you from debugging further?

I also got a build failure when running ./contrib/build.sh -j -O:

[  0%] Building CXX object CMakeFiles/folly_base.dir/folly/compression/Compression.cpp.o
/home/jiayueb/CacheLib/cachelib/external/folly/folly/compression/Compression.cpp:39:10: fatal error: lzma.h: No such file or directory
 #include <lzma.h>
          ^~~~~~~~
compilation terminated.
make[2]: *** [CMakeFiles/folly_base.dir/build.make:552: CMakeFiles/folly_base.dir/folly/compression/Compression.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:197: CMakeFiles/folly_base.dir/all] Error 2
make: *** [Makefile:136: all] Error 2
build-package.sh: error: make failed
build.sh: error: failed to build dependency 'folly'

Not sure whether you got the same error. cc @agordon

What version of TSAN, clang (or gcc) did you use

Since the build failed, I couldn't use ./cachelib/bin/cachebench. I used our internal buck command to run the code under TSAN.

jiayuebao avatar Jul 25 '22 19:07 jiayuebao

Hey @jiayuebao, I can build all the necessary dependencies successfully. The only one that doesn't build is cachelib. This is because switching from folly::DistributedMutex to folly::SharedMutex produces a build error: has no member named lock_combine.

Can you clarify on what you mean by:

I used our internal command buck to run the code under TSAN.

If you are able to build internally, do you get the same error as above when switching to folly::SharedMutex?

mcengija avatar Jul 26 '22 22:07 mcengija

@mcengija:

folly::SharedMutex build error: has no member named lock_combine

Let me try to clarify a bit. This build error is expected, since DistributedMutex is one of the few locks we have internally that support flat combining to reduce contention on the critical section. folly::SharedMutex doesn't support it and has no API corresponding to lock_combine.

When you switch to folly::SharedMutex, you need to refactor the code that uses lock_combine. For example, instead of the following:

    auto func = [this, &node, curr]() {
       // ... critical section
    };

    lruMutex_->lock_combine(func);

You want to refactor it as:

    auto func = [this, &node, curr]() {
       // ... critical section
    };
    {
      // Hold the same mutex used by lock_combine above for the whole
      // critical section; the WriteHolder releases it at end of scope.
      auto lock_guard = folly::SharedMutex::WriteHolder{*lruMutex_};
      func();
    }

Alternatively, you can just try std::mutex, since accesses to the LRU should always be exclusive. Follow the same pattern as with folly::SharedMutex.
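
A rough sketch of what that std::mutex variant could look like (assuming lruMutex_ is switched to std::mutex and still dereferences to the lock, as in the snippet above):

    #include <mutex>

    auto func = [this, &node, curr]() {
       // ... critical section
    };
    {
      // std::lock_guard is understood by TSAN, so the critical section is
      // visible to the sanitizer; the lock is released at end of scope.
      std::lock_guard<std::mutex> guard(*lruMutex_);
      func();
    }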

therealgymmy avatar Jul 27 '22 00:07 therealgymmy

Hey @mcengija: checking in again, any update?

jiayuebao avatar Aug 22 '22 20:08 jiayuebao

Hey @jiayuebao, sorry for the delay. The code was refactored this week and we had time to review it on Thursday. @byrnedj and I went through and reviewed it, and he brought the updates to the rest of your team. We have concluded that the TSAN warnings went away when using std::mutex in the MMContainer.

Thanks!

mcengija avatar Aug 26 '22 18:08 mcengija