CacheLib
Potential False Positive Data Races
@byrnedj and I have been running some cachebench tests and we think we found some data races that are false positives, and wanted to provide the output we gathered. This was done with ThreadSanitizer enabled while building with clang.
An example of the first output (findEviction) occurs in the eviction code path while the mmContainer iterator lock is held: another thread attempts to get the eviction iterator and the race is detected. The tests were run with the default settings using feature_stress/dram_cache.json and consistency/ram-with-deletes.json.
Below are the outputs of the two races that were detected:
findEviction_Output.txt
removeLocked_Output.txt
I took a look at the outputs. It seems they are both caused by a data race in the unlink() function. Both threads are trying to make an allocation via findEviction; one thread reaches getNext() and another thread reaches setNext().
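To illustrate the report, here is a minimal sketch of why concurrent unlink() calls can race. This is a simplified, illustrative intrusive doubly-linked list, not CacheLib's actual DList implementation: one thread reads a neighbor's link via getNext() while another thread rewrites the same link via setNext(), so without both threads holding the same lock the list can be corrupted.

```cpp
// Illustrative node; names (getNext/setNext/unlink) mirror the TSAN report,
// but this is NOT CacheLib's actual DList code.
struct Node {
  Node* prev{nullptr};
  Node* next{nullptr};

  Node* getNext() const noexcept { return next; }  // read side of the race
  void setNext(Node* n) noexcept { next = n; }     // write side of the race
  Node* getPrev() const noexcept { return prev; }
  void setPrev(Node* p) noexcept { prev = p; }
};

// Unlinks `node` from the list. If two threads evict neighboring nodes
// concurrently without holding the same container lock, one thread's
// getNext() on a shared neighbor races with the other thread's setNext().
void unlink(Node& node) noexcept {
  Node* prev = node.getPrev();
  Node* next = node.getNext();
  if (prev != nullptr) {
    prev->setNext(next);  // rewrites a link another thread may be reading
  }
  if (next != nullptr) {
    next->setPrev(prev);
  }
  node.setNext(nullptr);
  node.setPrev(nullptr);
}
```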
cc @therealgymmy @sathyaphoenix Any thoughts on whether we should fix it?
@mcengija Do you mind giving me the command you used to repro the error? Also, did you get this error when running other cachebench tests?
Hey @jiayuebao, the command I used was a cachebench test, e.g. ./cachelib/bin/cachebench --json_test_config cachelib/cachebench/test_configs/feature_stress/dram_cache.json and ./cachelib/bin/cachebench --json_test_config cachelib/cachebench/test_configs/consistency/ram-with-deletes.json
There is a slight change in the build-cachelib folder though. Within ccmake, make sure to have these settings:
- BUILD_TESTS = OFF
- CMAKE_BUILD_TYPE = Debug
- CMAKE_CXX_COMPILER = clang++
- CMAKE_CXX_FLAGS = -fsanitize=thread -latomic
- CMAKE_C_COMPILER = clang

Also make sure to have libtsan and libatomic installed; otherwise, the build will fail.
@byrnedj and I noticed this in other tests we were doing, so we decided to run these other tests in the main branch to confirm we saw consistent output.
@mcengija After some exploration, we believe TSAN is reporting this data race because folly::DistributedMutex is not correctly detected by it. The race would require two threads to access the same eviction iterator at the same time; however, we already take a lock around the iterator (https://github.com/facebook/CacheLib/blob/main/cachelib/allocator/MMLru-inl.h#L217), so such a data race shouldn't actually happen. For some reason, I cannot repro the issue on my side. Could you try changing the Mutex (https://github.com/facebook/CacheLib/blob/main/cachelib/allocator/MMLru.h#L212) to some other type that is supported by TSAN (e.g. folly::SharedMutex) and check whether the data race disappears?
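For illustration only (the exact alias and member names in MMLru.h may differ), the suggested experiment amounts to swapping the mutex type the container uses, roughly like this:

```cpp
// Hypothetical sketch of the suggested experiment; not the literal
// declarations from MMLru.h.
#include <folly/SharedMutex.h>
#include <folly/synchronization/DistributedMutex.h>

// Before: DistributedMutex carries no TSAN annotations, so the sanitizer
// cannot see the exclusion it provides.
// using Mutex = folly::DistributedMutex;

// After: SharedMutex is visible to TSAN, so a real race would still be
// reported while a false positive should disappear.
using Mutex = folly::SharedMutex;
```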
Just confirmed that folly::DistributedMutex isn't annotated, thus won't be known by TSAN. @mcengija Did you try any other annotated mutex to see whether the issue would disappear?
Hey @jiayuebao I haven't had a chance yet. I was planning on attempting the suggestion above, switching to folly::SharedMutex, on Wednesday. I'm finishing a few other tests requested by the team before I can try that switch. I'll let you know by Wednesday late evening.
Ignore the close; mis-clicked.
@mcengija Any update?
Hey, small update with the build. It was failing, and unfortunately @byrnedj was out sick for the week, so we decided to put a hold on it while I worked on two other cachelib issues. @byrnedj and I plan to work on it at the start of next week and hope to have some output for you by Tuesday. Sorry for the delay.
@jiayuebao quick question for you: what version of TSAN, clang (or gcc) did you use when trying to replicate the results?
In addition, it's difficult to debug with another lock type when it doesn't implement lock_combine.
> Hey, small update with the build. It was failing.
Could you share more information about the build failure? And is the failure blocking you from debugging further?
I also got a build failure when running ./contrib/build.sh -j -O:
[ 0%] Building CXX object CMakeFiles/folly_base.dir/folly/compression/Compression.cpp.o
/home/jiayueb/CacheLib/cachelib/external/folly/folly/compression/Compression.cpp:39:10: fatal error: lzma.h: No such file or directory
#include <lzma.h>
         ^~~~~~~~
compilation terminated.
make[2]: *** [CMakeFiles/folly_base.dir/build.make:552: CMakeFiles/folly_base.dir/folly/compression/Compression.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:197: CMakeFiles/folly_base.dir/all] Error 2
make: *** [Makefile:136: all] Error 2
build-package.sh: error: make failed
build.sh: error: failed to build dependency 'folly'
Not sure whether you got the same error. cc @agordon
> What version of TSAN, clang (or gcc) did you use

Since the build failed, I couldn't use ./cachelib/bin/cachebench. I used our internal buck command to run the code under TSAN.
Hey @jiayuebao, I can build all the necessary dependencies successfully. The only one that doesn't build is cachelib. This is because when switching from folly::DistributedMutex to folly::SharedMutex, I get the build error "has no member named 'lock_combine'".
Can you clarify what you mean by:

> I used our internal buck command to run the code under TSAN.

If you are able to build internally, do you get the same error as above when switching to folly::SharedMutex?
@mcengija:

> folly::SharedMutex build error of has no member named lock_combine

Let me try to clarify a bit. This build error is expected, since DistributedMutex is one of the few locks we have internally that support flat combining to reduce contention in the critical section. folly::SharedMutex doesn't support it and has no corresponding lock_combine API.
When you switch to folly::SharedMutex, you need to refactor the code that currently uses lock_combine. For example, instead of the following:
auto func = [this, &node, curr]() {
// ... critical section
};
lruMutex_->lock_combine(func);
You want to refactor it as:
auto func = [this, &node, curr]() {
// ... critical section
};
{
auto lock_guard = folly::SharedMutex::WriteHolder{*lruMutex_};
func();
}
Alternatively, you can also just try std::mutex, since accesses to the LRU should always be exclusive. Follow the same logic as with folly::SharedMutex.
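As a rough sketch of that alternative (the member and function names here are illustrative, not the actual MMLru declarations):

```cpp
#include <mutex>

// Illustrative only: assumes the container's lock member is changed to a
// plain std::mutex instead of folly::DistributedMutex.
std::mutex lruMutex_;

void removeSketch(/* LRU node, iterator, ... */) {
  auto func = [/* this, &node, curr */]() {
    // ... critical section
  };
  {
    // std::lock_guard gives the same exclusive access that lock_combine
    // provided, just without flat combining.
    std::lock_guard<std::mutex> guard{lruMutex_};
    func();
  }
}
```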
Hey @mcengija: checking in again, any update?
Hey @jiayuebao, sorry for the delay. The code was refactored this week and we had time to review it on Thursday. @byrnedj and I went through it, and he brought the updates to the rest of your team. We have concluded that our warnings went away when using std::mutex in the mmContainer.
Thanks!