CCF
Try running all e2e tests in TSAN
Surprised to discover that this if (NOT TSAN) block gates so many tests. I believe many of them should now work, so let's see what the CI says.
The failures are so verbose we need to look at the raw logs, but on the first run these are the failing tests:
2024-11-07T15:18:36.9778496Z The following tests FAILED:
2024-11-07T15:18:36.9779075Z 40 - recovery_test_cft_api_0 (Failed)
2024-11-07T15:18:36.9779535Z 41 - recovery_test_cft_api_1 (Failed)
2024-11-07T15:18:36.9779971Z 42 - recovery_test_suite (Failed)
2024-11-07T15:18:36.9780411Z 43 - reconfiguration_test_suite (Failed)
2024-11-07T15:18:36.9780881Z 44 - regression_test_suite (Failed)
2024-11-07T15:18:36.9781299Z 45 - full_test_suite (Failed)
2024-11-07T15:18:36.9781683Z 47 - commit_latency (Failed)
2024-11-07T15:18:36.9782045Z 50 - auth (Failed)
2024-11-07T15:18:36.9782386Z 52 - governance_test (Failed)
2024-11-07T15:18:36.9782758Z 53 - jwt_test (Failed)
2024-11-07T15:18:36.9783289Z 55 - e2e_logging_cft (Failed)
2024-11-07T15:18:36.9783689Z 59 - e2e_logging_http2 (Failed)
2024-11-07T15:18:36.9784172Z 60 - membership_api_0 (Failed)
2024-11-07T15:18:36.9784565Z 66 - lts_compatibility (Failed)
2024-11-07T15:18:36.9784948Z 70 - acme_endorsement_test (Failed)
I've got stacks for some missing mutexes and mutex inversions around the snapshotter, which are likely what is failing the recovery tests. Will investigate the others.
The first change already knocks out a few of those failures:
2024-11-07T16:20:07.7189192Z 40 - recovery_test_cft_api_0 (Failed)
2024-11-07T16:20:07.7189554Z 41 - recovery_test_cft_api_1 (Failed)
2024-11-07T16:20:07.7189916Z 44 - regression_test_suite (Failed)
2024-11-07T16:20:07.7190245Z 45 - full_test_suite (Failed)
2024-11-07T16:20:07.7190561Z 52 - governance_test (Failed)
2024-11-07T16:20:07.7190874Z 55 - e2e_logging_cft (Failed)
2024-11-07T16:20:07.7191191Z 59 - e2e_logging_http2 (Failed)
2024-11-07T16:20:07.7191519Z 61 - membership_api_1 (Failed)
2024-11-07T16:20:07.7191848Z 66 - lts_compatibility (Failed)
2024-11-07T16:20:07.7192171Z 70 - acme_endorsement_test (Failed)
acme_endorsement_test is unrelated: pebble isn't installed.
Worryingly, we may be missing some TSAN information from the unit tests. The reports are either muzzled by the test wrapper or emitted as non-fatal warnings, as in this run where doctest still reports success:
$ TSAN_OPTIONS=second_deadlock_stack=1 ./snapshot_test
[doctest] doctest version is "2.4.11"
[doctest] run with "--help" for options
==================
WARNING: ThreadSanitizer: lock-order-inversion (potential deadlock) (pid=360032)
Cycle in lock order graph: M0 (0x7b4400000be8) => M1 (0x7fff03361dc8) => M0
Mutex M1 acquired here while holding mutex M0 in main thread:
#0 pthread_mutex_lock <null> (snapshot_test+0x83a0a) (BuildId: 6ef6b264fe1f8e764247d52773f4c39cfad93b37)
#1 std::__1::mutex::lock() <null> (libc++.so.1+0x4af15) (BuildId: e3dee72a81fed73680e4d05b6858c5327d95f499)
#2 ccf::kv::Store::get_map(unsigned long, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) /data/src/2.CCF/build.san/../src/kv/store.h:238:40 (snapshot_test+0x1c04cc) (BuildId: 6ef6b264fe1f8e764247d52773f4c39cfad93b37)
...
SUMMARY: ThreadSanitizer: lock-order-inversion (potential deadlock) (/data/src/2.CCF/build.san/snapshot_test+0x83a0a) (BuildId: 6ef6b264fe1f8e764247d52773f4c39cfad93b37) in pthread_mutex_lock
==================
===============================================================================
[doctest] test cases: 1 | 1 passed | 0 failed | 0 skipped
[doctest] assertions: 13 | 13 passed | 0 failed |
[doctest] Status: SUCCESS!
ThreadSanitizer: reported 1 warnings
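To illustrate why a report like the one above can coexist with a passing test, here is a minimal sketch of the pattern. It is not the CCF code: the mutexes `m0` and `m1` and both function names are hypothetical stand-ins. TSAN's deadlock detector flags a cycle in the lock order graph even when everything happens on one thread and no deadlock actually occurs, which matches the "in main thread" stack above.

```cpp
#include <mutex>

// Hypothetical stand-ins for the two mutexes in the report (M0 and M1).
std::mutex m0;
std::mutex m1;

// Acquires m0 then m1, and later m1 then m0, all on the calling thread.
// This never deadlocks when run single-threaded, but it records both
// edges M0 => M1 and M1 => M0, which is the cycle the detector reports.
int inverted_order_single_thread()
{
  int steps = 0;
  {
    std::lock_guard<std::mutex> first(m0);
    std::lock_guard<std::mutex> second(m1);
    ++steps;
  }
  {
    std::lock_guard<std::mutex> first(m1);
    std::lock_guard<std::mutex> second(m0);
    ++steps;
  }
  return steps;
}

// One possible fix: every code path takes the mutexes in the same order,
// so only the edge M0 => M1 ever exists and no cycle can form.
int consistent_order()
{
  int steps = 0;
  for (int i = 0; i < 2; ++i)
  {
    std::lock_guard<std::mutex> first(m0);
    std::lock_guard<std::mutex> second(m1);
    ++steps;
  }
  return steps;
}
```

Both functions return normally, so an assertion-based test passes either way; only the sanitizer output distinguishes them. Where both mutexes are taken together at one call site, `std::scoped_lock` (C++17) is another option, though whether either approach fits the snapshotter code is an assumption to check against the actual stacks.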
Closing this PR, superseded by @maxtropets' work in #7201/#7232.