Try running all e2e tests in TSAN

Open eddyashton opened this issue 1 year ago • 2 comments

Surprised to discover that this if (NOT TSAN) block gates so many tests. Believe many should now work - let's see what the CI says.

eddyashton avatar Nov 07 '24 14:11 eddyashton

The failures are so verbose we need to look at the raw logs, but on the first run these are the failing tests:

2024-11-07T15:18:36.9778496Z The following tests FAILED:
2024-11-07T15:18:36.9779075Z 	 40 - recovery_test_cft_api_0 (Failed)
2024-11-07T15:18:36.9779535Z 	 41 - recovery_test_cft_api_1 (Failed)
2024-11-07T15:18:36.9779971Z 	 42 - recovery_test_suite (Failed)
2024-11-07T15:18:36.9780411Z 	 43 - reconfiguration_test_suite (Failed)
2024-11-07T15:18:36.9780881Z 	 44 - regression_test_suite (Failed)
2024-11-07T15:18:36.9781299Z 	 45 - full_test_suite (Failed)
2024-11-07T15:18:36.9781683Z 	 47 - commit_latency (Failed)
2024-11-07T15:18:36.9782045Z 	 50 - auth (Failed)
2024-11-07T15:18:36.9782386Z 	 52 - governance_test (Failed)
2024-11-07T15:18:36.9782758Z 	 53 - jwt_test (Failed)
2024-11-07T15:18:36.9783289Z 	 55 - e2e_logging_cft (Failed)
2024-11-07T15:18:36.9783689Z 	 59 - e2e_logging_http2 (Failed)
2024-11-07T15:18:36.9784172Z 	 60 - membership_api_0 (Failed)
2024-11-07T15:18:36.9784565Z 	 66 - lts_compatibility (Failed)
2024-11-07T15:18:36.9784948Z 	 70 - acme_endorsement_test (Failed)

I've got stacks for some missing mutexes and mutex inversions around the snapshotter, which likely explain the recovery test failures. Will investigate the others.

eddyashton avatar Nov 07 '24 15:11 eddyashton

First change knocks out a few of those failures already:

2024-11-07T16:20:07.7189192Z 	 40 - recovery_test_cft_api_0 (Failed)
2024-11-07T16:20:07.7189554Z 	 41 - recovery_test_cft_api_1 (Failed)
2024-11-07T16:20:07.7189916Z 	 44 - regression_test_suite (Failed)
2024-11-07T16:20:07.7190245Z 	 45 - full_test_suite (Failed)
2024-11-07T16:20:07.7190561Z 	 52 - governance_test (Failed)
2024-11-07T16:20:07.7190874Z 	 55 - e2e_logging_cft (Failed)
2024-11-07T16:20:07.7191191Z 	 59 - e2e_logging_http2 (Failed)
2024-11-07T16:20:07.7191519Z 	 61 - membership_api_1 (Failed)
2024-11-07T16:20:07.7191848Z 	 66 - lts_compatibility (Failed)
2024-11-07T16:20:07.7192171Z 	 70 - acme_endorsement_test (Failed)

acme_endorsement_test is unrelated: pebble isn't installed.

Worryingly, we may be missing some TSAN information from the unit tests - the reports are either muzzled by the test wrapper, or treated as non-fatal warnings:

$ TSAN_OPTIONS=second_deadlock_stack=1 ./snapshot_test
[doctest] doctest version is "2.4.11"
[doctest] run with "--help" for options
==================
WARNING: ThreadSanitizer: lock-order-inversion (potential deadlock) (pid=360032)
  Cycle in lock order graph: M0 (0x7b4400000be8) => M1 (0x7fff03361dc8) => M0

  Mutex M1 acquired here while holding mutex M0 in main thread:
    #0 pthread_mutex_lock <null> (snapshot_test+0x83a0a) (BuildId: 6ef6b264fe1f8e764247d52773f4c39cfad93b37)
    #1 std::__1::mutex::lock() <null> (libc++.so.1+0x4af15) (BuildId: e3dee72a81fed73680e4d05b6858c5327d95f499)
    #2 ccf::kv::Store::get_map(unsigned long, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) /data/src/2.CCF/build.san/../src/kv/store.h:238:40 (snapshot_test+0x1c04cc) (BuildId: 6ef6b264fe1f8e764247d52773f4c39cfad93b37)
...

SUMMARY: ThreadSanitizer: lock-order-inversion (potential deadlock) (/data/src/2.CCF/build.san/snapshot_test+0x83a0a) (BuildId: 6ef6b264fe1f8e764247d52773f4c39cfad93b37) in pthread_mutex_lock
==================
===============================================================================
[doctest] test cases:  1 |  1 passed | 0 failed | 0 skipped
[doctest] assertions: 13 | 13 passed | 0 failed |
[doctest] Status: SUCCESS!
ThreadSanitizer: reported 1 warnings
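
For anyone unfamiliar with this class of report, here is a minimal sketch of a lock-order inversion and one common way to remove it. It is not the CCF code: `Counters`, `store_lock`, `snapshot_lock` and all the function names are hypothetical stand-ins, and the report above does not say which two mutexes form the cycle. The inverted pair takes the same two mutexes in opposite orders, which is enough to create a cycle in the lock-order graph even when both calls happen on one thread, as in the trace above. The fixed pair uses `std::scoped_lock` (C++17), which acquires both mutexes with a deadlock-avoidance algorithm.

```cpp
#include <mutex>
#include <thread>

// Hypothetical stand-in for two components that share state.
struct Counters
{
  std::mutex store_lock;
  std::mutex snapshot_lock;
  int maps_created = 0;
  int snapshots_taken = 0;
};

// Inverted pair: store_lock -> snapshot_lock here...
inline void create_map_inverted(Counters& c)
{
  std::lock_guard<std::mutex> store_guard(c.store_lock);
  std::lock_guard<std::mutex> snapshot_guard(c.snapshot_lock);
  ++c.maps_created;
}

// ...and snapshot_lock -> store_lock here. Calling these two from different
// threads can deadlock; calling them one after the other on a single thread
// does not hang, but still records both orderings.
inline void take_snapshot_inverted(Counters& c)
{
  std::lock_guard<std::mutex> snapshot_guard(c.snapshot_lock);
  std::lock_guard<std::mutex> store_guard(c.store_lock);
  ++c.snapshots_taken;
}

// Fixed pair: both paths acquire the two mutexes together, in one statement.
inline void create_map(Counters& c)
{
  std::scoped_lock guard(c.store_lock, c.snapshot_lock);
  ++c.maps_created;
}

inline void take_snapshot(Counters& c)
{
  std::scoped_lock guard(c.store_lock, c.snapshot_lock);
  ++c.snapshots_taken;
}

// Runs the fixed pair from two threads and returns the total number of
// operations performed.
inline int run_fixed(int iterations)
{
  Counters c;
  std::thread maps([&c, iterations] {
    for (int i = 0; i < iterations; ++i)
    {
      create_map(c);
    }
  });
  std::thread snapshots([&c, iterations] {
    for (int i = 0; i < iterations; ++i)
    {
      take_snapshot(c);
    }
  });
  maps.join();
  snapshots.join();
  return c.maps_created + c.snapshots_taken;
}
```

Calling `run_fixed(1000)` from a small `main` built with `-fsanitize=thread` is one way to try this locally. Where the two locks cannot be taken in a single statement, the usual alternative is to document one global acquisition order and follow it on every path.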

eddyashton avatar Nov 07 '24 16:11 eddyashton

Closing this PR, superseded by @maxtropets' work in #7201/#7232.

eddyashton avatar Sep 02 '25 08:09 eddyashton