oneDNN
test_concurrency intermittent segfault inside docker
Summary
test_concurrency segfaults intermittently (~1 in 50 runs), for example when running
ctest --repeat-until-fail 200 -R concurrency
which outputs
...
The following tests FAILED:
82 - test_concurrency (SEGFAULT)
Errors while running CTest
This was first observed on CI (https://cloud.drone.io/oneapi-src/oneDNN/1380/3/2) but is also reproducible on master (currently 51ad89de16e35f5212ad96511bf3074808830894) on a c6gd.4xlarge by manually running the commands for clang-test in .drone.yml.
Note that I have only been able to reproduce this inside Docker. Running the same commands outside a Docker container did not produce a segfault in ~5000 runs.
This may or may not be related, but the time the test takes to run grows rapidly when you repeat the gtest directly, for example with ./test_concurrency --gtest_repeat=10. This is not the case when you repeat using ctest's --repeat-until-fail; I assume this is due to a difference in how the tests are set up and torn down. I also noticed that the test takes ~3 times longer inside Docker than outside, and that its run time varies: it usually takes <1s but occasionally takes >10s.
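For anyone who wants to collect the failure rate and the timing outliers in one pass, here is a minimal sketch of a repeat harness. It is not part of oneDNN or its CI; the function names `repeat_and_time` and `summarize` are hypothetical, and the 10 s "slow" threshold is simply taken from the observation above. It starts a fresh process per run, similar in spirit to repeating via ctest rather than via --gtest_repeat.

```python
import subprocess
import sys
import time


def repeat_and_time(cmd, runs):
    """Run `cmd` in a fresh process `runs` times.

    Returns a list of (returncode, seconds) tuples, one per run.
    On POSIX a process killed by a signal has a negative return code
    (for example -11 for SIGSEGV).
    """
    results = []
    for _ in range(runs):
        start = time.perf_counter()
        proc = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        results.append((proc.returncode, time.perf_counter() - start))
    return results


def summarize(results, slow_threshold=10.0):
    """Count failed runs and runs slower than `slow_threshold` seconds."""
    failures = sum(1 for rc, _ in results if rc != 0)
    slow = sum(1 for _, secs in results if secs > slow_threshold)
    return {"runs": len(results), "failures": failures, "slow": slow}


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; replace it with the
    # real test binary, e.g. ["./test_concurrency"] (path is an assumption).
    demo = repeat_and_time([sys.executable, "-c", "pass"], runs=3)
    print(summarize(demo))
```

Running it with the real binary and, say, runs=200 inside and outside the container would give directly comparable failure and slow-run counts.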
Environment
- CPU: c6gd.4xlarge and whichever arm64 CPU droneCI uses
- OS: ubuntu 20.04 and ubuntu 18.04
- Compiler: clang version 9.0.0-2~ubuntu18.04.2 (tags/RELEASE_900/final)
- cmake version 3.10.2
- cmake output: see CI run (https://cloud.drone.io/oneapi-src/oneDNN/1380/3/2)
- Can reproduce on current master 51ad89de16e35f5212ad96511bf3074808830894 and a previous commit on master 11fa74eaf03af9848c1bb5fffb4cbb2866aadf42
+@echeresh
@jondea, is this issue still reproducible?
Hi @vpirogov, I've just reproduced this on the latest master fd16b15d4c53a930c771a719ce7ed6e2def6ad2d using the same setup.
One slight difference is that I managed to reproduce it outside of Docker this time (it may have just been chance that I couldn't last time). It still only happens with clang, though: it fails about 1 in 50 runs with clang, but it did not fail in 10,000 runs with gcc.
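As a rough sanity check on these run counts, the sketch below computes how likely a streak of clean runs would be if every run failed independently with the same probability. The independence assumption and the helper names `prob_all_pass` and `runs_for_confidence` are mine, not something measured here.

```python
import math


def prob_all_pass(p_fail, runs):
    """Probability that `runs` independent runs all pass,
    given a per-run failure probability `p_fail`."""
    return (1.0 - p_fail) ** runs


def runs_for_confidence(p_fail, confidence):
    """Smallest number of runs after which a bug with per-run failure
    probability `p_fail` would have shown up at least once with
    probability >= `confidence`."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


if __name__ == "__main__":
    p = 1 / 50  # the ~1 in 50 rate observed with clang
    print(prob_all_pass(p, 10_000))      # chance of 10,000 clean runs at that rate
    print(runs_for_confidence(p, 0.99))  # clean runs needed for 99% confidence
```

Under that assumption, a 1 in 50 failure rate makes 10,000 consecutive clean runs vanishingly unlikely, so the gcc build either does not hit the problem or hits it at a much lower rate.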