
test_concurrency intermittent segfault inside docker

Open jondea opened this issue 2 years ago • 4 comments

Summary

test_concurrency segfaults intermittently (roughly 1 in 50 runs). For example, running

ctest --repeat-until-fail 200 -R concurrency

which outputs

...
The following tests FAILED:
         82 - test_concurrency (SEGFAULT)
Errors while running CTest

This was first observed on CI (https://cloud.drone.io/oneapi-src/oneDNN/1380/3/2) but is also reproducible on master (currently 51ad89de16e35f5212ad96511bf3074808830894) on a c6gd.4xlarge by manually running the commands for clang-test in .drone.yml.

Note that I have only been able to reproduce this in Docker. Running the same commands outside a Docker container did not produce a segfault in ~5000 runs.

This may or may not be related, but the time the test takes to run grows rapidly when the gtest is repeated directly, for example with ./test_concurrency --gtest_repeat=10. This is not the case when repeating with ctest's --repeat-until-fail; I assume this is due to a difference in how the tests are set up and torn down. I also noticed that the test takes ~3 times longer inside Docker than outside, and that its run time varies: it usually takes <1s but occasionally takes >10s.
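To put numbers on both the failure rate and the run-time spread, each invocation can be timed separately. The following is a minimal sketch, not part of the oneDNN test tooling: the helper names `run_repeatedly` and `summarize` are hypothetical, and the binary path in the usage comment is an assumption about the build directory layout.

```python
import subprocess
import time


def run_repeatedly(cmd, runs):
    """Run `cmd` as a fresh process `runs` times.

    Returns a list of (returncode, wall_seconds) tuples, one per run.
    """
    results = []
    for _ in range(runs):
        start = time.perf_counter()
        proc = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        results.append((proc.returncode, time.perf_counter() - start))
    return results


def summarize(results):
    """Count failures and signal-terminated runs, and report timing spread."""
    # On POSIX, subprocess reports death by signal as a negative return
    # code, e.g. -11 for SIGSEGV on Linux.
    crashes = sum(1 for rc, _ in results if rc < 0)
    failures = sum(1 for rc, _ in results if rc != 0)
    times = sorted(t for _, t in results)
    return {
        "runs": len(results),
        "failures": failures,
        "crashes": crashes,
        "median_s": times[len(times) // 2],
        "max_s": times[-1],
    }


# Hypothetical usage (the path to the test binary is an assumption):
#   print(summarize(run_repeatedly(["./test_concurrency"], 200)))
```

Because every run is a separate process, this behaves like ctest's --repeat-until-fail rather than --gtest_repeat, so it should not show the growth in run time seen when the gtest is repeated inside a single process.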

Environment

  • CPU: c6gd.4xlarge and whichever arm64 CPU droneCI uses
  • OS: Ubuntu 20.04 and Ubuntu 18.04
  • Compiler: clang version 9.0.0-2~ubuntu18.04.2 (tags/RELEASE_900/final)
  • cmake version 3.10.2
  • cmake output: see CI run (https://cloud.drone.io/oneapi-src/oneDNN/1380/3/2)
  • Can reproduce on current master 51ad89de16e35f5212ad96511bf3074808830894 and a previous commit on master 11fa74eaf03af9848c1bb5fffb4cbb2866aadf42

jondea avatar May 11 '22 10:05 jondea

+@echeresh

densamoilov avatar May 11 '22 19:05 densamoilov

@jondea, is this issue still reproducible?

vpirogov avatar Jan 25 '23 18:01 vpirogov

Hi @vpirogov, I've just reproduced this on the latest master fd16b15d4c53a930c771a719ce7ed6e2def6ad2d using the same setup.

jondea avatar Feb 02 '23 09:02 jondea

One slight difference is that I managed to reproduce it outside of Docker this time (it may just have been chance that I couldn't last time), although still only with clang. It fails about 1 in 50 runs with clang, but did not fail in 10,000 runs with gcc.
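As a rough sanity check on how much a clean run count says, assume runs are independent and each fails with a fixed probability of 1 in 50 (an assumption; the real rate may depend on the environment). The chance of seeing zero failures in N runs is then (1 - p)^N, which this sketch evaluates for the ~5000 and 10,000 run counts mentioned in this thread. The function name `prob_no_failures` is made up for illustration.

```python
def prob_no_failures(p_fail, runs):
    """Probability of zero failures in `runs` independent runs,
    each failing with probability `p_fail`."""
    return (1.0 - p_fail) ** runs


# Assumed per-run failure probability, taken from the "about 1 in 50" estimate.
p = 1 / 50

for n in (5000, 10000):
    print(f"{n} clean runs at p = 1/50: {prob_no_failures(p, n):.3e}")
```

Under that assumption both probabilities are vanishingly small, so a clean run of that length suggests the failure rate in that configuration is much lower than 1 in 50, not that the run was lucky.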

jondea avatar Feb 02 '23 10:02 jondea