π¨βπΎ Add a potential temporary fix for flaky tests on macOS
π¦ Bug fix
Relates to #806
Summary
A potential solution to the problem of flaky tests on macOS is to include all known components in libignition-gazebo.so. I have tested this locally with Address Sanitizer and it seems to work.
Checklist
- [ ] Signed all commits for DCO
- [ ] Added tests
- [ ] Updated documentation (as needed)
- [ ] Updated migration guide (as needed)
- [ ]
codecheckpassed (See contributing) - [ ] All tests passed (See test coverage)
- [ ] While waiting for a review on your PR, please help review another open pull request to support the maintainers
Note to maintainers: Remember to use Squash-Merge
include all known components in libignition-gazebo.so
Do you think we could get a similar effect by including them into the test executable? That would solve the test's issue while leaving the core library intact.
I can't see the results from CI, and it looks like 3 tests failed on Homebrew. I'll sync this with ign-gazebo3 to retrigger CI.
I think some, if not all, of the remaining failures might be fixed by https://github.com/ignitionrobotics/ign-common/pull/227.
So you think they're not related to how the components are registered, as explained on #806?
One thing all these test failures have in common is a segfault at startup, even before any console messages are printed.
Codecov Report
Merging #807 (6cb1354) into ign-gazebo5 (b45f60f) will decrease coverage by
0.09%. The diff coverage is100.00%.
@@ Coverage Diff @@
## ign-gazebo5 #807 +/- ##
===============================================
- Coverage 67.04% 66.95% -0.10%
===============================================
Files 276 276
Lines 21049 21023 -26
===============================================
- Hits 14112 14075 -37
- Misses 6937 6948 +11
| Impacted Files | Coverage Ξ | |
|---|---|---|
| src/SdfEntityCreator.cc | 85.41% <ΓΈ> (ΓΈ) |
|
| src/SimulationRunner.cc | 91.88% <100.00%> (-2.05%) |
:arrow_down: |
| src/Barrier.cc | 96.15% <0.00%> (-3.85%) |
:arrow_down: |
Continue to review full report at Codecov.
Legend - Click here to learn more
Ξ = absolute <relative> (impact),ΓΈ = not affected,? = missing dataPowered by Codecov. Last update b45f60f...6cb1354. Read the comment docs.
FWIW, there were no homebrew failures in the latest run :grimacing:
I've disabled multi-threading in https://github.com/ignitionrobotics/ign-gazebo/pull/807/commits/6ac5110322dd43d9bfd76f2576d6e44cc2db0413. Let's what this does.
I've disabled multi-threading in 6ac5110. Let's what this does.
It's tough to tell from just one CI run because these failures are very flaky. It would be nice to be able to set something like colcon's --retest-until-fail on CI
I've disabled multi-threading in 6ac5110. Let's what this does.
It's tough to tell from just one CI run because these failures are very flaky. It would be nice to be able to set something like colcon's
--retest-until-failon CI
we currently have a script for running ctest with --rerun-failed if the RERUN_FAILED_TESTS variable is set to 1, though we only use it for gazebo CI
- https://github.com/ignition-tooling/release-tools/blob/master/jenkins-scripts/lib/make_test_rerun_failed.bash#L17
- https://github.com/ignition-tooling/release-tools/blob/c819f8682070ee88a9ac80b0983205c5cda5141f/jenkins-scripts/docker/lib/gazebo-base-default.bash#L194
- https://github.com/ignition-tooling/release-tools/blob/c819f8682070ee88a9ac80b0983205c5cda5141f/jenkins-scripts/gazebo-default-devel-homebrew-amd64.bash#L13
we could add something similar for --retest-until-fail
I've disabled multi-threading in 6ac5110. Let's what this does.
It's tough to tell from just one CI run because these failures are very flaky. It would be nice to be able to set something like colcon's
--retest-until-failon CIwe currently have a script for running
ctestwith--rerun-failedif theRERUN_FAILED_TESTSvariable is set to1, though we only use it for gazebo CI
- https://github.com/ignition-tooling/release-tools/blob/master/jenkins-scripts/lib/make_test_rerun_failed.bash#L17
- https://github.com/ignition-tooling/release-tools/blob/c819f8682070ee88a9ac80b0983205c5cda5141f/jenkins-scripts/docker/lib/gazebo-base-default.bash#L194
- https://github.com/ignition-tooling/release-tools/blob/c819f8682070ee88a9ac80b0983205c5cda5141f/jenkins-scripts/gazebo-default-devel-homebrew-amd64.bash#L13
we could add something similar for
--retest-until-fail
Yeah, ctest has repeat-until-fail. How about doing that on a custom RTOOLS_BRANCH? Is https://github.com/ignition-tooling/release-tools/issues/242 still an issue?
How about doing that on a custom RTOOLS_BRANCH?
+1
Is ignition-tooling/release-tools#242 still an issue?
I believe it is, but it's flaky and I haven't seen it in a while.
https://build.osrfoundation.org/job/ignition_gazebo-ci-pr_any-homebrew-amd64/5997/ ran each test until it failed with a maximum of 100 runs each. The logs show:
The following tests FAILED:
39 - UNIT_PeerTracker_TEST (Failed)
107 - INTEGRATION_network_handshake (Subprocess aborted)
119 - INTEGRATION_scene_broadcaster_system (SEGFAULT)
I suspect the failure in UNIT_PeerTracker_TEST might be a timing/networking issue (https://github.com/ignitionrobotics/ign-gazebo/pull/452). The INTEGRATION_network_handshake test uses multiple threads so it might be a race condition on console stream objects (https://github.com/ignitionrobotics/ign-common/pull/227). I don't have any guesses as to what's causing the segfault in INTEGRATION_scene_broadcaster_system
For comparision, I'll run ign-gazebo3 with --repeat-until-fail and report back.
In the interest of time, I only did --repeat-until-fail=10 for the ign-gazebo3 branch: https://build.osrfoundation.org/view/ign-citadel/job/ignition_gazebo-ci-pr_any-homebrew-amd64/6020
The logs show:
The following tests FAILED:
39 - UNIT_PeerTracker_TEST (Failed)
59 - INTEGRATION_breadcrumbs (SEGFAULT)
71 - INTEGRATION_each_new_removed (Subprocess aborted)
93 - INTEGRATION_level_manager_runtime_performers (SEGFAULT)
111 - INTEGRATION_physics_system (SEGFAULT)
115 - INTEGRATION_pose_publisher_system (SEGFAULT)
119 - INTEGRATION_scene_broadcaster_system (SEGFAULT)
I'm guessing INTEGRATION_network_handshake would have failed to if I had run it 100 times.
So we're testing 2 different things at once here. The header includes and the worker threads. I think we should examine them separately.
If it turns out the worker threads are the problem, I believe the next step could be to disable multi-threading just for macOS until we have time to try and address the underlying issue. I'm assuming no one has time to debug these failures on a mac right now. I think it's worth sacrificing performance for reliability for now (for Mac users and CI).
Now that ign-common 4.5 has been released, we've have had much better homebrew CI https://build.osrfoundation.org/job/ignition_gazebo-ci-ign-gazebo5-homebrew-amd64/ (starting from build 53).
Last 20 builds
INTEGRATION_scene_broadcaster_system segfaulted in build 55 and UNIT_PeerTracker_TEST failed in build 58. I think this is very encouraging. I'll retarget this to ign-gazebo5 for now and run tests to see if the combination of this PR and https://github.com/ignitionrobotics/ign-common/pull/227 show an improvement.
I have attached the test result csv file for posterity: Test Results.csv
@azeey , Edifice will EOL next week. Do you want to retarget this to ign-gazebo6?
@azeey , Edifice will EOL next week. Do you want to retarget this to
ign-gazebo6?
Yup. Will do.
Closing in favor of #1836