criu
ci: re-run known flakes
This adds a wrapper script to re-run tests for known flakes. Known flake error messages can be added to the array KNOWN_FLAKES, and the script will re-run the tests if one of those messages appears.
The script will re-run the failing tests up to $max_retries times. Most commonly used CI systems have a time limit, so max_retries should probably not be larger than 3. If a test fails 3 times, maybe something really needs to be fixed.
The motivation for this script is that we currently just re-run CI by hand when we see a known flake. This script automates that step.
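The retry logic described above can be sketched roughly as follows. This is a minimal illustration in Python, not the script from this PR: the function names, the placeholder entry in `KNOWN_FLAKES`, and the reading of `max_retries` as the total number of attempts are all assumptions.

```python
import subprocess
import sys

# Placeholder pattern for illustration only; the real KNOWN_FLAKES
# entries defined in the PR are not reproduced here.
KNOWN_FLAKES = [
    "example known flake message",
]


def is_known_flake(output, known_flakes=KNOWN_FLAKES):
    """Return True if any known flake message appears in the test output."""
    return any(message in output for message in known_flakes)


def run_with_retries(cmd, max_retries=3, known_flakes=KNOWN_FLAKES):
    """Run cmd and re-run it only while it fails with a known flake message.

    max_retries is treated as the total number of attempts (an assumption).
    Returns the exit code of the last attempt.
    """
    returncode = 1
    for attempt in range(1, max_retries + 1):
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        # Forward the captured output so the CI log stays complete.
        sys.stdout.write(proc.stdout)
        returncode = proc.returncode
        if returncode == 0:
            return 0
        if not is_known_flake(proc.stdout, known_flakes):
            # Unknown failure: report it immediately instead of hiding it.
            return returncode
        print(
            f"Known flake detected (attempt {attempt} of {max_retries})",
            file=sys.stderr,
        )
    return returncode
```

A CI job would call `run_with_retries` with its usual test command as a list of arguments and exit with the returned code. Failures that do not match a known message are never retried, so new regressions still fail on the first run.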
Codecov Report
Merging #1677 (f74fa48) into criu-dev (6754b16) will decrease coverage by 0.18%. The diff coverage is n/a.
@@ Coverage Diff @@
## criu-dev #1677 +/- ##
============================================
- Coverage 70.53% 70.34% -0.19%
============================================
Files 126 125 -1
Lines 31071 31883 +812
============================================
+ Hits 21916 22429 +513
- Misses 9155 9454 +299
| Impacted Files | Coverage Δ | |
|---|---|---|
| include/common/lock.h | 89.18% <0.00%> (-4.57%) | :arrow_down: |
| criu/files-reg.c | 75.42% <0.00%> (-1.14%) | :arrow_down: |
| criu/uffd.c | 79.36% <0.00%> (-0.32%) | :arrow_down: |
| criu/mount.c | 75.59% <0.00%> (-0.07%) | :arrow_down: |
| criu/fdstore.c | 61.19% <0.00%> (ø) | |
| criu/include/pid.h | 100.00% <0.00%> (ø) | |
| criu/include/vma.h | 100.00% <0.00%> (ø) | |
| include/common/list.h | 100.00% <0.00%> (ø) | |
| criu/include/parasite.h | 100.00% <0.00%> (ø) | |
| criu/arch/x86/include/asm/types.h | 100.00% <0.00%> (ø) | |
| ... and 9 more | | |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data.
Last update 6754b16...f74fa48.
This is the way to hell...
> This is the way to hell...
:smile: I kind of agree, but it automates what I currently do manually.
This change only reruns the tests for known flakes (currently there are two defined in the PR), and that is what I currently do manually. If I see a failure like the TLS-related errors in the page server, I know that it sometimes does not work. I just re-run the test and it passes the second time. From that point of view, it just saves me from re-running the test by hand.
Can we fix these flakes?
I will try to prioritize fixing the problem with TLS in page server (https://github.com/checkpoint-restore/criu/issues/1380).
> Can we fix these flakes?
Of course, but the TLS problem has been open for almost 10 months and I was not able to replicate it locally. Hopefully @rst0git has a chance to fix it.
Some of the network tests (the second flake I added to the exception list in this PR) also fail sometimes because a socket operation fails due to an unclear race condition.
There is also the pthread_timers test which needs to be re-run sometimes.
If we can fix the tests themselves, that would be the much better solution, but there are a couple of tests which fail sometimes.
I will leave this as WIP for a bit longer. Maybe @rst0git finds a solution for the TLS problem, which happens most often. Then we can just close this PR. Fixing the tests properly is the much better solution, but if there are known flakes, maybe it makes sense to have those tests run a second time automatically. Maybe there could be a way to include this functionality in zdtm and only enable it for a few known tests.
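As a rough sketch of that last idea, a per-test opt-in could look like the following. Nothing here reflects how zdtm is actually structured: `FLAKY_TESTS`, `run_test_with_retry`, and the `run_once` callback are hypothetical names, and the only test name used is the pthread_timers test mentioned above.

```python
# Hypothetical allowlist of tests that may be run a second time.
# Only pthread_timers is named in this thread; how zdtm identifies
# tests internally is not assumed here.
FLAKY_TESTS = {"pthread_timers"}


def run_test_with_retry(name, run_once, flaky_tests=FLAKY_TESTS, max_attempts=2):
    """Run one test and repeat it only if it failed and is on the allowlist.

    run_once is a callable that takes the test name and returns True on pass.
    Returns a tuple (passed, attempts_used).
    """
    # Tests that are not on the allowlist get exactly one attempt.
    attempts_allowed = max_attempts if name in flaky_tests else 1
    for attempt in range(1, attempts_allowed + 1):
        if run_once(name):
            return True, attempt
    return False, attempts_allowed
```

Compared with matching error messages in the log output, an explicit allowlist keeps the retry scoped to individual tests, so a failure in any other test is reported after a single run.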
@adrianreber I opened a PR where I tried to address some of the issues with the known flaky tests: https://github.com/checkpoint-restore/criu/pull/1690
With these changes I don't see any errors locally. I replicated that pull request 12 times in my fork and it looks like CI is always passing:
A friendly reminder that this PR had no activity for 30 days.