This adds a wrapper script to re-run tests for known flakes. Known flake error messages can be added to the array KNOWN_FLAKES and the script will re-run the tests if one of the known flake error messages appears.

The script will re-run the failing tests up to $max_retries times. Most CI systems we use have a time limit, so max_retries should probably not be larger than 3. If a test fails 3 times, something probably really needs to be fixed.
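The retry behaviour described above can be sketched roughly as follows. This is a minimal illustration in Python, not the script from this PR: the entry in `KNOWN_FLAKES` is a placeholder (the PR's real messages are not quoted in this thread), the names `is_known_flake`, `run_with_retries` and `runner` are hypothetical, and `max_retries` is assumed to count re-runs after the first attempt.

```python
import subprocess

# Placeholder entry; the real known-flake messages from the PR are not
# shown in this thread.
KNOWN_FLAKES = [
    "example flaky error message",
]


def is_known_flake(output, known_flakes=KNOWN_FLAKES):
    """Return True if any known flake message appears in the test output."""
    return any(msg in output for msg in known_flakes)


def _run(cmd):
    """Run cmd and return (exit code, combined stdout/stderr text)."""
    proc = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    return proc.returncode, proc.stdout


def run_with_retries(cmd, max_retries=3, known_flakes=KNOWN_FLAKES, runner=None):
    """Run cmd, re-running only when the failure matches a known flake.

    Assumption: max_retries is the number of re-runs allowed after the
    first attempt. Returns (final exit code, number of attempts made).
    """
    runner = runner or _run
    attempts = 0
    while True:
        attempts += 1
        returncode, output = runner(cmd)
        if returncode == 0:
            return 0, attempts
        # Unknown failures are reported immediately, without a retry.
        if not is_known_flake(output, known_flakes):
            return returncode, attempts
        # Known flake, but the retry budget is used up.
        if attempts > max_retries:
            return returncode, attempts
```

With this shape, a failure that does not match any entry in `KNOWN_FLAKES` still fails the run on the first attempt, so only the listed flakes are ever retried.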

The motivation for this script is that we currently just re-run CI by hand when we see a known flake. This script automates that step.

Dec 04 '21 17:12 adrianreber

Codecov Report

Merging #1677 (f74fa48) into criu-dev (6754b16) will decrease coverage by 0.18%. The diff coverage is n/a.

@@             Coverage Diff              @@
##           criu-dev    #1677      +/-   ##
============================================
- Coverage     70.53%   70.34%   -0.19%     
============================================
  Files           126      125       -1     
  Lines         31071    31883     +812     
============================================
+ Hits          21916    22429     +513     
- Misses         9155     9454     +299

| Impacted Files | Coverage Δ | |
|---|---|---|
| include/common/lock.h | `89.18% <0.00%> (-4.57%)` | :arrow_down: |
| criu/files-reg.c | `75.42% <0.00%> (-1.14%)` | :arrow_down: |
| criu/uffd.c | `79.36% <0.00%> (-0.32%)` | :arrow_down: |
| criu/mount.c | `75.59% <0.00%> (-0.07%)` | :arrow_down: |
| criu/fdstore.c | `61.19% <0.00%> (ø)` | |
| criu/include/pid.h | `100.00% <0.00%> (ø)` | |
| criu/include/vma.h | `100.00% <0.00%> (ø)` | |
| include/common/list.h | `100.00% <0.00%> (ø)` | |
| criu/include/parasite.h | `100.00% <0.00%> (ø)` | |
| criu/arch/x86/include/asm/types.h | `100.00% <0.00%> (ø)` | |

... and 9 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 6754b16...f74fa48.

Dec 04 '21 17:12 codecov-commenter

This is the way to hell...

Dec 05 '21 22:12 avagin

> This is the way to hell...

:smile: I kind of agree, but it automates what I currently do manually.

This change only re-runs the tests for known flakes (currently there are two defined in the PR), which is what I currently do manually. If I see a failure like the TLS-related errors in the page server, I know that it sometimes does not work. I just re-run the test and it passes the second time. From that point of view, it just saves me from re-running the test myself.

Dec 07 '21 16:12 adrianreber

Can we fix these flakes?

Dec 07 '21 17:12 avagin

I will try to prioritize fixing the problem with TLS in page server (https://github.com/checkpoint-restore/criu/issues/1380).

Dec 08 '21 07:12 rst0git

> Can we fix these flakes?

Of course, but the TLS problem has been open for almost 10 months and I was not able to replicate it locally. Hopefully @rst0git has a chance to fix it.

Some of the network tests (the second flake I added to the exception list in this PR) also fail sometimes because a socket operation fails due to an unclear race condition.

There is also the pthread_timers test which needs to be re-run sometimes.

If we can fix the tests themselves, that would be the much better solution, but there are a couple of tests which fail sometimes.

I will leave this as WIP for a bit longer. Maybe @rst0git finds a solution for the TLS problem, which happens most often. Then we can just close this PR. Although fixing the tests properly is the much better solution, if there are known flakes it may make sense to have those tests run a second time automatically. Maybe there is a way to include this functionality in zdtm and only enable it for a few known tests.

Dec 08 '21 07:12 adrianreber

@adrianreber I opened a PR where I tried to address some of the issues with known flake tests https://github.com/checkpoint-restore/criu/pull/1690

With these changes I don't see any errors locally. I replicated that pull request 12 times in my fork and it looks like CI is always passing.

Dec 15 '21 07:12 rst0git

A friendly reminder that this PR had no activity for 30 days.

Feb 03 '22 00:02 github-actions[bot]


ci: re-run known flakes
