ci: re-run known flakes

Open adrianreber opened this issue 3 years ago • 8 comments

This adds a wrapper script to re-run tests for known flakes. Known flake error messages can be added to the array KNOWN_FLAKES and the script will re-run the tests if one of the known flake error messages appears.

The script will try to re-run the failing tests up to $max_retries times. Most of the CI systems we use have a time limit, so max_retries should probably not be larger than 3. If a test fails 3 times, something probably really needs to be fixed.

The motivation for this script is that we currently just re-run CI manually when we see a known flake. This script tries to automate that step.
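The retry logic described above could look roughly like the following bash sketch. It is not the script from this PR: the flake messages are placeholders, and the function names `is_known_flake` and `run_with_retries` are hypothetical. Only `KNOWN_FLAKES` and `max_retries` are names taken from the description.

```shell
#!/usr/bin/env bash
# Illustrative sketch only, not the wrapper script from the PR.
# The messages below are placeholders, not the real flake messages.
KNOWN_FLAKES=(
    "example flake message one"
    "example flake message two"
)
max_retries=3

# Hypothetical helper: succeed if the log contains any known flake message.
is_known_flake() {
    local log="$1" pattern
    for pattern in "${KNOWN_FLAKES[@]}"; do
        # -F: match the message as a fixed string, not as a regex
        if grep -qF -- "$pattern" "$log"; then
            return 0
        fi
    done
    return 1
}

# Hypothetical helper: run the given command up to max_retries times,
# retrying only if the failure output matches a known flake.
run_with_retries() {
    local attempt status=1 log
    log="$(mktemp)"
    for ((attempt = 1; attempt <= max_retries; attempt++)); do
        "$@" 2>&1 | tee "$log"
        status="${PIPESTATUS[0]}"   # exit status of the command, not of tee
        if [ "$status" -eq 0 ]; then
            break
        fi
        if ! is_known_flake "$log"; then
            break                   # unknown failure: report it immediately
        fi
        echo "Known flake detected (attempt $attempt of $max_retries)" >&2
    done
    rm -f "$log"
    return "$status"
}
```

A CI job would then call something like `run_with_retries some-test-command`, where `some-test-command` stands in for the real test invocation. A failure that does not match any entry in `KNOWN_FLAKES` is returned on the first attempt, so real regressions are not hidden by the retries.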

adrianreber avatar Dec 04 '21 17:12 adrianreber

Codecov Report

Merging #1677 (f74fa48) into criu-dev (6754b16) will decrease coverage by 0.18%. The diff coverage is n/a.

@@             Coverage Diff              @@
##           criu-dev    #1677      +/-   ##
============================================
- Coverage     70.53%   70.34%   -0.19%     
============================================
  Files           126      125       -1     
  Lines         31071    31883     +812     
============================================
+ Hits          21916    22429     +513     
- Misses         9155     9454     +299     
| Impacted Files | Coverage Δ |
|---|---|
| include/common/lock.h | 89.18% <0.00%> (-4.57%) :arrow_down: |
| criu/files-reg.c | 75.42% <0.00%> (-1.14%) :arrow_down: |
| criu/uffd.c | 79.36% <0.00%> (-0.32%) :arrow_down: |
| criu/mount.c | 75.59% <0.00%> (-0.07%) :arrow_down: |
| criu/fdstore.c | 61.19% <0.00%> (ø) |
| criu/include/pid.h | 100.00% <0.00%> (ø) |
| criu/include/vma.h | 100.00% <0.00%> (ø) |
| include/common/list.h | 100.00% <0.00%> (ø) |
| criu/include/parasite.h | 100.00% <0.00%> (ø) |
| criu/arch/x86/include/asm/types.h | 100.00% <0.00%> (ø) |

... and 9 more

codecov-commenter avatar Dec 04 '21 17:12 codecov-commenter

This is the way to hell...

avagin avatar Dec 05 '21 22:12 avagin

> This is the way to hell...

:smile: I kind of agree, but it automates what I currently do manually.

This change only re-runs the tests for known flakes (currently there are two defined in the PR), and that is what I currently do manually. If I see a failure like the TLS-related errors in the page server, I know that it sometimes does not work. I just re-run the test and it passes the second time. From that point of view, it just saves me from re-running the test by hand.

adrianreber avatar Dec 07 '21 16:12 adrianreber

Can we fix these flakes?

avagin avatar Dec 07 '21 17:12 avagin

I will try to prioritize fixing the problem with TLS in page server (https://github.com/checkpoint-restore/criu/issues/1380).

rst0git avatar Dec 08 '21 07:12 rst0git

> Can we fix these flakes?

Of course, but the TLS problem has been open for almost 10 months and I was not able to replicate it locally. Hopefully @rst0git has a chance to fix it.

Some of the network tests (the second flake I added to the exception list in this PR) also fail sometimes because a socket operation fails due to an unclear race condition.

There is also the pthread_timers test which needs to be re-run sometimes.

If we can fix the tests themselves, that would be the much better solution, but there are a couple of tests which fail sometimes.

I will leave this as WIP for a bit longer. Maybe @rst0git finds a solution for the TLS problem, which happens most often. Then we can just close this PR. Although fixing tests correctly is the much better solution, if there are known flakes it may make sense to have those tests run a second time automatically. Maybe there could be a way to include this functionality in zdtm and only enable it for a few known tests.

adrianreber avatar Dec 08 '21 07:12 adrianreber

@adrianreber I opened a PR where I tried to address some of the issues with known flake tests https://github.com/checkpoint-restore/criu/pull/1690

With these changes I don't see any errors locally. I replicated that pull request 12 times in my fork and it looks like CI is always passing.

rst0git avatar Dec 15 '21 07:12 rst0git

A friendly reminder that this PR had no activity for 30 days.

github-actions[bot] avatar Feb 03 '22 00:02 github-actions[bot]