
Organization of CI workflow impedes restarting individual jobs

Open davidcok opened this issue 1 year ago • 6 comments

Dafny version

4.1.0+dev

Code to produce this issue

No response

Command to run and resulting output

No response

What happened?

One unfortunate side effect of adding coverage reporting at the end of the main workflow is that all of the integration and xunit tests now run in a single workflow, so if one job fails (with a transient error), it can't be restarted until all the jobs are complete.

What type of operating system are you experiencing the problem on?

Other

davidcok avatar May 30 '23 16:05 davidcok

From what I can tell, there doesn't seem to be a way to both 1) allow a job to depend on more than one prior job, and 2) allow one of those prior jobs to be restarted before the other finishes. Given that we really want to be able to merge unit and integration test coverage results, I think we may need to live with the current status unless GitHub implements something allowing more fine-grained control of jobs.
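To illustrate the constraint (a minimal sketch with hypothetical job names, not the actual Dafny workflow): a job that merges coverage must list both test jobs in `needs`, and GitHub Actions provides no way to re-run one prerequisite while the other is still pending.

```yaml
# Hypothetical sketch of the dependency structure described above.
jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - run: echo "run xunit tests, upload coverage artifact"
  integration-tests:
    runs-on: ubuntu-latest
    steps:
      - run: echo "run integration tests, upload coverage artifact"
  merge-coverage:
    # Both prerequisites must finish before this job starts, which is
    # what blocks restarting an individual failed test job early.
    needs: [unit-tests, integration-tests]
    runs-on: ubuntu-latest
    steps:
      - run: echo "download both artifacts and merge coverage reports"
```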

Something we could use in the meantime, however, is something that will automatically re-run a particular step in a job if it fails. The most common transient errors we get are in the language server unit tests, so using something like this action to run this step might be an option.
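As a sketch of what a step-level retry could look like (not necessarily the action linked above — `nick-fields/retry` is one widely used option, and the test command below is illustrative):

```yaml
# Illustrative only: wrap the flaky step in a retry action so a
# transient failure re-runs just this step, not the whole job.
- name: Run language server unit tests (with retries)
  uses: nick-fields/retry@v3
  with:
    timeout_minutes: 30
    max_attempts: 3       # re-run the step up to 3 times on failure
    command: dotnet test Source/DafnyLanguageServer.Test
```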

atomb avatar Apr 18 '24 18:04 atomb

To add to the previous comment, I think there may be some other improvements we could make to the language server tests.

As background, we currently run them twice to be more likely to catch flaky tests, which increases the chance of a spurious failure. Then, however, we usually just manually re-run the job when it fails, which decreases it again.

To improve on this, we could run the tests only once in the step, but allow that step to be retried a few times (maybe 3) if it fails. In any subsequent runs, we would immediately fail if any of the failed tests were added or modified in the current PR. Otherwise, we would rerun the tests normally. This would reduce CI compute use overall in the common case but more quickly detect failures that occur in tests being added and modified, all while reducing the probability that flakiness in existing tests would block a PR.

atomb avatar Apr 18 '24 18:04 atomb

> To improve on this, we could run the tests only once in the step, but allow that step to be retried a few times (maybe 3) if it fails. In any subsequent runs, we would immediately fail if any of the failed tests were added or modified in the current PR.

An unmodified test can become flaky because of modifications elsewhere, which has already happened several times. I don't see a way to decide which tests should be allowed to be rerun. I would prefer to run possibly flaky tests many times, like 10 or 100 times, instead of 2.

I think the best investment is to fix the flaky tests. That not only improves our build but also lets us catch hard-to-find bugs. The only thing needed to fix these flaky tests is to prioritize the work.

keyboardDrummer avatar Apr 19 '24 08:04 keyboardDrummer

I think I agree that trying to identify which failures we care about would be too imprecise, and too much work to be worth it. In the meantime, I've implemented support for automatically retrying those tests, since I think we'll have at least a few PRs before we manage to eliminate such failures, and it was very easy to do. But I agree that the top priority should be to fix those tests.

atomb avatar Apr 19 '24 20:04 atomb

With #5342 merged, I think we've resolved the pain point that this issue describes as well as we're going to be able to.

atomb avatar Apr 19 '24 21:04 atomb

https://github.com/dafny-lang/dafny/pull/5342 was reverted so I'm reopening this in case we want to revisit.

Also adding a somewhat-related observation: if you are trying to fix a nightly build break in a PR and have added run-deep-tests, then even after all jobs finish, it's not possible to re-run just one failed job. This is because the check-deep-tests jobs fail, and will keep failing until the PR is merged; since every job has needs: check-deep-tests, they all believe they must re-run in case that result (or any output from it) changes.

It seems reasonable to not bother with check-deep-tests at all if run-deep-tests is present.
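One way to express that, sketched here under the assumption that run-deep-tests is a PR label and with illustrative job bodies:

```yaml
# Hypothetical sketch: skip the check-deep-tests gate when the PR
# already carries the run-deep-tests label, and let downstream jobs
# run when the gate was skipped (but not when it genuinely failed).
jobs:
  check-deep-tests:
    if: ${{ !contains(github.event.pull_request.labels.*.name, 'run-deep-tests') }}
    runs-on: ubuntu-latest
    steps:
      - run: echo "verify the latest nightly deep-test run succeeded"
  build:
    needs: check-deep-tests
    # always() keeps this job eligible even when the gate was skipped;
    # the result check still blocks it on an actual gate failure.
    if: ${{ always() && needs.check-deep-tests.result != 'failure' }}
    runs-on: ubuntu-latest
    steps:
      - run: echo "regular CI jobs"
```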

robin-aws avatar May 01 '24 16:05 robin-aws