Organization of CI workflow impedes restarting individual jobs
Dafny version
4.1.0+dev
Code to produce this issue
No response
Command to run and resulting output
No response
What happened?
One unfortunate side-effect of adding the coverage reporting at the end of the main workflow is that all of the integration and xunit tests now run in a single workflow, so if one job fails with a transient error, that job can't be restarted until all the jobs have completed.
What type of operating system are you experiencing the problem on?
Other
From what I can tell, there doesn't seem to be a way to both 1) allow a job to depend on more than one prior job, and 2) allow one of those prior jobs to be restarted before the other finishes. Given that we really want to be able to merge unit and integration test coverage results, I think we may need to live with the current situation unless GitHub implements more fine-grained control over jobs.
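For illustration, here is a minimal sketch of the dependency structure in question; the job names, runners, and steps are placeholders rather than the actual workflow:

```yaml
jobs:
  integration-tests:        # placeholder for the integration test job(s)
    runs-on: ubuntu-latest
    steps:
      - run: echo "run integration tests and upload a coverage artifact"
  unit-tests:               # placeholder for the xunit test job(s)
    runs-on: ubuntu-latest
    steps:
      - run: echo "run xunit tests and upload a coverage artifact"
  merge-coverage:
    # Depends on BOTH test jobs, so a failed dependency cannot be re-run
    # individually until every job in the workflow run has finished.
    needs: [integration-tests, unit-tests]
    runs-on: ubuntu-latest
    steps:
      - run: echo "download both coverage artifacts and merge the reports"
```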
In the meantime, however, we could use something that automatically re-runs a particular step in a job if it fails. The most common transient errors we get are in the language server unit tests, so using something like this action to run that step might be an option.
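As a rough sketch of what that could look like, the step below wraps a placeholder test command in a retry action (nick-fields/retry is one example of such an action, not necessarily the one linked above):

```yaml
- name: Language server unit tests (with automatic retry)
  uses: nick-fields/retry@v2
  with:
    max_attempts: 3
    timeout_minutes: 30
    # Placeholder command; the real step would invoke the actual test project.
    command: dotnet test Source/DafnyLanguageServer.Test
```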
To add to the previous comment, I think there may be some other improvements we could make to the language server tests.
As background, we currently run them twice to be more likely to catch flaky tests, which increases the chance of a CI failure. Then, however, when the job does fail we usually just manually re-run it, which decreases that chance again.
To improve on this, we could run the tests only once in the step, but allow that step to be retried a few times (maybe 3) if it fails. On any retry, we would immediately fail if any of the failed tests were added or modified in the current PR; otherwise, we would rerun the tests normally. This would reduce CI compute use in the common case, detect failures in newly added or modified tests more quickly, and reduce the probability that flakiness in existing tests blocks a PR.
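A rough sketch of that policy as a single workflow step might look like this, where run-ls-tests.sh and failed-tests.txt are hypothetical helpers (a test runner that records the source files of failing tests) and origin/master is assumed to be the PR's base branch:

```yaml
- name: Language server unit tests (retry unless failures were touched by this PR)
  shell: bash
  run: |
    # Assumes a full fetch of the base branch so the diff below works.
    for attempt in 1 2 3; do
      if ./run-ls-tests.sh; then exit 0; fi   # hypothetical runner; writes failing test files to failed-tests.txt
      changed=$(git diff --name-only origin/master...HEAD)
      if grep -Fxf failed-tests.txt <<< "$changed" > /dev/null; then
        echo "A failing test was added or modified in this PR; not retrying."
        exit 1
      fi
      echo "Attempt $attempt failed only in pre-existing tests; retrying..."
    done
    exit 1
```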
An unmodified test can become flaky because of modifications elsewhere, which has already happened several times. I don't see a good way to decide which tests should be allowed to be rerun. I would prefer to run possibly flaky tests many times, like 10 or 100 times, instead of twice.
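For what it's worth, hammering a suspected flaky test repeatedly is easy to script; the filter expression below is a placeholder for whichever test is under suspicion:

```yaml
- name: Run a suspected flaky test 100 times
  shell: bash
  run: |
    for i in $(seq 1 100); do
      dotnet test --filter "FullyQualifiedName~SomeSuspectedFlakyTest" || exit 1
    done
```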
I think the best investment is to fix the flaky tests. This not only improves our build but also lets us catch hard-to-find bugs. The only thing needed to fix these flaky tests is to prioritize doing so.
I think I agree that trying to identify which failures we care about would be too imprecise, and too much work to be worth it. In the meantime, I've implemented support for automatically retrying those tests, since I think we'll have at least a few PRs before we manage to eliminate such failures, and it was super easy to do. But I agree that the top priority should be to fix those tests.
With #5342 merged, I think we've resolved the pain point that this issue describes as well as we're going to be able to.
https://github.com/dafny-lang/dafny/pull/5342 was reverted so I'm reopening this in case we want to revisit.
Also adding a somewhat-related observation: if you are trying to fix a nightly build break in a PR and have added run-deep-tests, then even after all jobs finish, it's not possible to re-run just one failed job. This is because the check-deep-tests jobs fail and will continue to fail every time until the PR is merged, and all jobs have needs: check-deep-tests, so they think they have to re-run in case its result or outputs change.
It seems reasonable to not bother with check-deep-tests at all if run-deep-tests is present.
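A sketch of that suggestion, assuming run-deep-tests is a PR label and using illustrative job bodies, might look like this:

```yaml
jobs:
  check-deep-tests:
    # Skip the nightly check entirely when the PR opts into running the deep tests itself.
    if: ${{ !contains(github.event.pull_request.labels.*.name, 'run-deep-tests') }}
    runs-on: ubuntu-latest
    steps:
      - run: echo "verify that the latest nightly deep-test run succeeded"

  integration-tests:   # stands in for the jobs that currently have needs: check-deep-tests
    needs: check-deep-tests
    # Run if check-deep-tests succeeded OR was skipped because of the label;
    # without the always(), a skipped dependency would skip this job too.
    if: ${{ always() && needs.check-deep-tests.result != 'failure' && needs.check-deep-tests.result != 'cancelled' }}
    runs-on: ubuntu-latest
    steps:
      - run: echo "run the integration tests"
```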