pulumi
pulumi copied to clipboard
ci: Sometimes a change is merged even when there are test failures
Sometimes we've seen successful merges despite there being test failures.
For example: https://github.com/pulumi/pulumi/issues/15272
In this case, there were test failures on Windows, but the merge still happened. Then, subsequent merge attempts fail due to the failing tests.
This is the merge run: https://github.com/pulumi/pulumi/actions/runs/7672650761
The *Annotations` shows this:
But the actual check is green:
The "run tests" step doesn't show any failures: https://github.com/pulumi/pulumi/actions/runs/7672650761/job/20913866181#step:38:156
Interestingly, if I look at the "Summarize Test Times by Individual Test" step, I do see the failures listed there:
https://github.com/pulumi/pulumi/actions/runs/7672650761/job/20913866181#step:45:61
TestNewStackRemoteSourceWithSetup github.com/pulumi/pulumi/sdk/v3/go/auto 1m0.87s pass
TestUpsertStackRemoteSource github.com/pulumi/pulumi/sdk/v3/go/auto 43.18s pass
TestNewStackRemoteSource github.com/pulumi/pulumi/sdk/v3/go/auto 33.11s pass
TestWorkspaceSecretsProvider github.com/pulumi/pulumi/sdk/v3/go/auto 20.22s pass
TestRemoveWithForce github.com/pulumi/pulumi/sdk/v3/go/auto 16.75s pass
TestUpsertStackLocalSource github.com/pulumi/pulumi/sdk/v3/go/auto 16.6s pass
TestSupportsStackOutputs github.com/pulumi/pulumi/sdk/v3/go/auto 16.34s pass
TestNewStackLocalSource github.com/pulumi/pulumi/sdk/v3/go/auto 15.94s pass
TestUpdatePlans github.com/pulumi/pulumi/sdk/v3/go/auto 12.86s pass
TestEnvFunctions github.com/pulumi/pulumi/sdk/v3/go/auto 12.46s pass
TestStructuredOutput github.com/pulumi/pulumi/sdk/v3/go/auto 12.08s fail
TestSaveStackSettings github.com/pulumi/pulumi/sdk/v3/go/auto 11.13s pass
TestProgressStreams github.com/pulumi/pulumi/sdk/v3/go/auto 10.31s pass
TestInstallDefaultRoot github.com/pulumi/pulumi/sdk/v3/go/auto 9.56s fail
TestTagFunctions github.com/pulumi/pulumi/sdk/v3/go/auto 4.31s pass
TestWhoAmIDetailed github.com/pulumi/pulumi/sdk/v3/go/auto 4.25s pass
TestConfigFlagLike github.com/pulumi/pulumi/sdk/v3/go/auto 2.88s pass
TestNestedConfig github.com/pulumi/pulumi/sdk/v3/go/auto 2.27s pass
TestProjectSettingsRespected github.com/pulumi/pulumi/sdk/v3/go/auto 1.46s pass
TestPulumiCommand github.com/pulumi/pulumi/sdk/v3/go/auto 350ms pass
TestPulumiVersion github.com/pulumi/pulumi/sdk/v3/go/auto 330ms pass
TestByPassesRemoteCheck github.com/pulumi/pulumi/sdk/v3/go/auto 0s pass
TestClIWithoutRemoteSupport github.com/pulumi/pulumi/sdk/v3/go/auto 0s pass
It's entirely possible that this could be a blip with GitHub. GitHub Actions is dragging today. I've noticed it as well as others on the team. No mention of it on https://www.githubstatus.com, though.
This looks similar to https://github.com/pulumi/pulumi/issues/14300.
Suggest we close 14300 as duplicate, this one has a bit more information on it.
I wanted to check the code coverage associated with some of the changes we postmortem'd, such as https://github.com/pulumi/pulumi/commit/a05c188edc3f2acd712657920be3571d1372e672. I checked CodeCov and couldn't find the commit listed there. Alarmingly, it looks like the merge workflow failed for this PR due to an intermittent failure installing Node on one of the jobs (and therefore the coverage data was not uploaded), but it still merged anyway. https://github.com/pulumi/pulumi/actions/runs/7713807240/job/21026572191
Apparently, skipped jobs are counted as successful.
Likely need a workaround like this: https://github.com/pulumi/pulumi-azure/blob/60738de3b1f19a8e2429332e16f32bbbd1caaec6/.github/workflows/run-acceptance-tests.yml#L258-L259
Looking into this, there's a few things happening in this issue that are getting a little conflated.
Alarmingly, it looks like the merge workflow failed for this PR due to an intermittent failure installing Node on one of the jobs (and therefore the coverage data was not uploaded), but it still merged anyway. https://github.com/pulumi/pulumi/actions/runs/7713807240/job/21026572191
Looking at the history here, there's a few commits that got created (iow put in the merge queue) within less than an hour:
- https://github.com/pulumi/pulumi/commit/a05c188edc3f2acd712657920be3571d1372e672 (3:50pm UTC), 2) https://github.com/pulumi/pulumi/commit/afb287d2fb305531d63c18378efacedf44c9dcfc (3:53pm UTC), 3) https://github.com/pulumi/pulumi/commit/0113f27d6c39333d90042bba3cd790c5880fd6ef (4:45 UTC). Because we are doing squash merges I believe it's impossible to track down the exact time the commits got merged.
The way our merge queue is set up, it keeps PRs in the merge group even if their CI checks have failed. We can change this setting, but I don't believe this is necessary.
So what happened here is that CI failed in this PR, but CI for a commit that was put in the merge group later, 3), succeeded while the first commit was still part of the merge group. Since CI was now green, the merge queue went ahead and merged the whole group. The reason why I think this setting is fine is that we care about the latest main branch being always green. With less flakes we could turn the setting on and make sure only PRs that are all green in the merge queue are going through. Right now I think that would be too expensive, but I could be convinced otherwise.
This is the merge run: https://github.com/pulumi/pulumi/actions/runs/7672650761
The *Annotations` shows this:
This is also fairly easily explained, we explicitly tell GitHub actions to not fail here: https://github.com/pulumi/pulumi/blob/master/.github/workflows/ci-run-test.yml#L376. This change was introduced in https://github.com/pulumi/pulumi/pull/10753, unfortunately without much explanation, but presumably this was fairly flaky, and still is. Spot checking the last 5 merge jobs, this step failed twice. If we want to make this an actual failure we'll need to try to de-flake this. However I'm not convinced that this step has enough value to spend a lot of engineering time on, since all it does is giving a summary of test times (Personal bias coming in here, but I've never looked at this). If people look at this fairly regularly we should definitely try to de-flake it, but making it an error while all the tests passed seems like it's not worth it.
What's very curious though is this:
Interestingly, if I look at the "Summarize Test Times by Individual Test" step, I do see the failures listed there:
https://github.com/pulumi/pulumi/actions/runs/7672650761/job/20913866181#step:45:61
There seem to be failed tests coming from somewhere, where we didn't notice the failures earlier. To me this looks like the crux of the issue, which also reminds me of https://github.com/pulumi/pulumi/issues/14300, where I believe I saw something similar.
So what happened here is that CI failed in this PR, but CI for a commit that was put in the merge group later, 3), succeeded while the first commit was still part of the merge group. Since CI was now green, the merge queue went ahead and merged the whole group. The reason why I think this setting is fine is that we care about the latest
mainbranch being always green.
Removing P1 label since this isn't as bad as we initially thought.
