pulumi ci: Sometimes a change is merged even when there are test failures

Sometimes we've seen successful merges despite there being test failures.

For example: https://github.com/pulumi/pulumi/issues/15272

In this case, there were test failures on Windows, but the merge still happened. Then, subsequent merge attempts fail due to the failing tests.

This is the merge run: https://github.com/pulumi/pulumi/actions/runs/7672650761

The *Annotations` shows this:

But the actual check is green:

The "run tests" step doesn't show any failures: https://github.com/pulumi/pulumi/actions/runs/7672650761/job/20913866181#step:38:156

Interestingly, if I look at the "Summarize Test Times by Individual Test" step, I do see the failures listed there:

https://github.com/pulumi/pulumi/actions/runs/7672650761/job/20913866181#step:45:61

TestNewStackRemoteSourceWithSetup	github.com/pulumi/pulumi/sdk/v3/go/auto	1m0.87s	pass
TestUpsertStackRemoteSource	github.com/pulumi/pulumi/sdk/v3/go/auto	43.18s	pass
TestNewStackRemoteSource	github.com/pulumi/pulumi/sdk/v3/go/auto	33.11s	pass
TestWorkspaceSecretsProvider	github.com/pulumi/pulumi/sdk/v3/go/auto	20.22s	pass
TestRemoveWithForce	github.com/pulumi/pulumi/sdk/v3/go/auto	16.75s	pass
TestUpsertStackLocalSource	github.com/pulumi/pulumi/sdk/v3/go/auto	16.6s	pass
TestSupportsStackOutputs	github.com/pulumi/pulumi/sdk/v3/go/auto	16.34s	pass
TestNewStackLocalSource	github.com/pulumi/pulumi/sdk/v3/go/auto	15.94s	pass
TestUpdatePlans	github.com/pulumi/pulumi/sdk/v3/go/auto	12.86s	pass
TestEnvFunctions	github.com/pulumi/pulumi/sdk/v3/go/auto	12.46s	pass
TestStructuredOutput	github.com/pulumi/pulumi/sdk/v3/go/auto	12.08s	fail
TestSaveStackSettings	github.com/pulumi/pulumi/sdk/v3/go/auto	11.13s	pass
TestProgressStreams	github.com/pulumi/pulumi/sdk/v3/go/auto	10.31s	pass
TestInstallDefaultRoot	github.com/pulumi/pulumi/sdk/v3/go/auto	9.56s	fail
TestTagFunctions	github.com/pulumi/pulumi/sdk/v3/go/auto	4.31s	pass
TestWhoAmIDetailed	github.com/pulumi/pulumi/sdk/v3/go/auto	4.25s	pass
TestConfigFlagLike	github.com/pulumi/pulumi/sdk/v3/go/auto	2.88s	pass
TestNestedConfig	github.com/pulumi/pulumi/sdk/v3/go/auto	2.27s	pass
TestProjectSettingsRespected	github.com/pulumi/pulumi/sdk/v3/go/auto	1.46s	pass
TestPulumiCommand	github.com/pulumi/pulumi/sdk/v3/go/auto	350ms	pass
TestPulumiVersion	github.com/pulumi/pulumi/sdk/v3/go/auto	330ms	pass
TestByPassesRemoteCheck	github.com/pulumi/pulumi/sdk/v3/go/auto	0s	pass
TestClIWithoutRemoteSupport	github.com/pulumi/pulumi/sdk/v3/go/auto	0s	pass

Jan 26 '24 23:01 justinvp

It's entirely possible that this could be a blip with GitHub. GitHub Actions is dragging today. I've noticed it as well as others on the team. No mention of it on https://www.githubstatus.com, though.

Jan 26 '24 23:01 justinvp

This looks similar to https://github.com/pulumi/pulumi/issues/14300.

Jan 29 '24 10:01 tgummerer

Suggest we close 14300 as duplicate, this one has a bit more information on it.

Jan 29 '24 10:01 Frassle

I wanted to check the code coverage associated with some of the changes we postmortem'd, such as https://github.com/pulumi/pulumi/commit/a05c188edc3f2acd712657920be3571d1372e672. I checked CodeCov and couldn't find the commit listed there. Alarmingly, it looks like the merge workflow failed for this PR due to an intermittent failure installing Node on one of the jobs (and therefore the coverage data was not uploaded), but it still merged anyway. https://github.com/pulumi/pulumi/actions/runs/7713807240/job/21026572191

Feb 09 '24 01:02 justinvp

Apparently, skipped jobs are counted as successful.

Likely need a workaround like this: https://github.com/pulumi/pulumi-azure/blob/60738de3b1f19a8e2429332e16f32bbbd1caaec6/.github/workflows/run-acceptance-tests.yml#L258-L259

Mar 08 '24 18:03 justinvp

Looking into this, there's a few things happening in this issue that are getting a little conflated.

Alarmingly, it looks like the merge workflow failed for this PR due to an intermittent failure installing Node on one of the jobs (and therefore the coverage data was not uploaded), but it still merged anyway. https://github.com/pulumi/pulumi/actions/runs/7713807240/job/21026572191

Looking at the history here, there's a few commits that got created (iow put in the merge queue) within less than an hour:

https://github.com/pulumi/pulumi/commit/a05c188edc3f2acd712657920be3571d1372e672 (3:50pm UTC), 2) https://github.com/pulumi/pulumi/commit/afb287d2fb305531d63c18378efacedf44c9dcfc (3:53pm UTC), 3) https://github.com/pulumi/pulumi/commit/0113f27d6c39333d90042bba3cd790c5880fd6ef (4:45 UTC). Because we are doing squash merges I believe it's impossible to track down the exact time the commits got merged.

The way our merge queue is set up, it keeps PRs in the merge group even if their CI checks have failed. We can change this setting, but I don't believe this is necessary. ps_screenshot

So what happened here is that CI failed in this PR, but CI for a commit that was put in the merge group later, 3), succeeded while the first commit was still part of the merge group. Since CI was now green, the merge queue went ahead and merged the whole group. The reason why I think this setting is fine is that we care about the latest main branch being always green. With less flakes we could turn the setting on and make sure only PRs that are all green in the merge queue are going through. Right now I think that would be too expensive, but I could be convinced otherwise.

This is the merge run: https://github.com/pulumi/pulumi/actions/runs/7672650761

The *Annotations` shows this:

This is also fairly easily explained, we explicitly tell GitHub actions to not fail here: https://github.com/pulumi/pulumi/blob/master/.github/workflows/ci-run-test.yml#L376. This change was introduced in https://github.com/pulumi/pulumi/pull/10753, unfortunately without much explanation, but presumably this was fairly flaky, and still is. Spot checking the last 5 merge jobs, this step failed twice. If we want to make this an actual failure we'll need to try to de-flake this. However I'm not convinced that this step has enough value to spend a lot of engineering time on, since all it does is giving a summary of test times (Personal bias coming in here, but I've never looked at this). If people look at this fairly regularly we should definitely try to de-flake it, but making it an error while all the tests passed seems like it's not worth it.

What's very curious though is this:

Interestingly, if I look at the "Summarize Test Times by Individual Test" step, I do see the failures listed there:

https://github.com/pulumi/pulumi/actions/runs/7672650761/job/20913866181#step:45:61

There seem to be failed tests coming from somewhere, where we didn't notice the failures earlier. To me this looks like the crux of the issue, which also reminds me of https://github.com/pulumi/pulumi/issues/14300, where I believe I saw something similar.

Mar 15 '24 13:03 tgummerer

So what happened here is that CI failed in this PR, but CI for a commit that was put in the merge group later, 3), succeeded while the first commit was still part of the merge group. Since CI was now green, the merge queue went ahead and merged the whole group. The reason why I think this setting is fine is that we care about the latest main branch being always green.

Removing P1 label since this isn't as bad as we initially thought.

Mar 20 '24 22:03 justinvp

pulumi pulumi copied to clipboard

ci: Sometimes a change is merged even when there are test failures

pulumi
pulumi copied to clipboard