node icon indicating copy to clipboard operation
node copied to clipboard

add `resume-ci` label

Open MoLow opened this issue 1 year ago • 20 comments

fixes https://github.com/nodejs/node/issues/40817 awaiting merge of https://github.com/nodejs/node-core-utils/pull/642

MoLow avatar Jul 21 '22 12:07 MoLow

Review requested:

  • [ ] @nodejs/actions
  • [ ] @nodejs/tsc

nodejs-github-bot avatar Jul 21 '22 12:07 nodejs-github-bot

I'm very reluctant about adding this because I think it's a mistake to resume CI without looking at the failures, and if you are looking at the failures you can just click on the Jenkins button.

targos avatar Jul 21 '22 12:07 targos

I'm very reluctant about adding this because I think it's a mistake to resume CI without looking at the failures, and if you are looking at the failures you can just click on the Jenkins button.

Unless you are not a collaborator (e.g. a trigger), in which case you cannot resume using Jenkins CI. I agree with the sentiment though.

aduh95 avatar Jul 21 '22 12:07 aduh95

I agree, CI should only be resumed if a human has made sure the specific flakiness existed before that PR - this is a tool that can help triaggers do something they currently cannot do

this was already discussed in the original issue https://github.com/nodejs/node/issues/40817#issuecomment-1058220803

MoLow avatar Jul 21 '22 12:07 MoLow

also, as for my understanding, TSC has addressed (part) of this discussion as well? https://github.com/nodejs/node/issues/42125#issuecomment-1065803913 https://github.com/mhdawson/TSC/blob/a52f5bb892d986e470661c15635d79d384302dd1/meetings/2022-03-10.md#nodejsnode

MoLow avatar Jul 21 '22 12:07 MoLow

Was it considered to allow triagers to access the Jenkins feature?

targos avatar Jul 21 '22 12:07 targos

Was it considered to allow triagers to access the Jenkins feature?

IIRC that is the intention here: not to make resuming CIs simpler in general, but as a workaround to give this particular permission to triagers.

tniessen avatar Jul 21 '22 13:07 tniessen

I mean, do we really need a workaround, rather than giving them the permission inside Jenkins?

targos avatar Jul 21 '22 13:07 targos

I mean, do we really need a workaround, rather than giving them the permission inside Jenkins?

If that is possible that sounds like a better solution to me

MoLow avatar Jul 21 '22 13:07 MoLow

I'm very reluctant about adding this because I think it's a mistake to resume CI without looking at the failures, and if you are looking at the failures you can just click on the Jenkins button.

If I didn't need to resume CI repeatedly for every PR, I would sympathize with this. But CI is much too flaky for looking at errors to be worth my time until after I've resumed at least three, maybe five times. It should resume by default and stop after 3-5 attempts.

GeoffreyBooth avatar Jul 21 '22 15:07 GeoffreyBooth

But CI is much too flaky for looking at errors to be worth my time until after I've resumed at least three, maybe five times. It should resume by default and stop after 3-5 attempts.

That is exactly how I assume #43522 was merged, and it has made CI much, much worse for all collaborators, to the point where it was virtually unusable for days.

It should resume by default and stop after 3-5 attempts.

Resuming CI without looking at errors makes it more likely to miss related failures and thus to introduce new flaky tests, which makes the situation worse for everyone, beyond that PR.

Let's say a PR introduces a test that flakes 50 % of the time. Running one CI and checking for errors gives you a 50 % chance of catching it. Resuming CI without checking what errors occurred reduces the chances of catching the flaky test exponentially!

I can't find it right now, but there was a PR a while ago that implemented something like this (i.e., automated resuming or something similar), which I was against for the same reason.


There are other approaches that might be worth investigating. For example, the test runner could, when a test fails, re-run it n times and count how many times it fails. This can be used to estimate whether a test flaked or whether it failed deterministically, but CI should still fail. If a test is marked as flaky, regardless of whether it passes, the test runner could run it n times and see if at least one run passes, to make sure it really is flaky and not failing every time. (But this would need some experimentation.)

tniessen avatar Jul 23 '22 10:07 tniessen

The problem is more profound: it's virtually impossible to get a green CI without resuming. I currently have 7 PRs that I'm restarting CI every day.

Something that would be extremely helpful is to get the list of failed tests as a PR comment.

We could have a different approach:

  1. mark all tests that fails at least once a day as flaky
  2. only run a reduced set of very reliable tests on the platforms that are more likely to fail (windows, arm, smartos, aix). This can be based on the support tier

mcollina avatar Jul 23 '22 11:07 mcollina

I completely agree @mcollina, we are in a tough spot right now. All I am saying is that resuming without properly checking for errors only makes things worse.

mark all tests that fails at least once a day as flaky

Big +1 as long as we treat it as an urgent TODO list. (Essentially what I wrote in https://github.com/nodejs/node/pull/43754#issuecomment-1180108835.)

tniessen avatar Jul 23 '22 11:07 tniessen

I think that this PR would be a great change. But I also think that request-ci should always do the right thing: if there are no commits since the last CI, it should retry the CI. If there is an intervening commit, it should start a new build. People shouldn't need to think about which is correct. nodejs/node#44130 clears up some of the documentation here but I think it's hard to solve UX issues with documentation.

(signed, an idiot who didn't realize request and retry were different, and so requested too many builds)

kvakil avatar Aug 05 '22 03:08 kvakil

@kvakil that sounds a great improvement! but:

  1. I feel this PR is currently too controversial to land, without addressing some of the feedback. Perhaps if we add your suggestion with a limit that after 3 times, it will comment on the PR - stating that "resume must be done manually after three failures." WDYT?
  2. The PR in ncu needs an approval first. It seems to me like it is ok merging it since running a cli command won't be abused easily as a label, can someone approve and merge it? CC @nodejs/node-core-utils

MoLow avatar Aug 05 '22 05:08 MoLow

In my mind resume wouldn't happen automatically, the author would still need to go back and retag the PR with request-ci in order to resume it. There just wouldn't be a separate resume-ci label: ideally the tooling can detect if the CI needs to be rebuilt entirely (if there were additional commits since the last CI) or if it can just be resumed (if there have been no additional commits).

& to be clear this is just a wish, I definitely don't think it should stop us from using this PR. I am not sure how hard the implementation would be. I just think the user experience would be better.

kvakil avatar Aug 05 '22 05:08 kvakil

@kvakil adding a request-ci label is still too automatic in the sence that CI failures should not be ignored, and need minimal inspection. See https://github.com/nodejs/node/pull/43929#issuecomment-1191419576

MoLow avatar Aug 05 '22 06:08 MoLow

I just think the user experience would be better.

Currently the user experience is to go to the Jenkins Web UI to check what are the failures, and if it turns out the failures are indeed unrelated to the PR to test, the "Resume build" button is right there on Jenkins UI, there's no reason to go back to GitHub to add a label. What I'm trying to say is this label is not meant to improve the UX (it won't), it's to enable triagers to resume CIs, which they currently can't.

People shouldn't need to think about which is correct.

That's exactly what's controversial about this PR: some are concerned that adding this label would make collaborators/triagers less likely to think about the CI failures and instead re-apply the label without checking the failures until they get a passing CI – in particular, it would enable the landing of PRs that introduce flaky tests, making the CI even less reliable than they currently are.

  1. Perhaps if we add your suggestion with a limit that after 3 times, it will comment on the PR - stating that "resume must be done manually after three failures." WDYT?

That'd be a very nice feature indeed! I would go further: only accept 1 request-ci and one resume-ci when no new commits have landed; with clear error messages it could help educate people on the specificities of our CI system, and it addresses the concern that folks could use the system to land flaky PRs. (Lately you rarely need more than one resume to get a passing CI anyway).

aduh95 avatar Aug 05 '22 08:08 aduh95

That'd be a very nice feature indeed! I would go further: only accept 1 request-ci and one resume-ci when no new commits have landed; with clear error messages it could help educate people on the specificities of our CI system, and it addresses the concern that folks could use the system to land flaky PRs. (Lately you rarely need more than one resume to get a passing CI anyway).

yes, I will probably wait for my nomination to complete so my jenkins token will actually work :)

MoLow avatar Aug 05 '22 08:08 MoLow

just think the user experience would be better.

Currently the user experience is to go to the Jenkins Web UI to check what are the failures, and if it turns out the failures are indeed unrelated to the PR to test, the "Resume build" button is right there on Jenkins UI, there's no reason to go back to GitHub to add a label. What I'm trying to say is this label is not meant to improve the UX (it won't), it's to enable triagers to resume CIs, which they currently can't.

The particular UX I dislike here is having resume-ci and request-ci which to me sound very similar. I think I understand the concerns around having one label better now, thanks for elaborating.

kvakil avatar Aug 05 '22 09:08 kvakil