community Document the `force-skip-ci` usage

A force-skip-ci label has been added a long time ago in order to allow folks to by-pass the CI for one reason or another. However, there's the concern of having this misused (not due to bad intention, but rather due to a mistake), which brings up the need of a written policy to use it.

We can also extend this issue to discuss whether having such label is valid or not.

/cc @kata-containers/architecture-committee

Jan 20 '22 17:01 fidencio

I wasn't aware of the flag, so indeed we should setup documentation around it (or I should be more involved and pay attention!).

I worry that this is a very big hammer that is available to too many people without any process defined currently.

In the past, we've done similar, maybe more terrible, things by editing the required jobs associated with a repository, based on the repository's settings.

I would hope, but not necessarily expect, that the list of people who can apply a label to a PR/issue is a much larger superset of the people who have administrative rights to modify repository settings in GitHub. If that is the case, I'd prefer we define a process, and rely on changing the settings rather than using a flag. If that is not the case, I want to see what we can do to make it the case.

Jan 20 '22 17:01 egernst

I am afraid that's not the case, @egernst , and I think it deserves yet another issue (and a nice discussion) on who are the folks who should have "admin" rights to the repos from now on, and revisit everyone's access based on that decision.

Jan 20 '22 18:01 fidencio

Still about this one, even setting the force-skip-ci we still need the approval of two other maintainers. This part is not skipped. So, yes, this is a big hammer, but this is a big hammer that at least 3 maintainers should agree on using it and any other reviewer can block it.

Jan 20 '22 20:01 fidencio

My other concern is the consequence of using the flag. IMO we need the flag because we want to bypass some tests in order to merge a PR that cannot pass CI but is not supposed to break anything in CI after being merged. So the CI jobs that we can bypass must fall into the category that it ONLY check something that belongs to the targeting PR and that something will not be checked after the PR is merged.

So checking the commit message format is one example of such CI jobs that can safely be skipped. But in the case of force-skip-ci flag, we are skipping much more than that. E.g., the static checker is skipped in kata-containers/kata-containers#3468 and that's why it broke the CI after it was merged.

So my suggestion would be, let's remove force-skip-ci, and add something that is really needed but nothing more, skip-commit-message-check. I am not sure if there are other types of such safe jobs to skip, but we can add things incrementally.

Jan 21 '22 03:01 bergwolf

E.g., the static checker is skipped in kata-containers/kata-containers#3468 and that's why it broke the CI after it was merged.

I disagree here, @bergwolf.

So, let's take a step back and evaluate what actually happened with kata-containers/kata-containers#3468.

I raised a PR that I knew that would break the CI.

I advertised that in the first message of the PR.

We'll need further commits to actually start using govmm from the kata-containers repo, but those will come later as this commit will, most likely, have to be merged with a force-skip-ci for now.

I worked on solving the static-checks issues before having the PR merged

I am adding the do-not-merge back, in order to ensure I can test things on my fork first (there's a short series making static checks and other things happy that depends on this one).

After that I had everything working on my fork, I proceeded with merging https://github.com/kata-containers/kata-containers/pull/3468
A few minutes after having https://github.com/kata-containers/kata-containers/pull/3468 merged, I raised https://github.com/kata-containers/kata-containers/pull/3496, which included the fixes.
- There I found out issues that I was NOT able to catch from the static-checks, which were architecture specific issues, and I worked to solve them (as in, report and work them around) before the end of my day, and I only finished my day being sure CI would NOT be blocked.
A NON related failure happens on one of the CIs
Absolutely no-one took the time to check whether the failure was related or not
The CI had to wait for me to come back, 8 hours later, to re-run the job with the NON related failure
Situation got fixed

Looking at what happened I don't think the CI was broken for 12 to 18 hours because of the force-skip-ci, although I think we should discuss its existence. Looking at what happened I think the CI was broken for 12 to 18 hours because of a combination of:

Not having a way to test on different architectures locally, which is quite hard to solve by the way.
Not having a stable enough CI (so we could avoid the non related failures)
Not having folks engaged enough to actually look at the failures and re-run a CI

I'm all in to solve all those issues, but the third one has to come from each contributor.

Jan 21 '22 06:01 fidencio

And just for the archaeological sake, https://github.com/kata-containers/tests/pull/2915 is when this was introduced.

Jan 21 '22 11:01 fidencio

@bergwolf and I had a call Today, where we discussed this topic and here's pretty much the summary of the discussion.

We should, instead of having the force-skip-ci label, have ways to bypass some specific checks that break the current CI status, but that will NOT leave the CI broken after that. However, in order to better identify and act on those we need to do some work beforehand, as:

Break the action used for static-checks into smaller steps.
- This will require changes on how we call the static-check, but that's reasonably fine as the command already takes arguments for each one of the checks it runs (see: https://github.com/kata-containers/tests/blob/003f84394ba8a3a18c19e15cb60ff1e8ce39f6aa/.ci/static-checks.sh#L1348-L1371).
Evalute from each one of the new pieces could have a "skip" label (for urgent cases).
Once we have it in place, drop the force-skip-ci label, in favour of its smaller and self-contained new parts.

Meanwhile, we should really re-visit who has permissions to do what in our repo and also get an agreement on https://github.com/kata-containers/community/issues/249.

This is just a summary of what's been discussed and issues will be opened later on.

Jan 21 '22 12:01 fidencio

Firstly, I'd better own up as the author of the force-skip-ci feature.

Once we have it in place, drop the force-skip-ci label, in favour of its smaller and self-contained new parts.

I'm not necessarily against this, but sometimes we just need a way to bypass the CI (almost) entirely, and quickly. We could just disable all the PR checks for the repo in the GH settings, but that is not something we would actually want to do.

I like the idea of more granular skipping, but the worst case might be to specify quite a large bunch of magic labels / comments to skip the CI which sounds a bit awkward (and slow).

Maybe we could retain the force-skip-ci feature "for emergencies", but only allow it for named users (aka the AC members plus CI owners?) That way we'd have a small number of users with "CI super user" abilities".

This sounds like a topic for the AC to discuss.

Jan 21 '22 16:01 jodh-intel

And once we've got an agreed plan for this, it would be great to have commitment from all interested parties to help implement, test, document and review the changes. And help review the existing user rights on the two main repos.

Jan 21 '22 16:01 jodh-intel

Just deleted my last comment as I refreshed the page and saw your last comment

@fidencio To add more details about this one (which was also part of our discussion, and agreement if I understand correctly)

Evalute from each one of the new pieces could have a "skip" label (for urgent cases).

The standard of whether a piece can be skipped or not, is that

It should not have impact on the CIs for PRs after the targeting PR is merged

So it narrows the candidate jobs to be almost only the commit message check, and the github issus mover. And it won't result in a situation that the worst case might be to specify quite a large bunch of magic labels / comments to skip the CI which sounds a bit awkward (and slow). as being raised by @jodh-intel

Jan 24 '22 02:01 bergwolf

@jodh-intel

sometimes we just need a way to bypass the CI (almost) entirely, and quickly

Yes we do. And I agree with @egernst that we need a process to handle it through GH settings instead of trying to solve it with labels.

Jan 24 '22 03:01 bergwolf

The commit message check is already done in a separate action. So this looks much easier to solve:

remove force-skip-ci label
add skip-commit-message-check label
skip commit message checker when skip-commit-message-check label is set (which depends on force-skip-ci right now)

There is also checkcommits in the tests repo. IMO we can just rely on the GH action for commit message check, no need to check twice.

Jan 24 '22 03:01 bergwolf

@fidencio @jodh-intel @egernst Looking at .github/workflows in the kata-containers repo, we have these actions respecting force-skip-ci:

PR-wip-checks.yaml
commit-message-check.yaml
kata-deploy-push.yaml
move-issues-to-in-progress.yaml
require-pr-porting-labels.yaml
snap.yaml
static-checks.yaml

Among them, kata-deploy-push.yaml, snap.yaml and static-checks.yaml will impact other PRs if they are skipped. So how about removing force-skip-ci from these three, and replace it with skip-side-effectless-ci in all the other actions? This will be much easier to implement than splitting static-check.sh, which is a good target btw, but may not be a MUST-have to fix force-skip-ci.

Jenkins jobs also check for the force-skip-ci label and skip all tests. I think it should be dropped too there as well. WDYT?

Jan 24 '22 04:01 bergwolf

community community copied to clipboard

Document the `force-skip-ci` usage

community
community copied to clipboard