
cmd/ci-operator: Add --grace-period

wking opened this pull request • 12 comments

Because while two seconds may be sufficient teardown time for unit tests and such, it's too short for e2e teardown (where cluster logs should be collected, cluster resources need to be reaped, and collected assets need to be uploaded). The two-second timeout is from 2ac3abf (#16). And actually, I'm not sure how the two-second default worked even for unit tests, because Kubernetes pods have a 30-second grace period by default.

A recent aborted e2e job that did not run teardown to completion or collect any cluster assets is here and here. A recent successful e2e job showing current teardown timing is here and here. From the build log:

2019/02/14 18:19:54 Container setup in pod e2e-aws completed successfully
2019/02/14 18:46:08 Container test in pod e2e-aws completed successfully
2019/02/14 18:51:38 Container teardown in pod e2e-aws completed successfully

So 5.5 minutes to teardown. And from the installer's destroy cluster logs:

time="2019-02-14T18:47:40Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"openshiftClusterID\":\"1341ccc2-66ac-46bb-ae15-0295d4a126ba\"}"
...
time="2019-02-14T18:51:38Z" level=debug msg="Purging asset \"Cluster\" from disk"

So about 4 minutes of that is resource reaping (with the previous 1.5 minutes being log collection).

I'm still not sure where we set up the ci-operator invocation, which is where we'd set the new command-line option. And if we aren't comfortable setting this to six minutes or similar for all jobs, we may want to drop --grace-period and instead load this from the job's YAML configuration. @smarterclayton, thoughts?
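
For illustration, a minimal sketch of what the option could look like on the ci-operator side (the flag name is from this PR; the default and wiring are assumptions, not the actual implementation):

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

func main() {
	// Hypothetical sketch: a configurable grace period replacing the
	// hard-coded two-second sleep from 2ac3abf.
	gracePeriod := flag.Duration("grace-period", 2*time.Second,
		"how long to wait for teardown containers after an interrupt")
	flag.Parse()
	fmt.Printf("will wait %s for teardown before exiting\n", *gracePeriod)
}
```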

wking avatar Feb 15 '19 06:02 wking

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull request has been approved by: wking. To fully approve this pull request, please assign additional approvers. We suggest the following additional approver: stevekuznetsov.

If they are not already assigned, you can assign the PR to them by writing /assign @stevekuznetsov in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

openshift-ci-robot avatar Feb 15 '19 06:02 openshift-ci-robot

I'm still not sure where we set up the ci-operator invocation, which is where we'd set the new command-line option.

ci-operator invocation is part of the Prow job configuration, which is roughly 90% Prowgen-generated and 10% hand-crafted. Prowgen already has separate paths for container-based jobs and template-based jobs, so it could generate different parameters for each type. Or, as you suggest, we could make this configurable in the job's configuration, but then people would need to know what to set there, and it would probably end up very similar to "containers->short, template->long".
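
A rough sketch of that "containers->short, template->long" split as Prowgen might generate it (the helper and the specific durations are hypothetical):

```go
package main

import (
	"fmt"
	"time"
)

// gracePeriodFor is a hypothetical helper: container-based jobs get a short
// grace period, while template-based (e2e) jobs get enough time for log
// collection, resource reaping, and artifact upload.
func gracePeriodFor(templateJob bool) time.Duration {
	if templateJob {
		return 7 * time.Minute
	}
	return 30 * time.Second
}

func main() {
	fmt.Println(gracePeriodFor(true), gracePeriodFor(false))
}
```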

petr-muller avatar Feb 15 '19 13:02 petr-muller

/hold

Will review today. I also wanted this, but there are some subtleties that could be an issue.

smarterclayton avatar Feb 15 '19 15:02 smarterclayton

Is a globally long period bad?

If you use CI for fixing your bugs, you may not want to have to wait 7 minutes (or whatever we want to use to accommodate e2e) after a force-push before your aborted unit-test job cleans up and you can launch the new one. I'm actually not sure how Prow handles triggering a new round after an abort. Does it hold jobs on the new tip until the old jobs have all been reaped?

wking avatar Feb 15 '19 19:02 wking

Isn't that just the grace period though, so if your test exited faster we'd move on to cleanup faster, right?

stevekuznetsov avatar Feb 16 '19 16:02 stevekuznetsov

Does it hold jobs on the new tip until the old jobs have all been reaped?

No

stevekuznetsov avatar Feb 16 '19 16:02 stevekuznetsov

Isn't that just the grace period though, so if your test exited faster we'd move on to cleanup faster, right?

No, because there's not currently a channel from the pod-deletion code back to the handler manager, so it just sleeps to give the pod-deleter the time it might need.
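
A minimal sketch of the kind of linking being discussed: the interrupt handler waits on a channel closed by the pod-deletion goroutine instead of sleeping a fixed two seconds (all names hypothetical; not the actual ci-operator code):

```go
package main

import (
	"fmt"
	"time"
)

// deletePods stands in for the pod-deletion code; it closes done once
// cleanup has actually finished.
func deletePods(done chan<- struct{}) {
	defer close(done)
	time.Sleep(500 * time.Millisecond) // pretend cleanup work
}

func main() {
	done := make(chan struct{})
	go deletePods(done)

	gracePeriod := 7 * time.Minute
	select {
	case <-done:
		fmt.Println("cleanup finished; exiting early")
	case <-time.After(gracePeriod):
		fmt.Println("grace period expired; exiting anyway")
	}
}
```

With this shape, the grace period becomes an upper bound rather than a fixed wait: fast tests exit as soon as their cleanup completes.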

Does it hold jobs on the new tip until the old jobs have all been reaped?

No

How does that work with recycling jobs on the same namespace? Maybe if someone runs /test e2e-aws while there's an existing run in progress? Does that abort the earlier job?

wking avatar Feb 16 '19 16:02 wking

No, because there's not currently a channel from the pod-deletion code back to the handler manager, so it just sleeps to give the pod-deleter the time it might need.

Wow, here I am learning about how ci-operator works. Ew. Can we fix this?

How does that work with recycling jobs on the same namespace? Maybe if someone runs /test e2e-aws while there's an existing run in progress? Does that abort the earlier job?

Uh -- individual jobs will be killed, yes. But there is no interdependency between jobs. When a ProwJob is created, if a ProwJob already exists that is running for the same org/repo#pull and started earlier, we issue a pod deletion... does the default k8s client do blocking deletions? If not, this might be problematic and we may have multiple pods running.
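
For reference, client-go's pod Delete is not blocking: it returns once the API server accepts the deletion, not when the pod is actually gone. A sketch of deleting and then polling until the pod no longer exists (modern client-go signatures, which postdate this thread; error handling trimmed):

```go
package podutil

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForPodGone deletes a pod and polls until the API server reports it
// absent, since Delete itself does not wait for pod termination.
func waitForPodGone(ctx context.Context, client kubernetes.Interface, ns, name string) error {
	grace := int64(420) // e.g. seven minutes for e2e teardown
	if err := client.CoreV1().Pods(ns).Delete(ctx, name,
		metav1.DeleteOptions{GracePeriodSeconds: &grace}); err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	for {
		_, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return nil
		}
		if err != nil {
			return err
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("timed out waiting for pod %s/%s to go away", ns, name)
		case <-time.After(2 * time.Second):
		}
	}
}
```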

stevekuznetsov avatar Feb 17 '19 01:02 stevekuznetsov

@smarterclayton this is outstanding for a review

stevekuznetsov avatar Mar 06 '19 01:03 stevekuznetsov

No need for review yet, I was going to reroll to get better abort linking between the goroutines first.

wking avatar Mar 06 '19 03:03 wking

Note that we don't have to kill the pod with a grace period or wait for it to exit for teardown to happen.

We "fire and forget" pod deletion because the pod itself manages teardown within its grace period. It is possible for a pod being shut down to be evicted, but that should be vanishingly rare.
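
"Fire and forget" works because the kubelet sends SIGTERM to the pod's containers on deletion and only sends SIGKILL once the grace period expires, so the teardown logic can live inside the pod itself. A minimal sketch of a process doing its own teardown on SIGTERM (the teardown body is hypothetical):

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	sigs := make(chan os.Signal, 1)
	// The kubelet delivers SIGTERM on pod deletion and SIGKILL only after
	// the pod's terminationGracePeriodSeconds has elapsed.
	signal.Notify(sigs, syscall.SIGTERM)

	// ... normal work would run here ...

	<-sigs
	fmt.Println("SIGTERM received: collecting logs, reaping resources, uploading assets")
	// Teardown must finish within the grace period or it will be cut short.
	os.Exit(0)
}
```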

smarterclayton avatar Mar 06 '19 14:03 smarterclayton

@wking: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot avatar Apr 24 '19 22:04 openshift-ci-robot