ci-operator
ci-operator copied to clipboard
template: wait for pod to teardown (if container is present) during delete
Pod teardown may take longer than 5 mins (default ci-operator timeout). This commit would ensure the same timeout is applied to the teardown container - and then applied to the pod again.
This is useful for rehearse jobs, which reuse the namespace when testing a new commit.
TODO:
- [ ] Add tests
- [x] Avoid using two watchers? Extend timeout to 10 mins if there is a teardown container?
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1707486
Created a better version of this:
- added a few helper functions
- Before the loop wait for teardown to complete (if this container exists in the pod)
- Wait for pod to be deleted
I didn't get a chance to test this yet
what problem does this actually solve? If we wait for a pod to be deleted, what good is waiting for its containers to terminate?
ci-operator would wait for 300 secs only. If teardown didn't finish by that time the pod would be removed: leftover artifacts (usually Route53 records) would remain and cause issues on next retest
/hold
I can't come up with a way to test this yet
/cc @stevekuznetsov
Why are we making this change?
When test gets cancelled - a new commit pushed in rehearse tests for instance - ci-operator would send termination signal and wait for pod to be gone for 5 mins. In most install tests teardown + artifacts take longer than 5 mins.
This change would wait longer if pod has teardown
container - so that it won't affect unit tests / release jobs etc. It would first wait 5 mins for teardown to complete and then 5 mins for pod to disappear (in most of the installs teardown completes in 5-7 minutes).
See also https://github.com/openshift/ci-operator/pull/353#issuecomment-498171087
OK, makes sense. Why not start a watch?
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: vrutkovs
To complete the pull request process, please assign stevekuznetsov
You can assign the PR to them by writing /assign @stevekuznetsov
in a comment when ready.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve
in a comment
Approvers can cancel approval by writing /approve cancel
in a comment
Reworked this to leverage polls and watches:
-
waitForTeardownToComplete
polls for a list of containers -
waitForPodDeletion
watches for pod to be removed with a timeout
LGTM, let's give @stevekuznetsov a chance to review
Simplified this:
- If pod has teardown container timeout is extended by 10 mins (average teardown takes 10 min on AWS, but might take longer on openstack)
-
waitForPodDeletion
now uses a watcher instead of polling. Should it also print the message when teardown has completed?
That would hide potential issues in teardown
Which? Before we spend more time working on an implementation can we determine why the (stupid simple) approach of doing more retries over a 10, 20, 30 minute period would not be appropriate? Do we have some SLA for teardown time?
Extending the timeout is the simplest approach, and its valid, however it would apply to all ci-operator pods.
e2e-aws's teardown is the only one I know of which takes longer than 5 mins at the moment, other types of tests may rely on existing timeout.
This PR is just one possible way, of course. If it looks overcomplicated then lets just bump the timeout on teardown to fix rehearse failures at least
Of course it would hit all pods, but we poll every 2 seconds right now, so the only cases where increasing the timeout would actually increase the time taken for the test to run is if the pod is not returning in the current timeout, and then it would only increase it by the time taken to finish tearing down, right?
Right, it appears a larger timeout would act the same. Created #358 instead.
I'll keep this open for now