
template: wait for pod to tear down (if container is present) during delete

Open • vrutkovs opened this pull request • 14 comments

Pod teardown may take longer than 5 minutes (the default ci-operator timeout). This commit ensures the same timeout is applied to the teardown container first, and then to the pod again.

This is useful for rehearse jobs, which reuse the namespace when testing a new commit.

TODO:

  • [ ] Add tests
  • [x] Avoid using two watchers? Extend timeout to 10 mins if there is a teardown container?

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1707486

vrutkovs avatar May 27 '19 14:05 vrutkovs

Created a better version of this:

  1. Added a few helper functions
  2. Before the loop, wait for teardown to complete (if this container exists in the pod)
  3. Wait for the pod to be deleted

I haven't had a chance to test this yet

vrutkovs avatar May 29 '19 17:05 vrutkovs

> What problem does this actually solve? If we wait for a pod to be deleted, what good is waiting for its containers to terminate?

ci-operator currently waits only 300 seconds. If teardown hasn't finished by then, the pod is removed anyway: leftover artifacts (usually Route53 records) remain and cause issues on the next retest.

vrutkovs avatar Jun 03 '19 08:06 vrutkovs

/hold

I can't come up with a way to test this yet

vrutkovs avatar Jun 04 '19 09:06 vrutkovs

/cc @stevekuznetsov

vrutkovs avatar Jun 07 '19 23:06 vrutkovs

> Why are we making this change?

When a test gets cancelled - for instance, when a new commit is pushed in rehearse tests - ci-operator sends a termination signal and waits 5 minutes for the pod to be gone. In most install tests, teardown plus artifact collection takes longer than 5 minutes.

This change waits longer only if the pod has a teardown container, so it won't affect unit tests, release jobs, etc. It would first wait 5 minutes for teardown to complete and then 5 minutes for the pod to disappear (in most installs, teardown completes in 5-7 minutes).

See also https://github.com/openshift/ci-operator/pull/353#issuecomment-498171087

vrutkovs avatar Jun 08 '19 08:06 vrutkovs

OK, makes sense. Why not start a watch?

stevekuznetsov avatar Jun 08 '19 15:06 stevekuznetsov

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vrutkovs. To complete the pull request process, please assign stevekuznetsov. You can assign the PR to them by writing /assign @stevekuznetsov in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

openshift-ci-robot avatar Jun 10 '19 09:06 openshift-ci-robot

Reworked this to leverage polls and watches (see the sketch after this list):

  • waitForTeardownToComplete polls for a list of containers
  • waitForPodDeletion watches for the pod to be removed, with a timeout
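
For illustration only, a minimal sketch of what those two helpers could look like with a recent client-go (where Get/Watch take a context). The function names mirror the comment above; the signatures, the "teardown" container name, and everything else are assumptions, not the actual PR code:

```go
package teardown

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// waitForTeardownToComplete polls the pod's container statuses and returns once
// the "teardown" container (if present) has terminated, or the timeout expires.
func waitForTeardownToComplete(ctx context.Context, client kubernetes.Interface, namespace, name string, timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		pod, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		for _, status := range pod.Status.ContainerStatuses {
			if status.Name == "teardown" {
				return status.State.Terminated != nil, nil
			}
		}
		// No teardown container in this pod: nothing to wait for.
		return true, nil
	})
}

// waitForPodDeletion starts a watch on the pod and returns once a Deleted event
// arrives, or the timeout expires.
func waitForPodDeletion(ctx context.Context, client kubernetes.Interface, namespace, name string, timeout time.Duration) error {
	watcher, err := client.CoreV1().Pods(namespace).Watch(ctx, metav1.ListOptions{
		FieldSelector: fmt.Sprintf("metadata.name=%s", name),
	})
	if err != nil {
		return err
	}
	defer watcher.Stop()

	deadline := time.After(timeout)
	for {
		select {
		case event, ok := <-watcher.ResultChan():
			if !ok {
				return fmt.Errorf("watch for pod %s/%s closed unexpectedly", namespace, name)
			}
			if event.Type == watch.Deleted {
				return nil
			}
		case <-deadline:
			return fmt.Errorf("pod %s/%s was not deleted within %s", namespace, name, timeout)
		}
	}
}
```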

vrutkovs avatar Jun 10 '19 09:06 vrutkovs

LGTM, let's give @stevekuznetsov a chance to review

petr-muller avatar Jun 10 '19 13:06 petr-muller

Simplified this (see the sketch after this list):

  • If the pod has a teardown container, the timeout is extended by 10 minutes (average teardown takes 10 minutes on AWS, but might take longer on OpenStack)
  • waitForPodDeletion now uses a watcher instead of polling. Should it also print a message when teardown has completed?
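
As a rough illustration of the timeout extension in the first bullet (the helper name, container name, and base timeout are made up for this sketch, not taken from the PR):

```go
package teardown

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// deletionTimeout returns how long to wait for the pod to disappear. Pods that
// carry a teardown container get an extra 10 minutes, since teardown on AWS
// averages around 10 minutes and can take longer on OpenStack; all other pods
// keep the base timeout, so unit tests and release jobs are unaffected.
func deletionTimeout(pod *corev1.Pod, base time.Duration) time.Duration {
	for _, container := range pod.Spec.Containers {
		if container.Name == "teardown" {
			return base + 10*time.Minute
		}
	}
	return base
}
```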

vrutkovs avatar Jun 11 '19 11:06 vrutkovs

> That would hide potential issues in teardown

Which? Before we spend more time working on an implementation, can we determine why the (stupid simple) approach of doing more retries over a 10-, 20-, or 30-minute period would not be appropriate? Do we have some SLA for teardown time?

stevekuznetsov avatar Jun 12 '19 16:06 stevekuznetsov

Extending the timeout is the simplest approach, and it's valid; however, it would apply to all ci-operator pods.

e2e-aws teardown is the only one I know of that takes longer than 5 minutes at the moment; other types of tests may rely on the existing timeout.

This PR is just one possible way, of course. If it looks overcomplicated, then let's just bump the timeout on teardown to at least fix the rehearse failures.

vrutkovs avatar Jun 12 '19 17:06 vrutkovs

Of course it would hit all pods, but we poll every 2 seconds right now, so the only case where increasing the timeout would actually increase the time taken for the test to run is when the pod is not gone within the current timeout, and then it would only increase it by the time taken to finish tearing down, right?
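
To make that argument concrete, a poll loop along these lines (assuming client-go; the helper name and values are illustrative, not ci-operator's actual code) returns the moment the pod is gone, so raising the overall timeout only adds wall-clock time in runs where teardown genuinely outlives the old limit:

```go
package teardown

import (
	"context"
	"time"

	kerrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForPodGone polls every 2 seconds until the pod no longer exists or the
// timeout expires. A larger timeout does not slow down the happy path: the
// loop exits on the first NotFound, however early that happens.
func waitForPodGone(ctx context.Context, client kubernetes.Interface, namespace, name string, timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		_, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
		if kerrors.IsNotFound(err) {
			return true, nil
		}
		return false, err
	})
}
```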

stevekuznetsov avatar Jun 12 '19 17:06 stevekuznetsov

Right, it appears a larger timeout would behave the same way. Created #358 instead.

I'll keep this open for now

vrutkovs avatar Jun 12 '19 19:06 vrutkovs