operator-lifecycle-manager

🌱 (e2e): test: skip terminating pods in catalog image update test

Open camilamacedo86 opened this issue 1 month ago • 5 comments

Description

Fix flaky e2e test [CatalogSource] image update that fails when terminating pods are present during catalog source rollouts.

Problem

The test was failing with:

unexpected number of registry pods found
Expected <[]v1.Pod | len:2> to have length 1

During catalog source image updates, there can be 2 pods temporarily:

  • 1 old pod being deleted (with DeletionTimestamp set)
  • 1 new pod starting up

The test was counting terminating pods and failing instead of focusing on the actual requirement: verifying the catalog image was updated.

Solution

Skip pods with DeletionTimestamp != nil when checking for the updated image.

This makes the test resilient to transient states during pod rollouts while still verifying the core requirement.

Changes

  • Add check to skip terminating pods in podCheckFunc (3 lines)
  • Test now focuses on: "Does an active pod exist with the new image?"

Testing

  • Fixes downstream failure: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_operator-framework-olm/1142/pull-ci-openshift-operator-framework-olm-main-e2e-gcp-olm/1987692171796418560

camilamacedo86 avatar Nov 10 '25 07:11 camilamacedo86

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: (no one yet). Once this PR has been reviewed and has the lgtm label, please assign kevinrizza for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci[bot] avatar Nov 10 '25 07:11 openshift-ci[bot]

Extra checks never hurt.

That is not true. See the downstream failure this PR fixes: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_operator-framework-olm/1142/pull-ci-openshift-operator-framework-olm-main-e2e-gcp-olm/1987692171796418560

We are checking whether there is only 1 pod. However, the downstream tests run in parallel, which is why we can have more than 1 pod.

My ideal solution here would be to not remove these checks in this PR, observe if we're still getting flaky results, and then look into why we are (possibly) "still failing".

By looking at the failure, we already know what the error is.

My main concern is that removing these checks reduces the sanctity of this test, at least the way it reads serially (and was intended to be read)

What we want to check with this test is still checked.

camilamacedo86 avatar Nov 10 '25 14:11 camilamacedo86

However, the downstream tests run in parallel, which is why we can have more than 1 pod.

This argument is really confusing me. These tests have always run in parallel. By this logic we should never have seen this test passing.

Do we have any evidence to back up the claim that running these tests in parallel causes pod count to be more than 1, rendering the check useless?

Even if the problem was in fact parallelism, the solution here would be to mark these tests to run in serial, not strip away code that gives us confidence on these tests.

anik120 avatar Nov 10 '25 15:11 anik120

Hi @anik120

This argument is really confusing me.

Sorry, but can you please take a look at the error (https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_operator-framework-olm/1142/pull-ci-openshift-operator-framework-olm-main-e2e-gcp-olm/1987692171796418560) and the description of this PR?

The only check that we need to do here is: "Does an active pod exist with the new image?"

If we have more than one pod (as happened downstream), the test should not fail as long as one active pod passes that check.

camilamacedo86 avatar Nov 10 '25 15:11 camilamacedo86

Hi @anik120

This argument is really confusing me.

Sorry, but can you please take a look at the error (https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_operator-framework-olm/1142/pull-ci-openshift-operator-framework-olm-main-e2e-gcp-olm/1987692171796418560) and the description of this PR?

The only check that we need to do here is: "Does an active pod exist with the new image?"

If we have more than one pod (as happened downstream), the test should not fail as long as one active pod passes that check.

I'm a little confused as well. If the tests run in their own namespace, then parallelism shouldn't be a factor, right? Could the extra pod in the logs be the one still being deleted (which we now filter out in the test)?

perdasilva avatar Nov 13 '25 08:11 perdasilva