🌱 (e2e): test: skip terminating pods in catalog image update test
Description
Fix flaky e2e test [CatalogSource] image update that fails when terminating pods are present during catalog source rollouts.
Problem
The test was failing with:
```
unexpected number of registry pods found
Expected <[]v1.Pod | len:2> to have length 1
```
During catalog source image updates, there can temporarily be 2 pods:
- 1 old pod being deleted (with DeletionTimestamp set)
- 1 new pod starting up
The test was counting terminating pods and failing instead of focusing on the actual requirement: verifying the catalog image was updated.
Solution
Skip pods with DeletionTimestamp != nil when checking for the updated image.
This makes the test resilient to transient states during pod rollouts while still verifying the core requirement.
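For illustration, here is a minimal sketch of the kind of check described above, assuming a pod list fetched via client-go; the function name, signature, and image-matching loop are illustrative rather than the literal podCheckFunc from this PR:

```go
package e2e

import (
	corev1 "k8s.io/api/core/v1"
)

// hasActiveRegistryPodWithImage reports whether at least one non-terminating
// pod in the list is running the expected catalog image. This is a sketch of
// the check described above, not the exact podCheckFunc from the PR.
func hasActiveRegistryPodWithImage(pods []corev1.Pod, image string) bool {
	for _, pod := range pods {
		// Skip pods that are already terminating: during a catalog image
		// rollout the old registry pod can linger with DeletionTimestamp set.
		if pod.DeletionTimestamp != nil {
			continue
		}
		for _, container := range pod.Spec.Containers {
			if container.Image == image {
				return true
			}
		}
	}
	return false
}
```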
Changes
- Add check to skip terminating pods in podCheckFunc (3 lines)
- Test now focuses on: "Does an active pod exist with the new image?"
Testing
- Fixes downstream failure: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_operator-framework-olm/1142/pull-ci-openshift-operator-framework-olm-main-e2e-gcp-olm/1987692171796418560
Extra checks never hurt.
That is not true. See the downstream failure: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_operator-framework-olm/1142/pull-ci-openshift-operator-framework-olm-main-e2e-gcp-olm/1987692171796418560
We are checking whether there is only 1 pod. However, the downstream tests run in parallel, which is why we can have more than 1 pod.
My ideal solution here would be to not remove these checks in this PR, observe if we're still getting flaky results, and then look into why we are (possibly) "still failing".
By looking at the failure we can already tell what the error is.
My main concern is that removing these checks reduces the sanctity of this test, at least in the way it reads serially (and was intended to be read).
What we want to check with this test is still checked.
However, the downstream tests run in parallel, which is why we can have more than 1 pod.
This argument is really confusing me. These tests have always run in parallel. By this logic we should never have seen this test passing.
Do we have any evidence to back up the claim that running these tests in parallel causes pod count to be more than 1, rendering the check useless?
Even if the problem were in fact parallelism, the solution here would be to mark these tests to run serially, not strip away code that gives us confidence in these tests.
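If running serially were the preferred fix instead, a sketch of what that could look like with the Ginkgo v2 Serial decorator is below; the suite bootstrap and spec names here are assumptions, not the actual OLM e2e suite layout:

```go
package catalog_test

import (
	"testing"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

func TestCatalog(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "Catalog e2e suite")
}

// Serial is a Ginkgo v2 decorator: specs in this container never run in
// parallel with any other spec in the suite.
var _ = Describe("[CatalogSource] image update", Serial, func() {
	It("serves the updated catalog image from an active registry pod", func() {
		// ... existing test body would go here unchanged ...
	})
})
```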
Hi @anik120
This argument is really confusing me.
Sorry, but could you please take a look at the error: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_operator-framework-olm/1142/pull-ci-openshift-operator-framework-olm-main-e2e-gcp-olm/1987692171796418560 and the description of this PR?
The only check that we need to do here is: "Does an active pod exist with the new image?"
If we have more than one pod (as happened downstream), the test should not fail; it should pass as long as one active pod has the new image.
I'm a little confused as well. If the tests run in their own namespace, then parallelism shouldn't be a factor, right? Could the extra pod in the logs be the one still being deleted (which we now filter out in the test)?