e2e: node: testcase to exercise device manager recovery
When the kubelet recovers pods, there is no guarantee about the recovery order. This means a pod requesting devices can be recovered, and can start, before the device plugin governing those devices has started.
The end result is that containers can be admitted and start running without the requested device actually being available, causing runtime errors in the workload. This is upstream issue https://github.com/kubernetes/kubernetes/issues/109595
This commit adds an e2e test, focusing on the most critical scenario, single-node OpenShift (SNO), which exercises the flow and verifies that the system behavior is fixed: workload pods requesting devices are rejected (not admitted) if they are recovered before the relevant device plugin.
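To make the expected behavior concrete, here is a minimal Go sketch of the admission rule described above. It is not the kubelet's device manager code: `deviceRegistry`, `podRequest`, `admit`, and the resource name `example.com/resource` are hypothetical names introduced only for illustration.

```go
package main

import "fmt"

// deviceRegistry is a hypothetical stand-in for device manager state:
// resource name -> number of healthy devices currently registered.
type deviceRegistry map[string]int

// podRequest is a hypothetical, simplified view of the devices a pod asks for.
type podRequest struct {
	Name    string
	Devices map[string]int // resource name -> requested count
}

// admit returns nil when every requested resource has enough healthy
// devices registered, and an error (pod rejected) otherwise.
func admit(reg deviceRegistry, pod podRequest) error {
	for resource, want := range pod.Devices {
		have, registered := reg[resource]
		if !registered {
			return fmt.Errorf("pod %s: device plugin for %q not registered yet", pod.Name, resource)
		}
		if have < want {
			return fmt.Errorf("pod %s: requested %d of %q, only %d healthy", pod.Name, want, resource, have)
		}
	}
	return nil
}

func main() {
	pod := podRequest{Name: "workload", Devices: map[string]int{"example.com/resource": 1}}

	// After a reboot the pod may be recovered before the plugin re-registers.
	reg := deviceRegistry{}
	fmt.Println("before plugin registration:", admit(reg, pod))

	// Once the plugin has registered its devices, the same pod is admitted.
	reg["example.com/resource"] = 2
	fmt.Println("after plugin registration:", admit(reg, pod))
}
```

The point of the sketch is only the ordering: the same pod is rejected while the registry is empty and admitted once the plugin has registered, which is the outcome the e2e test is meant to check.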
/hold still WIP
the PR is now completed. cc @Tal-or @swatisehgal
Other than the minor NIT, the PR looks good to me. Thanks for your work on this!
Addressed the review comments; the PR is functionally complete, but still on hold because further testing is pending to adjust the timeouts. On cloud SNOs the timeouts are sufficient, but we want to be able to run the tests on bare metal (BM), where a reboot can take a long time.
all comments addressed. Will unhold once the tuning of the timeouts is completed.
/retest
/retest
/test ci/prow/e2e-aws-csi
/test ci/prow/e2e-aws-ovn-fips
/test ci/prow/e2e-gcp-ovn-etcd-scaling
@ffromani: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:
/test e2e-aws-jenkins
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-image-registry
/test e2e-aws-ovn-serial
/test e2e-gcp-ovn
/test e2e-gcp-ovn-builds
/test e2e-gcp-ovn-image-ecosystem
/test e2e-gcp-ovn-upgrade
/test extended_gssapi
/test extended_ldap_groups
/test extended_networking
/test images
/test lint
/test unit
/test verify
/test verify-deps
The following commands are available to trigger optional jobs:
/test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade-rollback
/test e2e-agnostic-ovn-cmd
/test e2e-aws
/test e2e-aws-csi
/test e2e-aws-disruptive
/test e2e-aws-multitenant
/test e2e-aws-ovn
/test e2e-aws-ovn-cgroupsv2
/test e2e-aws-ovn-etcd-scaling
/test e2e-aws-ovn-single-node
/test e2e-aws-ovn-single-node-serial
/test e2e-aws-ovn-single-node-upgrade
/test e2e-aws-ovn-upgrade
/test e2e-aws-proxy
/test e2e-azure
/test e2e-azure-ovn-etcd-scaling
/test e2e-gcp-csi
/test e2e-gcp-disruptive
/test e2e-gcp-fips-serial
/test e2e-gcp-ovn-etcd-scaling
/test e2e-gcp-ovn-rt-upgrade
/test e2e-gcp-ovn-techpreview
/test e2e-gcp-ovn-techpreview-serial
/test e2e-metal-ipi-ovn-dualstack
/test e2e-metal-ipi-ovn-ipv6
/test e2e-metal-ipi-sdn
/test e2e-metal-ipi-serial
/test e2e-metal-ipi-serial-ovn-ipv6
/test e2e-metal-ipi-virtualmedia
/test e2e-openstack-kuryr
/test e2e-openstack-ovn
/test e2e-openstack-serial
/test e2e-vsphere
/test e2e-vsphere-ovn-etcd-scaling
/test okd-e2e-gcp
Use /test all to run the following jobs that were automatically triggered:
pull-ci-openshift-origin-master-e2e-agnostic-ovn-cmd
pull-ci-openshift-origin-master-e2e-aws-csi
pull-ci-openshift-origin-master-e2e-aws-ovn-cgroupsv2
pull-ci-openshift-origin-master-e2e-aws-ovn-etcd-scaling
pull-ci-openshift-origin-master-e2e-aws-ovn-fips
pull-ci-openshift-origin-master-e2e-aws-ovn-serial
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade
pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade
pull-ci-openshift-origin-master-e2e-azure-ovn-etcd-scaling
pull-ci-openshift-origin-master-e2e-gcp-csi
pull-ci-openshift-origin-master-e2e-gcp-ovn
pull-ci-openshift-origin-master-e2e-gcp-ovn-builds
pull-ci-openshift-origin-master-e2e-gcp-ovn-etcd-scaling
pull-ci-openshift-origin-master-e2e-gcp-ovn-rt-upgrade
pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-ipv6
pull-ci-openshift-origin-master-e2e-metal-ipi-sdn
pull-ci-openshift-origin-master-e2e-openstack-ovn
pull-ci-openshift-origin-master-e2e-vsphere-ovn-etcd-scaling
pull-ci-openshift-origin-master-images
pull-ci-openshift-origin-master-lint
pull-ci-openshift-origin-master-unit
pull-ci-openshift-origin-master-verify
pull-ci-openshift-origin-master-verify-deps
In response to this:
/test ci/prow/e2e-aws-csi
/test ci/prow/e2e-aws-ovn-fips
/test ci/prow/e2e-gcp-ovn-etcd-scaling
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/hold cancel
fixed the timeouts, further refinements are non-blocking
/retest
New changes are detected. LGTM label has been removed.
so, running this test will cause openshift-tests to emit
[sig-arch] unknown image: registry.k8s.io/e2e-test-images/sample-device-plugin:1.5 (container/sample-device-plugin reason/Pulled image/registry.k8s.io/e2e-test-images/sample-device-plugin:1.5)
Writing JUnit report to /home/fromani/tmp/junit/junit_e2e__20230510-143327.xml
Suite run returned error: failed because an invariant was violated, 1 pass, 0 skip (5m24s)
error: failed because an invariant was violated, 1 pass, 0 skip (5m24s)
This happens because the added test consumes http://registry.k8s.io/e2e-test-images/sample-device-plugin:1.5.
I'm going to investigate further why this image doesn't have an ImageID upstream that I can use in test/utils/image (per the steps in https://github.com/openshift/origin/blob/master/test/extended/util/image/README.md). Likely this is because the image is used by e2e_node, not by the e2e tests. But I'll double-check upstream.
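As an illustration of the kind of invariant that trips here, a small Go sketch that flags pulled images missing from an allow-list. This is an assumption about the general shape of such a check, not the openshift-tests implementation; `unknownImages` and the allow-list entry are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// unknownImages returns, in sorted order, the pulled images that are not
// present in the allowed set.
func unknownImages(allowed map[string]bool, pulled []string) []string {
	var unknown []string
	for _, img := range pulled {
		if !allowed[img] {
			unknown = append(unknown, img)
		}
	}
	sort.Strings(unknown)
	return unknown
}

func main() {
	// Hypothetical allow-list entry, for illustration only.
	allowed := map[string]bool{
		"registry.k8s.io/pause:3.9": true,
	}
	pulled := []string{
		"registry.k8s.io/pause:3.9",
		"registry.k8s.io/e2e-test-images/sample-device-plugin:1.5",
	}
	for _, img := range unknownImages(allowed, pulled) {
		fmt.Println("unknown image:", img)
	}
}
```

Under this reading, the test body can pass and the suite still fails, because the image it pulls is not in the known set.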
/retest
PR had no changes and these lanes passed before
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: ffromani, swatisehgal
Once this PR has been reviewed and has the lgtm label, please assign dgoodwin for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/retest
@ffromani: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/e2e-azure-ovn-etcd-scaling | ac16cdd4f9b1f3e7fc63b4a165df3f3069208e88 | link | false | /test e2e-azure-ovn-etcd-scaling |
| ci/prow/e2e-vsphere-ovn-etcd-scaling | ac16cdd4f9b1f3e7fc63b4a165df3f3069208e88 | link | false | /test e2e-vsphere-ovn-etcd-scaling |
| ci/prow/e2e-aws-ovn-etcd-scaling | ac16cdd4f9b1f3e7fc63b4a165df3f3069208e88 | link | false | /test e2e-aws-ovn-etcd-scaling |
| ci/prow/e2e-agnostic-ovn-cmd | 3818f667e83c11a7195ce98219fbf28983d41386 | link | false | /test e2e-agnostic-ovn-cmd |
| ci/prow/e2e-aws-ovn-fips | 3818f667e83c11a7195ce98219fbf28983d41386 | link | true | /test e2e-aws-ovn-fips |
| ci/prow/verify-deps | 3818f667e83c11a7195ce98219fbf28983d41386 | link | true | /test verify-deps |
| ci/prow/e2e-aws-ovn-cgroupsv2 | 3818f667e83c11a7195ce98219fbf28983d41386 | link | false | /test e2e-aws-ovn-cgroupsv2 |
| ci/prow/e2e-aws-ovn-single-node-upgrade | 3818f667e83c11a7195ce98219fbf28983d41386 | link | false | /test e2e-aws-ovn-single-node-upgrade |
| ci/prow/e2e-gcp-ovn | 3818f667e83c11a7195ce98219fbf28983d41386 | link | true | /test e2e-gcp-ovn |
| ci/prow/e2e-aws-ovn-single-node-serial | 3818f667e83c11a7195ce98219fbf28983d41386 | link | false | /test e2e-aws-ovn-single-node-serial |
| ci/prow/e2e-gcp-ovn-rt-upgrade | 3818f667e83c11a7195ce98219fbf28983d41386 | link | false | /test e2e-gcp-ovn-rt-upgrade |
| ci/prow/e2e-aws-ovn-upgrade | 3818f667e83c11a7195ce98219fbf28983d41386 | link | false | /test e2e-aws-ovn-upgrade |
| ci/prow/e2e-gcp-ovn-upgrade | 3818f667e83c11a7195ce98219fbf28983d41386 | link | true | /test e2e-gcp-ovn-upgrade |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | 3818f667e83c11a7195ce98219fbf28983d41386 | link | false | /test e2e-metal-ipi-ovn-ipv6 |
| ci/prow/e2e-aws-ovn-single-node | 3818f667e83c11a7195ce98219fbf28983d41386 | link | false | /test e2e-aws-ovn-single-node |
| ci/prow/e2e-metal-ipi-sdn | 3818f667e83c11a7195ce98219fbf28983d41386 | link | false | /test e2e-metal-ipi-sdn |
| ci/prow/verify | 3818f667e83c11a7195ce98219fbf28983d41386 | link | true | /test verify |
| ci/prow/unit | 3818f667e83c11a7195ce98219fbf28983d41386 | link | true | /test unit |
| ci/prow/e2e-aws-ovn-serial | 3818f667e83c11a7195ce98219fbf28983d41386 | link | true | /test e2e-aws-ovn-serial |
| ci/prow/e2e-openstack-ovn | 3818f667e83c11a7195ce98219fbf28983d41386 | link | false | /test e2e-openstack-ovn |
| ci/prow/e2e-aws-csi | 3818f667e83c11a7195ce98219fbf28983d41386 | link | false | /test e2e-aws-csi |
| ci/prow/images | 3818f667e83c11a7195ce98219fbf28983d41386 | link | true | /test images |
| ci/prow/e2e-gcp-csi | 3818f667e83c11a7195ce98219fbf28983d41386 | link | false | /test e2e-gcp-csi |
| ci/prow/lint | 3818f667e83c11a7195ce98219fbf28983d41386 | link | true | /test lint |
| ci/prow/e2e-gcp-ovn-builds | 3818f667e83c11a7195ce98219fbf28983d41386 | link | true | /test e2e-gcp-ovn-builds |
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle stale
PR needs rebase.
after almost 1 year, I think this is not gonna make it. Too bad.