e2e: node: testcase to exercise device manager recovery

Open ffromani opened this issue 2 years ago • 20 comments

When the kubelet recovers pods, there is no guarantee about the order of recovery. This means a pod requesting devices can be recovered, and can start, before the device plugin governing those devices is started.

The end result is that containers can be admitted and start running without the requested device actually being available, causing runtime errors in the workload. This is upstream issue https://github.com/kubernetes/kubernetes/issues/109595

With this commit, we add an e2e test, focusing on the most critical SNO (single-node OpenShift) scenario, which exercises the flow and verifies that the system behavior is fixed: workload pods requesting devices are rejected (not admitted) if they are recovered before the relevant device plugin.
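
The expected behavior can be illustrated with a small model. This is only a sketch, assuming a much simpler admission check than the real kubelet performs: the `DeviceManager` class, its methods, and the resource name `example.com/resource` are hypothetical and are not the kubelet API or the code in this PR. It shows the property the test looks for, namely that a pod recovered before its device plugin registers is rejected rather than admitted.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class DeviceManager:
    """Toy stand-in for a device manager (hypothetical, not the kubelet API)."""

    # resource name -> healthy devices reported by a registered plugin
    healthy: Dict[str, int] = field(default_factory=dict)

    def register_plugin(self, resource: str, devices: int) -> None:
        """Record that a plugin for `resource` registered with `devices` healthy devices."""
        self.healthy[resource] = devices

    def admit(self, requests: Dict[str, int]) -> Tuple[bool, str]:
        """Admit a pod only if every requested resource has enough healthy devices."""
        for resource, wanted in requests.items():
            available = self.healthy.get(resource, 0)
            if available < wanted:
                return False, f"{resource}: requested {wanted}, available {available}"
        return True, "admitted"


if __name__ == "__main__":
    dm = DeviceManager()
    pod_requests = {"example.com/resource": 1}  # hypothetical resource name

    # Pod recovered before the device plugin: must be rejected.
    print(dm.admit(pod_requests))

    # Once the plugin has registered, the same request is admitted.
    dm.register_plugin("example.com/resource", 2)
    print(dm.admit(pod_requests))
```

In this model the order of `register_plugin` and `admit` is the whole story; the real test has to provoke that ordering on a live node, which is why it involves a reboot.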

ffromani avatar Apr 28 '23 13:04 ffromani

/hold still WIP

ffromani avatar Apr 28 '23 13:04 ffromani

the PR is now completed. cc @Tal-or @swatisehgal

ffromani avatar Apr 28 '23 15:04 ffromani

Other than the minor NIT, the PR looks good to me. Thanks for your work on this!

swatisehgal avatar Apr 28 '23 16:04 swatisehgal

addressed the review comments; the PR is functionally complete, but still on hold because further testing is pending to adjust the timeouts. On cloud SNOs the timeouts are sufficient, but we want to be able to run the tests on BM (bare metal), where a reboot can take a long time.
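
As a rough illustration of the timeout tuning mentioned above, here is a minimal polling helper with a per-platform budget. Everything in it is an assumption made for illustration: `wait_until`, the `REBOOT_TIMEOUT_S` table, and its numbers are placeholders, not the values or helpers used in this PR.

```python
import time
from typing import Callable, Dict


def wait_until(condition: Callable[[], bool], timeout_s: float, interval_s: float = 1.0) -> bool:
    """Poll `condition` until it returns True or `timeout_s` elapses.

    Returns True if the condition was met, False on timeout. The condition is
    always evaluated at least once, even with a zero timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Placeholder budgets (seconds): bare metal gets more time because reboots are slower.
REBOOT_TIMEOUT_S: Dict[str, float] = {"cloud": 600.0, "baremetal": 1800.0}


if __name__ == "__main__":
    # Stand-in condition; a real test would check that the node is Ready again.
    node_is_back = lambda: True
    print(wait_until(node_is_back, timeout_s=REBOOT_TIMEOUT_S["cloud"]))
```

Keeping the budget in one table per platform means a slow bare-metal reboot only needs a larger number, not a different code path.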

ffromani avatar Apr 28 '23 17:04 ffromani

all comments addressed. Will unhold once the tuning of the timeouts is completed.

ffromani avatar May 02 '23 11:05 ffromani

/retest

ffromani avatar May 02 '23 16:05 ffromani

/retest

ffromani avatar May 03 '23 06:05 ffromani

/test ci/prow/e2e-aws-csi
/test ci/prow/e2e-aws-ovn-fips
/test ci/prow/e2e-gcp-ovn-etcd-scaling

ffromani avatar May 03 '23 10:05 ffromani

@ffromani: The specified target(s) for /test were not found. The following commands are available to trigger required jobs:

  • /test e2e-aws-jenkins
  • /test e2e-aws-ovn-fips
  • /test e2e-aws-ovn-image-registry
  • /test e2e-aws-ovn-serial
  • /test e2e-gcp-ovn
  • /test e2e-gcp-ovn-builds
  • /test e2e-gcp-ovn-image-ecosystem
  • /test e2e-gcp-ovn-upgrade
  • /test extended_gssapi
  • /test extended_ldap_groups
  • /test extended_networking
  • /test images
  • /test lint
  • /test unit
  • /test verify
  • /test verify-deps

The following commands are available to trigger optional jobs:

  • /test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade-rollback
  • /test e2e-agnostic-ovn-cmd
  • /test e2e-aws
  • /test e2e-aws-csi
  • /test e2e-aws-disruptive
  • /test e2e-aws-multitenant
  • /test e2e-aws-ovn
  • /test e2e-aws-ovn-cgroupsv2
  • /test e2e-aws-ovn-etcd-scaling
  • /test e2e-aws-ovn-single-node
  • /test e2e-aws-ovn-single-node-serial
  • /test e2e-aws-ovn-single-node-upgrade
  • /test e2e-aws-ovn-upgrade
  • /test e2e-aws-proxy
  • /test e2e-azure
  • /test e2e-azure-ovn-etcd-scaling
  • /test e2e-gcp-csi
  • /test e2e-gcp-disruptive
  • /test e2e-gcp-fips-serial
  • /test e2e-gcp-ovn-etcd-scaling
  • /test e2e-gcp-ovn-rt-upgrade
  • /test e2e-gcp-ovn-techpreview
  • /test e2e-gcp-ovn-techpreview-serial
  • /test e2e-metal-ipi-ovn-dualstack
  • /test e2e-metal-ipi-ovn-ipv6
  • /test e2e-metal-ipi-sdn
  • /test e2e-metal-ipi-serial
  • /test e2e-metal-ipi-serial-ovn-ipv6
  • /test e2e-metal-ipi-virtualmedia
  • /test e2e-openstack-kuryr
  • /test e2e-openstack-ovn
  • /test e2e-openstack-serial
  • /test e2e-vsphere
  • /test e2e-vsphere-ovn-etcd-scaling
  • /test okd-e2e-gcp

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-origin-master-e2e-agnostic-ovn-cmd
  • pull-ci-openshift-origin-master-e2e-aws-csi
  • pull-ci-openshift-origin-master-e2e-aws-ovn-cgroupsv2
  • pull-ci-openshift-origin-master-e2e-aws-ovn-etcd-scaling
  • pull-ci-openshift-origin-master-e2e-aws-ovn-fips
  • pull-ci-openshift-origin-master-e2e-aws-ovn-serial
  • pull-ci-openshift-origin-master-e2e-aws-ovn-single-node
  • pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial
  • pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade
  • pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade
  • pull-ci-openshift-origin-master-e2e-azure-ovn-etcd-scaling
  • pull-ci-openshift-origin-master-e2e-gcp-csi
  • pull-ci-openshift-origin-master-e2e-gcp-ovn
  • pull-ci-openshift-origin-master-e2e-gcp-ovn-builds
  • pull-ci-openshift-origin-master-e2e-gcp-ovn-etcd-scaling
  • pull-ci-openshift-origin-master-e2e-gcp-ovn-rt-upgrade
  • pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade
  • pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-ipv6
  • pull-ci-openshift-origin-master-e2e-metal-ipi-sdn
  • pull-ci-openshift-origin-master-e2e-openstack-ovn
  • pull-ci-openshift-origin-master-e2e-vsphere-ovn-etcd-scaling
  • pull-ci-openshift-origin-master-images
  • pull-ci-openshift-origin-master-lint
  • pull-ci-openshift-origin-master-unit
  • pull-ci-openshift-origin-master-verify
  • pull-ci-openshift-origin-master-verify-deps

In response to this:

/test ci/prow/e2e-aws-csi
/test ci/prow/e2e-aws-ovn-fips
/test ci/prow/e2e-gcp-ovn-etcd-scaling

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
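
The rejected commands passed the full status context (`ci/prow/...`) where the bot expects the bare job name, as the lists of available commands show. A minimal sketch of that mapping, assuming the prefix is always the literal string `ci/prow/`; the helper name `rerun_command` is hypothetical and is not part of Prow:

```python
PROW_CONTEXT_PREFIX = "ci/prow/"


def rerun_command(context: str) -> str:
    """Turn a status context like 'ci/prow/e2e-aws-csi' into '/test e2e-aws-csi'.

    Names that do not carry the prefix are passed through unchanged.
    """
    if context.startswith(PROW_CONTEXT_PREFIX):
        target = context[len(PROW_CONTEXT_PREFIX):]
    else:
        target = context
    return f"/test {target}"


if __name__ == "__main__":
    for ctx in ("ci/prow/e2e-aws-csi", "ci/prow/e2e-aws-ovn-fips", "ci/prow/e2e-gcp-ovn-etcd-scaling"):
        print(rerun_command(ctx))
```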

openshift-ci[bot] avatar May 03 '23 10:05 openshift-ci[bot]

/hold cancel

fixed the timeouts, further refinements are non-blocking

ffromani avatar May 03 '23 10:05 ffromani

/retest

ffromani avatar May 03 '23 11:05 ffromani

New changes are detected. LGTM label has been removed.

openshift-ci[bot] avatar May 09 '23 16:05 openshift-ci[bot]

so, running this test will cause openshift-tests to emit

[sig-arch] unknown image: registry.k8s.io/e2e-test-images/sample-device-plugin:1.5 (container/sample-device-plugin reason/Pulled image/registry.k8s.io/e2e-test-images/sample-device-plugin:1.5)
Writing JUnit report to /home/fromani/tmp/junit/junit_e2e__20230510-143327.xml

Suite run returned error: failed because an invariant was violated, 1 pass, 0 skip (5m24s)

error: failed because an invariant was violated, 1 pass, 0 skip (5m24s)

Which is because the added test consumes registry.k8s.io/e2e-test-images/sample-device-plugin:1.5. I'm going to investigate further why this image doesn't have an ImageID upstream that I can use in test/utils/image (per the steps in https://github.com/openshift/origin/blob/master/test/extended/util/image/README.md). Likely this is because the image is used by e2e_node, not by the e2e tests. But I'll double-check upstream.

ffromani avatar May 10 '23 15:05 ffromani

/retest

PR had no changes and these lanes passed before

ffromani avatar May 31 '23 06:05 ffromani

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ffromani, swatisehgal
Once this PR has been reviewed and has the lgtm label, please assign dgoodwin for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

openshift-ci[bot] avatar Aug 10 '23 11:08 openshift-ci[bot]

/retest

ffromani avatar Aug 11 '23 08:08 ffromani

@ffromani: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-azure-ovn-etcd-scaling | ac16cdd4f9b1f3e7fc63b4a165df3f3069208e88 | link | false | /test e2e-azure-ovn-etcd-scaling |
| ci/prow/e2e-vsphere-ovn-etcd-scaling | ac16cdd4f9b1f3e7fc63b4a165df3f3069208e88 | link | false | /test e2e-vsphere-ovn-etcd-scaling |
| ci/prow/e2e-aws-ovn-etcd-scaling | ac16cdd4f9b1f3e7fc63b4a165df3f3069208e88 | link | false | /test e2e-aws-ovn-etcd-scaling |
| ci/prow/e2e-agnostic-ovn-cmd | 3818f667e83c11a7195ce98219fbf28983d41386 | link | false | /test e2e-agnostic-ovn-cmd |
| ci/prow/e2e-aws-ovn-fips | 3818f667e83c11a7195ce98219fbf28983d41386 | link | true | /test e2e-aws-ovn-fips |
| ci/prow/verify-deps | 3818f667e83c11a7195ce98219fbf28983d41386 | link | true | /test verify-deps |
| ci/prow/e2e-aws-ovn-cgroupsv2 | 3818f667e83c11a7195ce98219fbf28983d41386 | link | false | /test e2e-aws-ovn-cgroupsv2 |
| ci/prow/e2e-aws-ovn-single-node-upgrade | 3818f667e83c11a7195ce98219fbf28983d41386 | link | false | /test e2e-aws-ovn-single-node-upgrade |
| ci/prow/e2e-gcp-ovn | 3818f667e83c11a7195ce98219fbf28983d41386 | link | true | /test e2e-gcp-ovn |
| ci/prow/e2e-aws-ovn-single-node-serial | 3818f667e83c11a7195ce98219fbf28983d41386 | link | false | /test e2e-aws-ovn-single-node-serial |
| ci/prow/e2e-gcp-ovn-rt-upgrade | 3818f667e83c11a7195ce98219fbf28983d41386 | link | false | /test e2e-gcp-ovn-rt-upgrade |
| ci/prow/e2e-aws-ovn-upgrade | 3818f667e83c11a7195ce98219fbf28983d41386 | link | false | /test e2e-aws-ovn-upgrade |
| ci/prow/e2e-gcp-ovn-upgrade | 3818f667e83c11a7195ce98219fbf28983d41386 | link | true | /test e2e-gcp-ovn-upgrade |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | 3818f667e83c11a7195ce98219fbf28983d41386 | link | false | /test e2e-metal-ipi-ovn-ipv6 |
| ci/prow/e2e-aws-ovn-single-node | 3818f667e83c11a7195ce98219fbf28983d41386 | link | false | /test e2e-aws-ovn-single-node |
| ci/prow/e2e-metal-ipi-sdn | 3818f667e83c11a7195ce98219fbf28983d41386 | link | false | /test e2e-metal-ipi-sdn |
| ci/prow/verify | 3818f667e83c11a7195ce98219fbf28983d41386 | link | true | /test verify |
| ci/prow/unit | 3818f667e83c11a7195ce98219fbf28983d41386 | link | true | /test unit |
| ci/prow/e2e-aws-ovn-serial | 3818f667e83c11a7195ce98219fbf28983d41386 | link | true | /test e2e-aws-ovn-serial |
| ci/prow/e2e-openstack-ovn | 3818f667e83c11a7195ce98219fbf28983d41386 | link | false | /test e2e-openstack-ovn |
| ci/prow/e2e-aws-csi | 3818f667e83c11a7195ce98219fbf28983d41386 | link | false | /test e2e-aws-csi |
| ci/prow/images | 3818f667e83c11a7195ce98219fbf28983d41386 | link | true | /test images |
| ci/prow/e2e-gcp-csi | 3818f667e83c11a7195ce98219fbf28983d41386 | link | false | /test e2e-gcp-csi |
| ci/prow/lint | 3818f667e83c11a7195ce98219fbf28983d41386 | link | true | /test lint |
| ci/prow/e2e-gcp-ovn-builds | 3818f667e83c11a7195ce98219fbf28983d41386 | link | true | /test e2e-gcp-ovn-builds |

Full PR test history. Your PR dashboard.

openshift-ci[bot] avatar Oct 13 '23 06:10 openshift-ci[bot]

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot avatar Jan 11 '24 09:01 openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot avatar Feb 11 '24 00:02 openshift-bot

PR needs rebase.

openshift-merge-robot avatar Feb 11 '24 00:02 openshift-merge-robot

after almost 1 year, I think this is not gonna make it. Too bad.

ffromani avatar Mar 06 '24 16:03 ffromani