
Migrate e2e tests for GPUs

Open · seans3 opened this issue 1 year ago • 8 comments

  • Migrates 3 GPU upgrade/downgrade e2e tests from k/k in-tree to this cloud-provider-gcp repository. Migrated from file: https://github.com/kubernetes/kubernetes/blob/release-1.30/test/e2e/cloud/gcp/node/gpu.go

  • Migrates 2 Nvidia GPU e2e tests from k/k in-tree to this cloud-provider-gcp repository. Migrated from file: https://github.com/kubernetes/kubernetes/blob/release-1.30/test/e2e/scheduling/nvidia-gpus.go

  • Migrates 1 Stackdriver instrumentation e2e test from k/k in-tree to this cloud-provider-gcp repository. Migrated from file: https://github.com/kubernetes/kubernetes/blob/release-1.30/test/e2e/instrumentation/monitoring/accelerator.go
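For orientation, here is a minimal sketch of how one of these migrated specs would be registered in this repository. It is an assumed structure, not the actual migrated code: the suite tag and spec name are taken from the failure summary below, SetupNVIDIAGPUNode appears in the stack trace, and everything else (package name, exact signature) is illustrative.

```go
// Sketch only: assumes Ginkgo v2 and the vendored k/k e2e framework.
package e2e

import (
	"context"

	"github.com/onsi/ginkgo/v2"
	"k8s.io/kubernetes/test/e2e/framework"
)

var _ = ginkgo.Describe("[cloud-provider-gcp-e2e] Device Plugin GPUs", func() {
	f := framework.NewDefaultFramework("device-plugin-gpus")

	ginkgo.It("run Nvidia GPU Device Plugin tests", func(ctx context.Context) {
		// Install the nvidia-driver-installer daemonset and wait for its pods;
		// SetupNVIDIAGPUNode is the repo helper seen in the stack trace below,
		// its third argument is assumed here.
		SetupNVIDIAGPUNode(ctx, f, false)
		// ... GPU pod scheduling assertions would follow here ...
	})
})
```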

Status

Currently failing: pods from the nvidia-driver-installer daemonset never come up; the framework waits 60 seconds for at least one pod controlled by the daemonset and finds none.

Summarizing 6 Failures:
  [FAIL] [cloud-provider-gcp-e2e] Stackdriver Monitoring [It] should have accelerator metrics
  /home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:206
  [FAIL] [cloud-provider-gcp-e2e] GPUDevicePluginAcrossRecreate [It] run Nvidia GPU Device Plugin tests with a recreation
  /home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:206
  [FAIL] [cloud-provider-gcp-e2e] Device Plugin GPUs [It] run Nvidia GPU Device Plugin tests
  /home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:206
  [FAIL] [cloud-provider-gcp-e2e] GPU Upgrade cluster upgrade [It] should be able to run gpu pod after upgrade
  /home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:206
  [FAIL] [cloud-provider-gcp-e2e] GPU Upgrade master upgrade [It] should NOT disrupt gpu pod
  /home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:206
  [FAIL] [cloud-provider-gcp-e2e] GPU Upgrade cluster downgrade [It] should be able to run gpu pod after downgrade
  /home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:206

Ran 9 of 290 Specs in 192.499 seconds
FAIL! -- 3 Passed | 6 Failed | 0 Pending | 281 Skipped
 [FAILED] failed to get pods controlled by the nvidia-driver-installer daemonset: Timed out after 60.000s.
  expected at least 1 pods, only got 0

 I0529 17:55:52.560354 1857376 upgrade_context.go:86] Version for "ci/latest" is "v1.31.0-alpha.0.983+f44bb5e6e58c31\n"
  I0529 17:55:52.677652 1857376 nvidia-gpu.go:144] Nodename: kt2-1717029961633-master, OS Image: Container-Optimized OS from Google
  I0529 17:55:52.677662 1857376 nvidia-gpu.go:144] Nodename: kt2-1717029961633-minion-group-2hm0, OS Image: Container-Optimized OS from Google
  I0529 17:55:52.677664 1857376 nvidia-gpu.go:144] Nodename: kt2-1717029961633-minion-group-br6z, OS Image: Container-Optimized OS from Google
  I0529 17:55:52.677666 1857376 nvidia-gpu.go:144] Nodename: kt2-1717029961633-minion-group-d2g3, OS Image: Container-Optimized OS from Google
  I0529 17:55:52.677669 1857376 nvidia-gpu.go:101] Using default local nvidia-driver-installer daemonset manifest.
  I0529 17:55:52.743577 1857376 nvidia-gpu.go:112] Successfully created daemonset to install Nvidia drivers.
  I0529 17:56:52.803452 1857376 nvidia-gpu.go:115] Failed inside E2E framework:
      k8s.io/kubernetes/test/e2e/framework/pod.WaitForPods({0x7b1611463800, 0xc0005fc2d0}, {0x43623f0, 0xc0013008c0}, {0xc000aac7b0, 0x2f}, {{{0x0, 0x0}, {0x0, 0x0}}, ...}, ...)
      	/home/sean/go/pkg/mod/k8s.io/[email protected]/test/e2e/framework/pod/wait.go:327 +0x625
      k8s.io/kubernetes/test/e2e/framework/pod.WaitForPodsWithLabel({0x7b1611463800, 0xc0005fc2d0}, {0x43623f0, 0xc0013008c0}, {0xc000aac7b0, 0x2f}, {0x433ace0?, 0xc000aea8a0?})
      	/home/sean/go/pkg/mod/k8s.io/[email protected]/test/e2e/framework/pod/wait.go:657 +0x119
      k8s.io/kubernetes/test/e2e/framework/resource.WaitForControlledPods({0x7b1611463800, 0xc0005fc2d0}, {0x43623f0, 0xc0013008c0}, {0xc000aac7b0, 0x2f}, {0xc000161e60?, 0xc0008e13c0?}, {{0x3db2e53, 0xa}, ...})
      	/home/sean/go/pkg/mod/k8s.io/[email protected]/test/e2e/framework/resource/resources.go:249 +0xd8
      k8s.io/cloud-provider-gcp/tests/e2e.SetupNVIDIAGPUNode({0x7b1611463800, 0xc0005fc2d0}, 0xc0010a0000, 0x0)
      	/home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:114 +0x4cc
      k8s.io/cloud-provider-gcp/tests/e2e.(*NvidiaGPUUpgradeTest).Setup(0xc000fd7728?, {0x7b1611463800, 0xc0005fc2d0}, 0xc0010a0000)
      	/home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:61 +0x2d
      k8s.io/cloud-provider-gcp/tests/e2e.(*chaosMonkeyAdapter).Test(0xc001069480, {0x7b1611463800, 0xc0005fc2d0}, 0xc00095c1e0)
      	/home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/gpu.go:183 +0x1ce
      k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do.func1()
      	/home/sean/go/pkg/mod/k8s.io/[email protected]/test/e2e/chaosmonkey/chaosmonkey.go:95 +0x6c
      created by k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do in goroutine 125
      	/home/sean/go/pkg/mod/k8s.io/[email protected]/test/e2e/chaosmonkey/chaosmonkey.go:92 +0xa5
  I0529 17:56:52.803522 1857376 util.go:650] Running ../../cluster/gce/upgrade.sh [-M v1.31.0-alpha.0.983+f44bb5e6e58c31]
  I0529 17:57:05.616098 1857376 upgrade_mechanics.go:40] Unexpected error: 
      <*errors.errorString | 0xc0006c7860>: 
      error running ../../cluster/gce/upgrade.sh [-M v1.31.0-alpha.0.983+f44bb5e6e58c31]; got error exit status 1, stdout "Fetching the previously installed CoreDNS version\nThe default etcd storage media type in 1.6 has changed from application/json to application/vnd.kubernetes.protobuf.\nDocumentation about the change can be found at https://kubernetes.io/docs/admin/etcd_upgrade.\n\nETCD2 DOES NOT SUPPORT PROTOBUF: If you wish to have to ability to downgrade to etcd2 later application/json must be used.\n\nIt's HIGHLY recommended that etcd be backed up before this step!!\n\nTo enable using json, before running this script set:\nexport STORAGE_MEDIA_TYPE=application/json\n\nTo enable using protobuf, before running this script set:\nexport STORAGE_MEDIA_TYPE=application/vnd.kubernetes.protobuf\n\n", stderr "Using image: cos-109-17800-218-37 from project: cos-cloud as master image\nUsing image: cos-109-17800-218-37 from project: cos-cloud as node image\nUsing image: cos-109-17800-218-37 from project: cos-cloud as master image\nUsing image: cos-109-17800-218-37 from project: cos-cloud as node image\nUsing image: cos-109-17800-218-37 from project: cos-cloud as master image\nUsing image: cos-109-17800-218-37 from project: cos-cloud as node image\nSTORAGE_MEDIA_TYPE must be specified when run non-interactively.\n"
      {
          s: "error running ../../cluster/gce/upgrade.sh [-M v1.31.0-alpha.0.983+f44bb5e6e58c31]; got error exit status 1, stdout \"Fetching the previously installed CoreDNS version\\nThe default etcd storage media type in 1.6 has changed from application/json to application/vnd.kubernetes.protobuf.\\nDocumentation about the change can be found at https://kubernetes.io/docs/admin/etcd_upgrade.\\n\\nETCD2 DOES NOT SUPPORT PROTOBUF: If you wish to have to ability to downgrade to etcd2 later application/json must be used.\\n\\nIt's HIGHLY recommended that etcd be backed up before this step!!\\n\\nTo enable using json, before running this script set:\\nexport STORAGE_MEDIA_TYPE=application/json\\n\\nTo enable using protobuf, before running this script set:\\nexport STORAGE_MEDIA_TYPE=application/vnd.kubernetes.protobuf\\n\\n\", stderr \"Using image: cos-109-17800-218-37 from project: cos-cloud as master image\\nUsing image: cos-109-17800-218-37 from project: cos-cloud as node image\\nUsing image: cos-109-17800-218-37 from project: cos-cloud as master image\\nUsing image: cos-109-17800-218-37 from project: cos-cloud as node image\\nUsing image: cos-109-17800-218-37 from project: cos-cloud as master image\\nUsing image: cos-109-17800-218-37 from project: cos-cloud as node image\\nSTORAGE_MEDIA_TYPE must be specified when run non-interactively.\\n\"",
      }
  [FAILED] in [It] - /home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:115 @ 05/29/24 17:57:05.617
  STEP: Destroying namespace "nvidia-gpu-upgrade-sig-node-sig-scheduling-5395" for this suite. @ 05/29/24 17:57:05.618
  STEP: Destroying namespace "gpu-upgrade-5929" for this suite. @ 05/29/24 17:57:05.687

seans3 · May 30 '24 01:05

This issue is currently awaiting triage.

If the repository maintainers determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · May 30 '24 01:05

/assign @BenTheElder

seans3 · May 30 '24 01:05

/test pull-cloud-provider-gcp-e2e

seans3 · May 30 '24 01:05

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: seans3. Once this PR has been reviewed and has the lgtm label, please ask for approval from bentheelder. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

  • Approvers can indicate their approval by writing /approve in a comment
  • Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · May 30 '24 03:05

/test pull-cloud-provider-gcp-e2e

seans3 · May 30 '24 03:05

@seans3: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  Test name: pull-cloud-provider-gcp-e2e
  Commit: 8a4d4954169a26e8da89833f7a62aaadd2a6760c
  Details: link
  Required: false
  Rerun command: /test pull-cloud-provider-gcp-e2e

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

k8s-ci-robot · May 30 '24 04:05

/hold

seans3 · Jun 04 '24 05:06

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Sep 02 '24 06:09

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · Oct 02 '24 07:10

We have new generic nvidia GPU tests upstream now on AWS + GCP.

Maybe we don't want to maintain these anymore.

BenTheElder · Oct 03 '24 05:10

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot · Nov 02 '24 06:11

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · Nov 02 '24 06:11