
Migrate e2e tests for GPUs

Open · seans3 opened this issue 1 year ago • 8 comments

  • Migrates 3 GPU upgrade/downgrade e2e tests from k/k in-tree to this cloud-provider-gcp repository. Migrated from file: https://github.com/kubernetes/kubernetes/blob/release-1.30/test/e2e/cloud/gcp/node/gpu.go

  • Migrates 2 Nvidia GPU e2e tests from k/k in-tree to this cloud-provider-gcp repository. Migrated from file: https://github.com/kubernetes/kubernetes/blob/release-1.30/test/e2e/scheduling/nvidia-gpus.go

  • Migrates 1 Stackdriver instrumentation e2e test from k/k in-tree to this cloud-provider-gcp repository. Migrated from file: https://github.com/kubernetes/kubernetes/blob/release-1.30/test/e2e/instrumentation/monitoring/accelerator.go
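For orientation, here is a minimal sketch of how one of these migrated specs would be registered in this repository. It is an assumed structure, not the actual migrated code: the suite tag and spec name are taken from the failure summary below, SetupNVIDIAGPUNode appears in the stack trace, and everything else (package name, exact signature) is illustrative.

```go
// Sketch only: assumes Ginkgo v2 and the vendored k/k e2e framework.
package e2e

import (
	"context"

	"github.com/onsi/ginkgo/v2"
	"k8s.io/kubernetes/test/e2e/framework"
)

var _ = ginkgo.Describe("[cloud-provider-gcp-e2e] Device Plugin GPUs", func() {
	f := framework.NewDefaultFramework("device-plugin-gpus")

	ginkgo.It("run Nvidia GPU Device Plugin tests", func(ctx context.Context) {
		// Install the nvidia-driver-installer daemonset and wait for its pods;
		// SetupNVIDIAGPUNode is the repo helper seen in the stack trace below,
		// its third argument is assumed here.
		SetupNVIDIAGPUNode(ctx, f, false)
		// ... GPU pod scheduling assertions would follow here ...
	})
})
```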

Status

Currently failing: pods from the nvidia-driver-installer daemonset never come up; the framework waits 60 seconds for at least one pod controlled by the daemonset and finds none.

Summarizing 6 Failures:
  [FAIL] [cloud-provider-gcp-e2e] Stackdriver Monitoring [It] should have accelerator metrics
  /home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:206
  [FAIL] [cloud-provider-gcp-e2e] GPUDevicePluginAcrossRecreate [It] run Nvidia GPU Device Plugin tests with a recreation
  /home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:206
  [FAIL] [cloud-provider-gcp-e2e] Device Plugin GPUs [It] run Nvidia GPU Device Plugin tests
  /home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:206
  [FAIL] [cloud-provider-gcp-e2e] GPU Upgrade cluster upgrade [It] should be able to run gpu pod after upgrade
  /home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:206
  [FAIL] [cloud-provider-gcp-e2e] GPU Upgrade master upgrade [It] should NOT disrupt gpu pod
  /home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:206
  [FAIL] [cloud-provider-gcp-e2e] GPU Upgrade cluster downgrade [It] should be able to run gpu pod after downgrade
  /home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:206

Ran 9 of 290 Specs in 192.499 seconds
FAIL! -- 3 Passed | 6 Failed | 0 Pending | 281 Skipped
 [FAILED] failed to get pods controlled by the nvidia-driver-installer daemonset: Timed out after 60.000s.
  expected at least 1 pods, only got 0

 I0529 17:55:52.560354 1857376 upgrade_context.go:86] Version for "ci/latest" is "v1.31.0-alpha.0.983+f44bb5e6e58c31\n"
  I0529 17:55:52.677652 1857376 nvidia-gpu.go:144] Nodename: kt2-1717029961633-master, OS Image: Container-Optimized OS from Google
  I0529 17:55:52.677662 1857376 nvidia-gpu.go:144] Nodename: kt2-1717029961633-minion-group-2hm0, OS Image: Container-Optimized OS from Google
  I0529 17:55:52.677664 1857376 nvidia-gpu.go:144] Nodename: kt2-1717029961633-minion-group-br6z, OS Image: Container-Optimized OS from Google
  I0529 17:55:52.677666 1857376 nvidia-gpu.go:144] Nodename: kt2-1717029961633-minion-group-d2g3, OS Image: Container-Optimized OS from Google
  I0529 17:55:52.677669 1857376 nvidia-gpu.go:101] Using default local nvidia-driver-installer daemonset manifest.
  I0529 17:55:52.743577 1857376 nvidia-gpu.go:112] Successfully created daemonset to install Nvidia drivers.
  I0529 17:56:52.803452 1857376 nvidia-gpu.go:115] Failed inside E2E framework:
      k8s.io/kubernetes/test/e2e/framework/pod.WaitForPods({0x7b1611463800, 0xc0005fc2d0}, {0x43623f0, 0xc0013008c0}, {0xc000aac7b0, 0x2f}, {{{0x0, 0x0}, {0x0, 0x0}}, ...}, ...)
      	/home/sean/go/pkg/mod/k8s.io/[email protected]/test/e2e/framework/pod/wait.go:327 +0x625
      k8s.io/kubernetes/test/e2e/framework/pod.WaitForPodsWithLabel({0x7b1611463800, 0xc0005fc2d0}, {0x43623f0, 0xc0013008c0}, {0xc000aac7b0, 0x2f}, {0x433ace0?, 0xc000aea8a0?})
      	/home/sean/go/pkg/mod/k8s.io/[email protected]/test/e2e/framework/pod/wait.go:657 +0x119
      k8s.io/kubernetes/test/e2e/framework/resource.WaitForControlledPods({0x7b1611463800, 0xc0005fc2d0}, {0x43623f0, 0xc0013008c0}, {0xc000aac7b0, 0x2f}, {0xc000161e60?, 0xc0008e13c0?}, {{0x3db2e53, 0xa}, ...})
      	/home/sean/go/pkg/mod/k8s.io/[email protected]/test/e2e/framework/resource/resources.go:249 +0xd8
      k8s.io/cloud-provider-gcp/tests/e2e.SetupNVIDIAGPUNode({0x7b1611463800, 0xc0005fc2d0}, 0xc0010a0000, 0x0)
      	/home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:114 +0x4cc
      k8s.io/cloud-provider-gcp/tests/e2e.(*NvidiaGPUUpgradeTest).Setup(0xc000fd7728?, {0x7b1611463800, 0xc0005fc2d0}, 0xc0010a0000)
      	/home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:61 +0x2d
      k8s.io/cloud-provider-gcp/tests/e2e.(*chaosMonkeyAdapter).Test(0xc001069480, {0x7b1611463800, 0xc0005fc2d0}, 0xc00095c1e0)
      	/home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/gpu.go:183 +0x1ce
      k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do.func1()
      	/home/sean/go/pkg/mod/k8s.io/[email protected]/test/e2e/chaosmonkey/chaosmonkey.go:95 +0x6c
      created by k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do in goroutine 125
      	/home/sean/go/pkg/mod/k8s.io/[email protected]/test/e2e/chaosmonkey/chaosmonkey.go:92 +0xa5
  I0529 17:56:52.803522 1857376 util.go:650] Running ../../cluster/gce/upgrade.sh [-M v1.31.0-alpha.0.983+f44bb5e6e58c31]
  I0529 17:57:05.616098 1857376 upgrade_mechanics.go:40] Unexpected error: 
      <*errors.errorString | 0xc0006c7860>: 
      error running ../../cluster/gce/upgrade.sh [-M v1.31.0-alpha.0.983+f44bb5e6e58c31]; got error exit status 1, stdout "Fetching the previously installed CoreDNS version\nThe default etcd storage media type in 1.6 has changed from application/json to application/vnd.kubernetes.protobuf.\nDocumentation about the change can be found at https://kubernetes.io/docs/admin/etcd_upgrade.\n\nETCD2 DOES NOT SUPPORT PROTOBUF: If you wish to have to ability to downgrade to etcd2 later application/json must be used.\n\nIt's HIGHLY recommended that etcd be backed up before this step!!\n\nTo enable using json, before running this script set:\nexport STORAGE_MEDIA_TYPE=application/json\n\nTo enable using protobuf, before running this script set:\nexport STORAGE_MEDIA_TYPE=application/vnd.kubernetes.protobuf\n\n", stderr "Using image: cos-109-17800-218-37 from project: cos-cloud as master image\nUsing image: cos-109-17800-218-37 from project: cos-cloud as node image\nUsing image: cos-109-17800-218-37 from project: cos-cloud as master image\nUsing image: cos-109-17800-218-37 from project: cos-cloud as node image\nUsing image: cos-109-17800-218-37 from project: cos-cloud as master image\nUsing image: cos-109-17800-218-37 from project: cos-cloud as node image\nSTORAGE_MEDIA_TYPE must be specified when run non-interactively.\n"
      {
          s: "error running ../../cluster/gce/upgrade.sh [-M v1.31.0-alpha.0.983+f44bb5e6e58c31]; got error exit status 1, stdout \"Fetching the previously installed CoreDNS version\\nThe default etcd storage media type in 1.6 has changed from application/json to application/vnd.kubernetes.protobuf.\\nDocumentation about the change can be found at https://kubernetes.io/docs/admin/etcd_upgrade.\\n\\nETCD2 DOES NOT SUPPORT PROTOBUF: If you wish to have to ability to downgrade to etcd2 later application/json must be used.\\n\\nIt's HIGHLY recommended that etcd be backed up before this step!!\\n\\nTo enable using json, before running this script set:\\nexport STORAGE_MEDIA_TYPE=application/json\\n\\nTo enable using protobuf, before running this script set:\\nexport STORAGE_MEDIA_TYPE=application/vnd.kubernetes.protobuf\\n\\n\", stderr \"Using image: cos-109-17800-218-37 from project: cos-cloud as master image\\nUsing image: cos-109-17800-218-37 from project: cos-cloud as node image\\nUsing image: cos-109-17800-218-37 from project: cos-cloud as master image\\nUsing image: cos-109-17800-218-37 from project: cos-cloud as node image\\nUsing image: cos-109-17800-218-37 from project: cos-cloud as master image\\nUsing image: cos-109-17800-218-37 from project: cos-cloud as node image\\nSTORAGE_MEDIA_TYPE must be specified when run non-interactively.\\n\"",
      }
  [FAILED] in [It] - /home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:115 @ 05/29/24 17:57:05.617
  STEP: Destroying namespace "nvidia-gpu-upgrade-sig-node-sig-scheduling-5395" for this suite. @ 05/29/24 17:57:05.618
  STEP: Destroying namespace "gpu-upgrade-5929" for this suite. @ 05/29/24 17:57:05.687

seans3 · May 30 '24 01:05

This issue is currently awaiting triage.

If the repository maintainers determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · May 30 '24 01:05

/assign @BenTheElder

seans3 · May 30 '24 01:05

/test pull-cloud-provider-gcp-e2e

seans3 · May 30 '24 01:05

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: seans3. Once this PR has been reviewed and has the lgtm label, please ask for approval from bentheelder. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

  • Approvers can indicate their approval by writing /approve in a comment
  • Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · May 30 '24 03:05

/test pull-cloud-provider-gcp-e2e

seans3 · May 30 '24 03:05

@seans3: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  Test name: pull-cloud-provider-gcp-e2e
  Commit: 8a4d4954169a26e8da89833f7a62aaadd2a6760c
  Details: link
  Required: false
  Rerun command: /test pull-cloud-provider-gcp-e2e

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

k8s-ci-robot · May 30 '24 04:05

/hold

seans3 · Jun 04 '24 05:06

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Sep 02 '24 06:09

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · Oct 02 '24 07:10

We have new generic nvidia GPU tests upstream now on AWS + GCP.

Maybe we don't want to maintain these anymore.

BenTheElder · Oct 03 '24 05:10

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot · Nov 02 '24 06:11

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · Nov 02 '24 06:11