Migrate e2e tests for GPUs

Open seans3 opened this issue 8 months ago • 8 comments

  • Migrates 3 GPU upgrade/downgrade e2e tests from k/k in-tree to this cloud-provider-gcp repository. Migrated from file: https://github.com/kubernetes/kubernetes/blob/release-1.30/test/e2e/cloud/gcp/node/gpu.go

  • Migrates 2 Nvidia GPU e2e tests from k/k in-tree to this cloud-provider-gcp repository. Migrated from file: https://github.com/kubernetes/kubernetes/blob/release-1.30/test/e2e/scheduling/nvidia-gpus.go

  • Migrates 1 Stackdriver instrumentation e2e test from k/k in-tree to this cloud-provider-gcp repository. Migrated from file: https://github.com/kubernetes/kubernetes/blob/release-1.30/test/e2e/instrumentation/monitoring/accelerator.go

Status

Currently failing: the wait for pods controlled by the nvidia-driver-installer daemonset times out after 60 seconds (expected at least 1 pod, got 0).

Summarizing 6 Failures:
  [FAIL] [cloud-provider-gcp-e2e] Stackdriver Monitoring [It] should have accelerator metrics
  /home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:206
  [FAIL] [cloud-provider-gcp-e2e] GPUDevicePluginAcrossRecreate [It] run Nvidia GPU Device Plugin tests with a recreation
  /home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:206
  [FAIL] [cloud-provider-gcp-e2e] Device Plugin GPUs [It] run Nvidia GPU Device Plugin tests
  /home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:206
  [FAIL] [cloud-provider-gcp-e2e] GPU Upgrade cluster upgrade [It] should be able to run gpu pod after upgrade
  /home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:206
  [FAIL] [cloud-provider-gcp-e2e] GPU Upgrade master upgrade [It] should NOT disrupt gpu pod
  /home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:206
  [FAIL] [cloud-provider-gcp-e2e] GPU Upgrade cluster downgrade [It] should be able to run gpu pod after downgrade
  /home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:206

Ran 9 of 290 Specs in 192.499 seconds
FAIL! -- 3 Passed | 6 Failed | 0 Pending | 281 Skipped
 [FAILED] failed to get pods controlled by the nvidia-driver-installer daemonset: Timed out after 60.000s.
  expected at least 1 pods, only got 0

 I0529 17:55:52.560354 1857376 upgrade_context.go:86] Version for "ci/latest" is "v1.31.0-alpha.0.983+f44bb5e6e58c31\n"
  I0529 17:55:52.677652 1857376 nvidia-gpu.go:144] Nodename: kt2-1717029961633-master, OS Image: Container-Optimized OS from Google
  I0529 17:55:52.677662 1857376 nvidia-gpu.go:144] Nodename: kt2-1717029961633-minion-group-2hm0, OS Image: Container-Optimized OS from Google
  I0529 17:55:52.677664 1857376 nvidia-gpu.go:144] Nodename: kt2-1717029961633-minion-group-br6z, OS Image: Container-Optimized OS from Google
  I0529 17:55:52.677666 1857376 nvidia-gpu.go:144] Nodename: kt2-1717029961633-minion-group-d2g3, OS Image: Container-Optimized OS from Google
  I0529 17:55:52.677669 1857376 nvidia-gpu.go:101] Using default local nvidia-driver-installer daemonset manifest.
  I0529 17:55:52.743577 1857376 nvidia-gpu.go:112] Successfully created daemonset to install Nvidia drivers.
  I0529 17:56:52.803452 1857376 nvidia-gpu.go:115] Failed inside E2E framework:
      k8s.io/kubernetes/test/e2e/framework/pod.WaitForPods({0x7b1611463800, 0xc0005fc2d0}, {0x43623f0, 0xc0013008c0}, {0xc000aac7b0, 0x2f}, {{{0x0, 0x0}, {0x0, 0x0}}, ...}, ...)
      	/home/sean/go/pkg/mod/k8s.io/[email protected]/test/e2e/framework/pod/wait.go:327 +0x625
      k8s.io/kubernetes/test/e2e/framework/pod.WaitForPodsWithLabel({0x7b1611463800, 0xc0005fc2d0}, {0x43623f0, 0xc0013008c0}, {0xc000aac7b0, 0x2f}, {0x433ace0?, 0xc000aea8a0?})
      	/home/sean/go/pkg/mod/k8s.io/[email protected]/test/e2e/framework/pod/wait.go:657 +0x119
      k8s.io/kubernetes/test/e2e/framework/resource.WaitForControlledPods({0x7b1611463800, 0xc0005fc2d0}, {0x43623f0, 0xc0013008c0}, {0xc000aac7b0, 0x2f}, {0xc000161e60?, 0xc0008e13c0?}, {{0x3db2e53, 0xa}, ...})
      	/home/sean/go/pkg/mod/k8s.io/[email protected]/test/e2e/framework/resource/resources.go:249 +0xd8
      k8s.io/cloud-provider-gcp/tests/e2e.SetupNVIDIAGPUNode({0x7b1611463800, 0xc0005fc2d0}, 0xc0010a0000, 0x0)
      	/home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:114 +0x4cc
      k8s.io/cloud-provider-gcp/tests/e2e.(*NvidiaGPUUpgradeTest).Setup(0xc000fd7728?, {0x7b1611463800, 0xc0005fc2d0}, 0xc0010a0000)
      	/home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:61 +0x2d
      k8s.io/cloud-provider-gcp/tests/e2e.(*chaosMonkeyAdapter).Test(0xc001069480, {0x7b1611463800, 0xc0005fc2d0}, 0xc00095c1e0)
      	/home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/gpu.go:183 +0x1ce
      k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do.func1()
      	/home/sean/go/pkg/mod/k8s.io/[email protected]/test/e2e/chaosmonkey/chaosmonkey.go:95 +0x6c
      created by k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do in goroutine 125
      	/home/sean/go/pkg/mod/k8s.io/[email protected]/test/e2e/chaosmonkey/chaosmonkey.go:92 +0xa5
  I0529 17:56:52.803522 1857376 util.go:650] Running ../../cluster/gce/upgrade.sh [-M v1.31.0-alpha.0.983+f44bb5e6e58c31]
  I0529 17:57:05.616098 1857376 upgrade_mechanics.go:40] Unexpected error: 
      <*errors.errorString | 0xc0006c7860>: 
      error running ../../cluster/gce/upgrade.sh [-M v1.31.0-alpha.0.983+f44bb5e6e58c31]; got error exit status 1, stdout "Fetching the previously installed CoreDNS version\nThe default etcd storage media type in 1.6 has changed from application/json to application/vnd.kubernetes.protobuf.\nDocumentation about the change can be found at https://kubernetes.io/docs/admin/etcd_upgrade.\n\nETCD2 DOES NOT SUPPORT PROTOBUF: If you wish to have to ability to downgrade to etcd2 later application/json must be used.\n\nIt's HIGHLY recommended that etcd be backed up before this step!!\n\nTo enable using json, before running this script set:\nexport STORAGE_MEDIA_TYPE=application/json\n\nTo enable using protobuf, before running this script set:\nexport STORAGE_MEDIA_TYPE=application/vnd.kubernetes.protobuf\n\n", stderr "Using image: cos-109-17800-218-37 from project: cos-cloud as master image\nUsing image: cos-109-17800-218-37 from project: cos-cloud as node image\nUsing image: cos-109-17800-218-37 from project: cos-cloud as master image\nUsing image: cos-109-17800-218-37 from project: cos-cloud as node image\nUsing image: cos-109-17800-218-37 from project: cos-cloud as master image\nUsing image: cos-109-17800-218-37 from project: cos-cloud as node image\nSTORAGE_MEDIA_TYPE must be specified when run non-interactively.\n"
      {
          s: "error running ../../cluster/gce/upgrade.sh [-M v1.31.0-alpha.0.983+f44bb5e6e58c31]; got error exit status 1, stdout \"Fetching the previously installed CoreDNS version\\nThe default etcd storage media type in 1.6 has changed from application/json to application/vnd.kubernetes.protobuf.\\nDocumentation about the change can be found at https://kubernetes.io/docs/admin/etcd_upgrade.\\n\\nETCD2 DOES NOT SUPPORT PROTOBUF: If you wish to have to ability to downgrade to etcd2 later application/json must be used.\\n\\nIt's HIGHLY recommended that etcd be backed up before this step!!\\n\\nTo enable using json, before running this script set:\\nexport STORAGE_MEDIA_TYPE=application/json\\n\\nTo enable using protobuf, before running this script set:\\nexport STORAGE_MEDIA_TYPE=application/vnd.kubernetes.protobuf\\n\\n\", stderr \"Using image: cos-109-17800-218-37 from project: cos-cloud as master image\\nUsing image: cos-109-17800-218-37 from project: cos-cloud as node image\\nUsing image: cos-109-17800-218-37 from project: cos-cloud as master image\\nUsing image: cos-109-17800-218-37 from project: cos-cloud as node image\\nUsing image: cos-109-17800-218-37 from project: cos-cloud as master image\\nUsing image: cos-109-17800-218-37 from project: cos-cloud as node image\\nSTORAGE_MEDIA_TYPE must be specified when run non-interactively.\\n\"",
      }
  [FAILED] in [It] - /home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:115 @ 05/29/24 17:57:05.617
  STEP: Destroying namespace "nvidia-gpu-upgrade-sig-node-sig-scheduling-5395" for this suite. @ 05/29/24 17:57:05.618
  STEP: Destroying namespace "gpu-upgrade-5929" for this suite. @ 05/29/24 17:57:05.687
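
The upgrade.sh error in the log above says STORAGE_MEDIA_TYPE must be specified when the script runs non-interactively. One possible way to satisfy that from Go is to set the variable in the child process environment before invoking the script. The sketch below assumes that approach; `withEnv` and `upgradeCommand` are hypothetical helpers, not functions from this repository or the e2e framework, and the command is built but not executed.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// withEnv returns a copy of base with key set to value, dropping any
// existing entry for key so the new value is the only one present.
func withEnv(base []string, key, value string) []string {
	prefix := key + "="
	out := make([]string, 0, len(base)+1)
	for _, kv := range base {
		if !strings.HasPrefix(kv, prefix) {
			out = append(out, kv)
		}
	}
	return append(out, prefix+value)
}

// upgradeCommand builds the script invocation with STORAGE_MEDIA_TYPE
// set in the child environment. It does not run the command.
func upgradeCommand(script string, args []string, mediaType string) *exec.Cmd {
	cmd := exec.Command(script, args...)
	cmd.Env = withEnv(os.Environ(), "STORAGE_MEDIA_TYPE", mediaType)
	return cmd
}

func main() {
	// Script path, flag, and version are copied from the log; the media
	// type is one of the two values the script's own message suggests.
	cmd := upgradeCommand(
		"../../cluster/gce/upgrade.sh",
		[]string{"-M", "v1.31.0-alpha.0.983+f44bb5e6e58c31"},
		"application/vnd.kubernetes.protobuf",
	)
	fmt.Println(cmd.Args)
}
```

Exporting STORAGE_MEDIA_TYPE in the shell that launches the e2e run should have the same effect, assuming the test process passes its environment through to the script.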

seans3 · May 30 '24 01:05