Migrate e2e tests for GPUs

- Migrates 3 GPU upgrade/downgrade e2e tests from k/k in-tree to this cloud-provider-gcp repository. Migrated from file: https://github.com/kubernetes/kubernetes/blob/release-1.30/test/e2e/cloud/gcp/node/gpu.go
- Migrates 2 Nvidia GPU e2e tests from k/k in-tree to this cloud-provider-gcp repository. Migrated from file: https://github.com/kubernetes/kubernetes/blob/release-1.30/test/e2e/scheduling/nvidia-gpus.go
- Migrates 1 Stackdriver instrumentation e2e test from k/k in-tree to this cloud-provider-gcp repository. Migrated from file: https://github.com/kubernetes/kubernetes/blob/release-1.30/test/e2e/instrumentation/monitoring/accelerator.go
Status
Currently failing: the wait for pods from the nvidia-driver-installer daemonset times out (no pods appear within 60s). Failure summary and logs below.
Summarizing 6 Failures:
[FAIL] [cloud-provider-gcp-e2e] Stackdriver Monitoring [It] should have accelerator metrics
/home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:206
[FAIL] [cloud-provider-gcp-e2e] GPUDevicePluginAcrossRecreate [It] run Nvidia GPU Device Plugin tests with a recreation
/home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:206
[FAIL] [cloud-provider-gcp-e2e] Device Plugin GPUs [It] run Nvidia GPU Device Plugin tests
/home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:206
[FAIL] [cloud-provider-gcp-e2e] GPU Upgrade cluster upgrade [It] should be able to run gpu pod after upgrade
/home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:206
[FAIL] [cloud-provider-gcp-e2e] GPU Upgrade master upgrade [It] should NOT disrupt gpu pod
/home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:206
[FAIL] [cloud-provider-gcp-e2e] GPU Upgrade cluster downgrade [It] should be able to run gpu pod after downgrade
/home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:206
Ran 9 of 290 Specs in 192.499 seconds
FAIL! -- 3 Passed | 6 Failed | 0 Pending | 281 Skipped
[FAILED] failed to get pods controlled by the nvidia-driver-installer daemonset: Timed out after 60.000s.
expected at least 1 pods, only got 0
I0529 17:55:52.560354 1857376 upgrade_context.go:86] Version for "ci/latest" is "v1.31.0-alpha.0.983+f44bb5e6e58c31\n"
I0529 17:55:52.677652 1857376 nvidia-gpu.go:144] Nodename: kt2-1717029961633-master, OS Image: Container-Optimized OS from Google
I0529 17:55:52.677662 1857376 nvidia-gpu.go:144] Nodename: kt2-1717029961633-minion-group-2hm0, OS Image: Container-Optimized OS from Google
I0529 17:55:52.677664 1857376 nvidia-gpu.go:144] Nodename: kt2-1717029961633-minion-group-br6z, OS Image: Container-Optimized OS from Google
I0529 17:55:52.677666 1857376 nvidia-gpu.go:144] Nodename: kt2-1717029961633-minion-group-d2g3, OS Image: Container-Optimized OS from Google
I0529 17:55:52.677669 1857376 nvidia-gpu.go:101] Using default local nvidia-driver-installer daemonset manifest.
I0529 17:55:52.743577 1857376 nvidia-gpu.go:112] Successfully created daemonset to install Nvidia drivers.
I0529 17:56:52.803452 1857376 nvidia-gpu.go:115] Failed inside E2E framework:
k8s.io/kubernetes/test/e2e/framework/pod.WaitForPods({0x7b1611463800, 0xc0005fc2d0}, {0x43623f0, 0xc0013008c0}, {0xc000aac7b0, 0x2f}, {{{0x0, 0x0}, {0x0, 0x0}}, ...}, ...)
/home/sean/go/pkg/mod/k8s.io/[email protected]/test/e2e/framework/pod/wait.go:327 +0x625
k8s.io/kubernetes/test/e2e/framework/pod.WaitForPodsWithLabel({0x7b1611463800, 0xc0005fc2d0}, {0x43623f0, 0xc0013008c0}, {0xc000aac7b0, 0x2f}, {0x433ace0?, 0xc000aea8a0?})
/home/sean/go/pkg/mod/k8s.io/[email protected]/test/e2e/framework/pod/wait.go:657 +0x119
k8s.io/kubernetes/test/e2e/framework/resource.WaitForControlledPods({0x7b1611463800, 0xc0005fc2d0}, {0x43623f0, 0xc0013008c0}, {0xc000aac7b0, 0x2f}, {0xc000161e60?, 0xc0008e13c0?}, {{0x3db2e53, 0xa}, ...})
/home/sean/go/pkg/mod/k8s.io/[email protected]/test/e2e/framework/resource/resources.go:249 +0xd8
k8s.io/cloud-provider-gcp/tests/e2e.SetupNVIDIAGPUNode({0x7b1611463800, 0xc0005fc2d0}, 0xc0010a0000, 0x0)
/home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:114 +0x4cc
k8s.io/cloud-provider-gcp/tests/e2e.(*NvidiaGPUUpgradeTest).Setup(0xc000fd7728?, {0x7b1611463800, 0xc0005fc2d0}, 0xc0010a0000)
/home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:61 +0x2d
k8s.io/cloud-provider-gcp/tests/e2e.(*chaosMonkeyAdapter).Test(0xc001069480, {0x7b1611463800, 0xc0005fc2d0}, 0xc00095c1e0)
/home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/gpu.go:183 +0x1ce
k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do.func1()
/home/sean/go/pkg/mod/k8s.io/[email protected]/test/e2e/chaosmonkey/chaosmonkey.go:95 +0x6c
created by k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do in goroutine 125
/home/sean/go/pkg/mod/k8s.io/[email protected]/test/e2e/chaosmonkey/chaosmonkey.go:92 +0xa5
I0529 17:56:52.803522 1857376 util.go:650] Running ../../cluster/gce/upgrade.sh [-M v1.31.0-alpha.0.983+f44bb5e6e58c31]
I0529 17:57:05.616098 1857376 upgrade_mechanics.go:40] Unexpected error:
<*errors.errorString | 0xc0006c7860>:
error running ../../cluster/gce/upgrade.sh [-M v1.31.0-alpha.0.983+f44bb5e6e58c31]; got error exit status 1, stdout "Fetching the previously installed CoreDNS version\nThe default etcd storage media type in 1.6 has changed from application/json to application/vnd.kubernetes.protobuf.\nDocumentation about the change can be found at https://kubernetes.io/docs/admin/etcd_upgrade.\n\nETCD2 DOES NOT SUPPORT PROTOBUF: If you wish to have to ability to downgrade to etcd2 later application/json must be used.\n\nIt's HIGHLY recommended that etcd be backed up before this step!!\n\nTo enable using json, before running this script set:\nexport STORAGE_MEDIA_TYPE=application/json\n\nTo enable using protobuf, before running this script set:\nexport STORAGE_MEDIA_TYPE=application/vnd.kubernetes.protobuf\n\n", stderr "Using image: cos-109-17800-218-37 from project: cos-cloud as master image\nUsing image: cos-109-17800-218-37 from project: cos-cloud as node image\nUsing image: cos-109-17800-218-37 from project: cos-cloud as master image\nUsing image: cos-109-17800-218-37 from project: cos-cloud as node image\nUsing image: cos-109-17800-218-37 from project: cos-cloud as master image\nUsing image: cos-109-17800-218-37 from project: cos-cloud as node image\nSTORAGE_MEDIA_TYPE must be specified when run non-interactively.\n"
{
s: "error running ../../cluster/gce/upgrade.sh [-M v1.31.0-alpha.0.983+f44bb5e6e58c31]; got error exit status 1, stdout \"Fetching the previously installed CoreDNS version\\nThe default etcd storage media type in 1.6 has changed from application/json to application/vnd.kubernetes.protobuf.\\nDocumentation about the change can be found at https://kubernetes.io/docs/admin/etcd_upgrade.\\n\\nETCD2 DOES NOT SUPPORT PROTOBUF: If you wish to have to ability to downgrade to etcd2 later application/json must be used.\\n\\nIt's HIGHLY recommended that etcd be backed up before this step!!\\n\\nTo enable using json, before running this script set:\\nexport STORAGE_MEDIA_TYPE=application/json\\n\\nTo enable using protobuf, before running this script set:\\nexport STORAGE_MEDIA_TYPE=application/vnd.kubernetes.protobuf\\n\\n\", stderr \"Using image: cos-109-17800-218-37 from project: cos-cloud as master image\\nUsing image: cos-109-17800-218-37 from project: cos-cloud as node image\\nUsing image: cos-109-17800-218-37 from project: cos-cloud as master image\\nUsing image: cos-109-17800-218-37 from project: cos-cloud as node image\\nUsing image: cos-109-17800-218-37 from project: cos-cloud as master image\\nUsing image: cos-109-17800-218-37 from project: cos-cloud as node image\\nSTORAGE_MEDIA_TYPE must be specified when run non-interactively.\\n\"",
}
[FAILED] in [It] - /home/sean/go/src/k8s.io/cloud-provider-gcp/test/e2e/nvidia-gpu.go:115 @ 05/29/24 17:57:05.617
STEP: Destroying namespace "nvidia-gpu-upgrade-sig-node-sig-scheduling-5395" for this suite. @ 05/29/24 17:57:05.618
STEP: Destroying namespace "gpu-upgrade-5929" for this suite. @ 05/29/24 17:57:05.687
This issue is currently awaiting triage.
If the repository maintainers determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/assign @BenTheElder
/test pull-cloud-provider-gcp-e2e
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: seans3
Once this PR has been reviewed and has the lgtm label, please ask for approval from bentheelder. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/test pull-cloud-provider-gcp-e2e
@seans3: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-cloud-provider-gcp-e2e | 8a4d4954169a26e8da89833f7a62aaadd2a6760c | link | false | /test pull-cloud-provider-gcp-e2e |
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/hold
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:
- Mark this PR as fresh with /remove-lifecycle stale
- Close this PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:
- Mark this PR as fresh with /remove-lifecycle rotten
- Close this PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
We have new generic nvidia GPU tests upstream now on AWS + GCP.
Maybe we don't want to maintain these anymore.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:
- Reopen this PR with /reopen
- Mark this PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closed this PR.
In response to this:
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:
- Reopen this PR with /reopen
- Mark this PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.