terraform-google-kubernetes-engine
feat: add support for gpu_sharing_config on nodepool
Fixes #1506
@ericyz Would it be possible to get a review on this PR?
@apeabody do you have time to look at this?
/gcbrun
Integration tests results:
-----> Setting up <node-pool-local>...
Finished setting up <node-pool-local> (0m0.00s).
-----> Verifying <node-pool-local>...
$$$$$$ Reading the Terraform input variables from the Kitchen instance state...
$$$$$$ Finished reading the Terraform input variables from the Kitchen instance state.
$$$$$$ Reading the Terraform output variables from the Kitchen instance state...
$$$$$$ Finished reading the Terraform output variables from the Kitchen instance state.
$$$$$$ Verifying the systems...
$$$$$$ Verifying the 'node_pool' system...
WARN: Unresolved or ambiguous specs during Gem::Specification.reset:
racc (>= 0)
Available/installed versions of this gem:
- 1.7.3
- 1.6.2
WARN: Clearing out unresolved specs. Try 'gem cleanup <gem>'
Please report a bug if this causes problems.
Profile: node_pool
Version: (not specified)
Target: local://
Target ID: ad86e15389e971002617c67a58841346aa7359db6f29de507ad5c7314325519e
× gcloud: Google Compute Engine GKE configuration (1 failed)
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` exit_status is expected to eq 0
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` stderr is expected to eq ""
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` cluster-autoscaling has the expected cluster autoscaling settings
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools has 3
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 exists
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 is the expected machine type
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 has the expected image type
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 has autoscaling enabled
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 has the expected minimum node count
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 has autorepair enabled
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 has automatic upgrades enabled
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 has the expected metadata
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 has the expected labels
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 has the expected network tags
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 has the expected linux node config sysctls
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 exists
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 is the expected machine type
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 has autoscaling enabled
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 has the expected minimum node count
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 has the expected maximum node count
× Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 has the expected accelerators
expected [{"config" => {"diskSizeGb" => 100, "diskType" => "pd-balanced", "imageType" => "COS_CONTAINERD", "loggingCon...RUNNING", "upgradeSettings" => {"maxSurge" => 1, "strategy" => "SURGE"}, "version" => "1.27.8-gke.1067004"}] to include (including {"name" => "pool-02", "config" => (including {"accelerators" => [{"acceleratorCount" => "1", "acceleratorType" => "nvidia-tesla-p4"}]})})
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 has the expected disk size
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 has the expected disk type
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 has the expected image type
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 has the expected labels
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 has the expected network tags
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 has the expected linux node config sysctls
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 exists
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 is the expected machine type
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 has autoscaling disabled
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 has the expected node count
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 has autorepair enabled
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 has automatic upgrades enabled
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 has the expected labels
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 has the expected network tags
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 has the expected pod range
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 has the expected image
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 has the expected kubelet config
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 has the expected linux node config sysctls
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` pool-03 has nodes in correct locations
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` exit_status is expected to eq 0
✔ Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` stderr is expected to eq ""
✔ kubectl: Kubernetes configuration
✔ kubernetes nodes pool-01 has the expected taints
✔ kubernetes nodes pool-02 has the expected taints
✔ kubernetes nodes pool-03 has the expected taints
Profile Summary: 1 successful control, 1 control failure, 0 controls skipped
Test Summary: 44 successful, 1 failure, 0 skipped
>>>>>> Verifying the 'node_pool' system failed:
Running InSpec failed:
Running InSpec failed due to a non-zero exit code of 1.
OK, I should apparently not have updated the examples. Didn't realize they were part of the tests. Will revert those changes.
Thanks @jimgus - Alternatively, you could update the tests to match the updated examples.
/gcbrun
This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days
It seems @jimgus has not been active here for a couple of months 😭, but it would be amazing to get this over the line.
@apeabody would it be possible to finish this off? I'd be happy to submit a new PR as well, of course. Whatever helps move this forward, ideally without taking away from Jim's contribution.
/gcbrun
Hi @SamuZad - Let me run the tests on this PR to see if it could be chained with a second PR.
Sorry for not being able to work on this issue for a while. I have updated the README as suggested by @apeabody.
/gcbrun
/gcbrun
Going to re-trigger the test:
Step #32 - "apply simple-regional-with-networking-local": TestSimpleRegionalWithNetworking 2024-05-23T19:58:50Z command.go:185:
Step #32 - "apply simple-regional-with-networking-local": TestSimpleRegionalWithNetworking 2024-05-23T19:58:50Z command.go:185: Error: NodePool default-node-pool was created in the error state "ERROR"
Step #32 - "apply simple-regional-with-networking-local": TestSimpleRegionalWithNetworking 2024-05-23T19:58:50Z command.go:185:
Step #32 - "apply simple-regional-with-networking-local": TestSimpleRegionalWithNetworking 2024-05-23T19:58:50Z command.go:185: with module.example.module.gke.google_container_node_pool.pools["default-node-pool"],
Step #32 - "apply simple-regional-with-networking-local": TestSimpleRegionalWithNetworking 2024-05-23T19:58:50Z command.go:185: on ../../../cluster.tf line 462, in resource "google_container_node_pool" "pools":
Step #32 - "apply simple-regional-with-networking-local": TestSimpleRegionalWithNetworking 2024-05-23T19:58:50Z command.go:185: 462: resource "google_container_node_pool" "pools" {
Step #32 - "apply simple-regional-with-networking-local": TestSimpleRegionalWithNetworking 2024-05-23T19:58:50Z command.go:185:
Step #32 - "apply simple-regional-with-networking-local": TestSimpleRegionalWithNetworking 2024-05-23T19:58:50Z retry.go:99: Returning due to fatal error: FatalError{Underlying: error while running command: exit status 1;
/gcbrun
Confirmed that gpu_sharing_config is present in TPG (terraform-provider-google) v5.9.0.
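For readers unfamiliar with the field, here is a minimal sketch of how gpu_sharing_config nests inside a guest_accelerator block on the provider's google_container_node_pool resource. It shows the provider-level shape only, not this module's input variables. The resource name, cluster name, machine type, accelerator type, and sharing values are all hypothetical placeholders chosen for illustration.

```hcl
# Illustrative only: every name and value below is an assumption,
# not taken from this PR or from the module's examples.
resource "google_container_node_pool" "gpu_pool" {
  name       = "gpu-pool"        # hypothetical pool name
  cluster    = "example-cluster" # hypothetical cluster name
  location   = "europe-west4"
  node_count = 1

  node_config {
    machine_type = "n1-standard-4" # placeholder machine type

    guest_accelerator {
      type  = "nvidia-tesla-t4" # placeholder accelerator type
      count = 1

      # Lets several workloads share one physical GPU.
      gpu_sharing_config {
        gpu_sharing_strategy       = "TIME_SHARING"
        max_shared_clients_per_gpu = 2
      }
    }
  }
}
```

Whether a given accelerator type supports sharing in a particular zone should be checked against the GKE documentation before applying anything like this.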
Thank you both so much for the swift turnaround! You have my (and others') eternal gratitude