terraform-google-kubernetes-engine

feat: add support for gpu_sharing_config on nodepool

Open: jimgus opened this pull request 1 year ago • 7 comments

Fixes #1506
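For context, GPU sharing on a GKE node pool is exposed in the underlying `google_container_node_pool` resource through a `gpu_sharing_config` block nested under `guest_accelerator`. A minimal sketch of the provider-level feature this PR surfaces in the module (the pool/cluster names, location, and values below are illustrative only, and the module's own variable names may differ):

```hcl
# Illustrative only: shows the provider block this PR exposes, not the
# module's interface. Names and values here are placeholders.
resource "google_container_node_pool" "gpu_pool" {
  name     = "gpu-sharing-pool"
  cluster  = "example-cluster"
  location = "europe-west4"

  node_config {
    machine_type = "n1-standard-4"

    guest_accelerator {
      type  = "nvidia-tesla-p4"
      count = 1

      # Time-share one physical GPU between up to 2 containers.
      gpu_sharing_config {
        gpu_sharing_strategy       = "TIME_SHARING"
        max_shared_clients_per_gpu = 2
      }
    }
  }
}
```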

jimgus avatar Feb 13 '24 11:02 jimgus

@ericyz Would it be possible to get a review on this PR?

jimgus avatar Feb 27 '24 07:02 jimgus

@apeabody do you have time to look at this?

nissessenap avatar Mar 13 '24 13:03 nissessenap

/gcbrun

apeabody avatar Mar 13 '24 16:03 apeabody

Integration tests results:

-----> Setting up <node-pool-local>...
       Finished setting up <node-pool-local> (0m0.00s).
-----> Verifying <node-pool-local>...
$$$$$$ Reading the Terraform input variables from the Kitchen instance state...
$$$$$$ Finished reading the Terraform input variables from the Kitchen instance state.
$$$$$$ Reading the Terraform output variables from the Kitchen instance state...
$$$$$$ Finished reading the Terraform output variables from the Kitchen instance state.
$$$$$$ Verifying the systems...
$$$$$$ Verifying the 'node_pool' system...
WARN: Unresolved or ambiguous specs during Gem::Specification.reset:
      racc (>= 0)
      Available/installed versions of this gem:
      - 1.7.3
      - 1.6.2
WARN: Clearing out unresolved specs. Try 'gem cleanup <gem>'
Please report a bug if this causes problems.

Profile:   node_pool
Version:   (not specified)
Target:    local://
Target ID: ad86e15389e971002617c67a58841346aa7359db6f29de507ad5c7314325519e

  ×  gcloud: Google Compute Engine GKE configuration (1 failed)
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` exit_status is expected to eq 0
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` stderr is expected to eq ""
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` cluster-autoscaling has the expected cluster autoscaling settings
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools has 3
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 exists
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 is the expected machine type
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 has the expected image type
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 has autoscaling enabled
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 has the expected minimum node count
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 has autorepair enabled
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 has automatic upgrades enabled
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 has the expected metadata
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 has the expected labels
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 has the expected network tags
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 has the expected linux node config sysctls
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 exists
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 is the expected machine type
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 has autoscaling enabled
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 has the expected minimum node count
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 has the expected maximum node count
     ×  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 has the expected accelerators
     expected [{"config" => {"diskSizeGb" => 100, "diskType" => "pd-balanced", "imageType" => "COS_CONTAINERD", "loggingCon...RUNNING", "upgradeSettings" => {"maxSurge" => 1, "strategy" => "SURGE"}, "version" => "1.27.8-gke.1067004"}] to include (including {"name" => "pool-02", "config" => (including {"accelerators" => [{"acceleratorCount" => "1", "acceleratorType" => "nvidia-tesla-p4"}]})})
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 has the expected disk size
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 has the expected disk type
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 has the expected image type
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 has the expected labels
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 has the expected network tags
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 has the expected linux node config sysctls
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 exists
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 is the expected machine type
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 has autoscaling disabled
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 has the expected node count
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 has autorepair enabled
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 has automatic upgrades enabled
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 has the expected labels
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 has the expected network tags
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 has the expected pod range
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 has the expected image
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 has the expected kubelet config
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 has the expected linux node config sysctls
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` pool-03 has nodes in correct locations
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` exit_status is expected to eq 0
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` stderr is expected to eq ""
  ✔  kubectl: Kubernetes configuration
     ✔  kubernetes nodes pool-01 has the expected taints
     ✔  kubernetes nodes pool-02 has the expected taints
     ✔  kubernetes nodes pool-03 has the expected taints


Profile Summary: 1 successful control, 1 control failure, 0 controls skipped
Test Summary: 44 successful, 1 failure, 0 skipped
>>>>>> Verifying the 'node_pool' system failed:
	Running InSpec failed:
		Running InSpec failed due to a non-zero exit code of 1.

apeabody avatar Mar 13 '24 19:03 apeabody

OK, I should apparently not have updated the examples. Didn't realize they were part of the tests. Will revert those changes.

jimgus avatar Mar 13 '24 20:03 jimgus

> OK, I should apparently not have updated the examples. Didn't realize they were part of the tests. Will revert those changes.

Thanks @jimgus - Alternatively, you could update the tests to match the updated examples.

apeabody avatar Mar 13 '24 20:03 apeabody

/gcbrun

apeabody avatar Mar 14 '24 04:03 apeabody

This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days

github-actions[bot] avatar May 14 '24 23:05 github-actions[bot]

It seems like @jimgus has not been active here for a couple of months 😭 but it would be amazing to get this over the line.

@apeabody would it be possible to finish this off? I'd be happy to submit a new PR as well, of course. Whatever helps this move forward, ideally without taking away from Jim's contribution.

SamuZad avatar May 22 '24 17:05 SamuZad

/gcbrun

apeabody avatar May 22 '24 20:05 apeabody

> It seems like @jimgus has not been active here for a couple of months 😭 but it would be amazing to get this over the line.
>
> @apeabody would it be possible to finish this off? I'd be happy to submit a new PR as well, of course. Whatever helps this move forward, ideally without taking away from Jim's contribution.

Hi @SamuZad - Let me run the tests on this PR to see if it could be chained with a second PR.

apeabody avatar May 22 '24 20:05 apeabody

Sorry for not being able to work on this issue for a while. I made updates to the README as suggested by @apeabody.

jimgus avatar May 23 '24 07:05 jimgus

/gcbrun

apeabody avatar May 23 '24 16:05 apeabody

/gcbrun

apeabody avatar May 23 '24 19:05 apeabody

Going to re-trigger the test:

Step #32 - "apply simple-regional-with-networking-local": TestSimpleRegionalWithNetworking 2024-05-23T19:58:50Z command.go:185: 
Step #32 - "apply simple-regional-with-networking-local": TestSimpleRegionalWithNetworking 2024-05-23T19:58:50Z command.go:185: Error: NodePool default-node-pool was created in the error state "ERROR"
Step #32 - "apply simple-regional-with-networking-local": TestSimpleRegionalWithNetworking 2024-05-23T19:58:50Z command.go:185: 
Step #32 - "apply simple-regional-with-networking-local": TestSimpleRegionalWithNetworking 2024-05-23T19:58:50Z command.go:185:   with module.example.module.gke.google_container_node_pool.pools["default-node-pool"],
Step #32 - "apply simple-regional-with-networking-local": TestSimpleRegionalWithNetworking 2024-05-23T19:58:50Z command.go:185:   on ../../../cluster.tf line 462, in resource "google_container_node_pool" "pools":
Step #32 - "apply simple-regional-with-networking-local": TestSimpleRegionalWithNetworking 2024-05-23T19:58:50Z command.go:185:  462: resource "google_container_node_pool" "pools" {
Step #32 - "apply simple-regional-with-networking-local": TestSimpleRegionalWithNetworking 2024-05-23T19:58:50Z command.go:185: 
Step #32 - "apply simple-regional-with-networking-local": TestSimpleRegionalWithNetworking 2024-05-23T19:58:50Z retry.go:99: Returning due to fatal error: FatalError{Underlying: error while running command: exit status 1; 

apeabody avatar May 23 '24 21:05 apeabody

/gcbrun

apeabody avatar May 23 '24 23:05 apeabody

Confirmed `gpu_sharing_config` is present in TPG (Terraform Provider Google) v5.9.0.

apeabody avatar May 24 '24 17:05 apeabody
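Since the field only exists in newer provider releases, a consumer of this feature would need to pin at least that provider version. A hedged sketch (the `>= 5.9.0` floor follows the note above; the module itself may declare a different constraint):

```hcl
# Pin the google provider to a release that includes
# guest_accelerator.gpu_sharing_config (per the note above, >= 5.9.0).
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = ">= 5.9.0"
    }
  }
}
```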

Thank you both so much for the swift turnaround! You have my (and others') eternal gratitude

SamuZad avatar May 24 '24 23:05 SamuZad