terraform-google-kubernetes-engine

fix: Add the nodepool cgroup mode to the NAP config

Open · davidholsgrove opened this pull request 6 months ago · 24 comments

Closes #2321

Add the nodepool cgroup mode to the NAP config
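
For context, a minimal sketch of the shape this change exposes through the module's NAP config, assuming a google provider version that supports node_pool_auto_config.linux_node_config (illustrative only, not the actual diff):

```hcl
resource "google_container_cluster" "primary" {
  # ... existing cluster configuration ...

  # Defaults applied to node pools created by Node Auto-Provisioning (NAP)
  node_pool_auto_config {
    linux_node_config {
      # One of CGROUP_MODE_UNSPECIFIED, CGROUP_MODE_V1, CGROUP_MODE_V2
      cgroup_mode = "CGROUP_MODE_V2"
    }
  }
}
```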

davidholsgrove avatar May 22 '25 23:05 davidholsgrove

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

google-cla[bot] avatar May 22 '25 23:05 google-cla[bot]

Hi @davidholsgrove - Can you please check the CLA? Thanks!

apeabody avatar May 29 '25 18:05 apeabody

CLA updated. Thanks @apeabody

davidholsgrove avatar May 30 '25 05:05 davidholsgrove

/gcbrun

apeabody avatar May 30 '25 22:05 apeabody

Hi @davidholsgrove - From the CI Tests:

        	Error:      	Received unexpected error:
        	            	FatalError{Underlying: error while running command: exit status 1; 
        	            	Error: Error updating LinuxNodeConfig: googleapi: Error 400: Setting cgroup_mode to v1 is not allowed for clusters created with version 1.32.3-gke.1927009.
        	            	Details:
        	            	[
        	            	  {
        	            	    "@type": "type.googleapis.com/google.rpc.RequestInfo",
        	            	    "requestId": "0x261042fcc5e9d554"
        	            	  }
        	            	]
        	            	, badRequest
        	            	
        	            	  with module.example.module.gke.google_container_cluster.primary,
        	            	  on ../../../modules/beta-public-cluster/cluster.tf line 22, in resource "google_container_cluster" "primary":
        	            	  22: resource "google_container_cluster" "primary" {
        	            	}
        	Test:       	TestNodePool

apeabody avatar May 30 '25 22:05 apeabody

Hi @apeabody

I think the testcase was updated in https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/pull/2224 to downgrade from CGROUP_MODE_V2 to CGROUP_MODE_V1.

I'm not sure from https://cloud.google.com/kubernetes-engine/docs/how-to/migrate-cgroupv2#transition-plan what the default cgroup mode is for clusters created with 1.32, but the error message in the testcases suggests it's now V2, and the API is rejecting an attempt to set it to V1?

I don't have access to the Cloud Build logs to see further than the quote in your reply, apologies.

If the testcase needs to revert this diff, I can include it in my PR?

davidholsgrove avatar Jun 02 '25 01:06 davidholsgrove

Hi @davidholsgrove!

Yes, the API is rejecting it when specifically added to a new NAP; full removal of CGROUP_MODE_V1 isn't expected until v1.35.

I suggest swapping the values of pool-01 and all (and their test assets). That should fix this test while maintaining test coverage. https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/main/examples/node_pool/main.tf#L165
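
For illustration, the suggested swap in examples/node_pool/main.tf might look like this (values assumed from the surrounding discussion, not the actual diff):

```hcl
node_pools_cgroup_mode = {
  all     = "CGROUP_MODE_V2" # was CGROUP_MODE_V1; matches the 1.32 default
  pool-01 = "CGROUP_MODE_V1" # was CGROUP_MODE_V2; keeps V1 test coverage
}
```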

apeabody avatar Jun 02 '25 17:06 apeabody

/gcbrun

davidholsgrove avatar Jun 03 '25 03:06 davidholsgrove

Thanks @apeabody - testcases updated to flip the default to V2, with the exception of pool-01, to continue coverage. I don't have permissions to trigger a retest; could you run that for me and let me know if you need anything else?

davidholsgrove avatar Jun 03 '25 03:06 davidholsgrove

/gcbrun

apeabody avatar Jun 03 '25 16:06 apeabody

I'll re-run the test:

        	Error:      	Received unexpected error:
        	            	FatalError{Underlying: error while running command: exit status 1; 
        	            	Error: Error updating LinuxNodeConfig: googleapi: Error 400: Cluster is running incompatible operation operation-1748969188467-eb201d3b-0103-4996-89d9-a6cf6fa5ed15.
        	            	Details:
        	            	[
        	            	  {
        	            	    "@type": "type.googleapis.com/google.rpc.RequestInfo",
        	            	    "requestId": "0x7d2092479dc8c224"
        	            	  },
        	            	  {
        	            	    "@type": "type.googleapis.com/google.rpc.ErrorInfo",
        	            	    "domain": "container.googleapis.com",
        	            	    "reason": "CLUSTER_ALREADY_HAS_OPERATION"
        	            	  }
        	            	]
        	            	, failedPrecondition
        	            	
        	            	  with module.example.module.gke.google_container_cluster.primary,
        	            	  on ../../../modules/beta-public-cluster/cluster.tf line 22, in resource "google_container_cluster" "primary":
        	            	  22: resource "google_container_cluster" "primary" {
        	            	}
        	Test:       	TestNodePool

apeabody avatar Jun 03 '25 17:06 apeabody

Thanks @apeabody - I'm not sure how to resolve that failure without access to run the tests / dig deeper. Looks unrelated to the change in question though?

davidholsgrove avatar Jun 04 '25 03:06 davidholsgrove

Looks like we are still seeing it in the re-test; let me check a few PRs that don't include this change.

apeabody avatar Jun 04 '25 16:06 apeabody

Hi @davidholsgrove - I've checked, and this error seems to be specific to this change. I'd suggest (temporarily) removing pool-01 = "CGROUP_MODE_V1", as I'm wondering whether, once linux_node_config is first set to "CGROUP_MODE_V1" by all, pool-01 can't later be set to "CGROUP_MODE_V1"?
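
For illustration, the temporary shape of the map might be (assumed, not the actual diff):

```hcl
node_pools_cgroup_mode = {
  all = "CGROUP_MODE_V2"
  # pool-01 = "CGROUP_MODE_V1"  # temporarily removed while isolating the failure
}
```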

apeabody avatar Jun 04 '25 18:06 apeabody

Thanks @apeabody - rebased and removed the explicit setting of pool-01 = "CGROUP_MODE_V1" from the testcase. Could you trigger again?

davidholsgrove avatar Jun 09 '25 00:06 davidholsgrove

/gcbrun

apeabody avatar Jun 09 '25 16:06 apeabody

Hi @davidholsgrove - Yeah, it's still having the issue:

Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: Error: Error updating LinuxNodeConfig: googleapi: Error 400: Cluster is running incompatible operation operation-1749488222583-0c77b4da-5a67-4e0b-94ab-ad5f6a673820.
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: Details:
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: [
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:   {
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:     "@type": "type.googleapis.com/google.rpc.RequestInfo",
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:     "requestId": "0x663f464965d3f2fb"
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:   },
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:   {
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:     "@type": "type.googleapis.com/google.rpc.ErrorInfo",
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:     "domain": "container.googleapis.com",
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:     "reason": "CLUSTER_ALREADY_HAS_OPERATION"
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:   }
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: ]
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: , failedPrecondition
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: 
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:   with module.example.module.gke.google_container_cluster.primary,
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:   on ../../../modules/beta-public-cluster/cluster.tf line 22, in resource "google_container_cluster" "primary":
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:   22: resource "google_container_cluster" "primary" {
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: 
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z retry.go:99: Returning due to fatal error: FatalError{Underlying: error while running command: exit status 1; 

Looking at your change, another possibility is that the addition of local.node_pools_cgroup_mode != null to the node_pool_auto_config gating condition is also resulting in the network_tags blocks being created. I would suggest making network_tags a dynamic block as well, gated on var.cluster_autoscaling.enabled && (length(var.network_tags) > 0 || var.add_cluster_firewall_rules), to eliminate that possibility.
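
A sketch of that suggestion, assuming the module's existing var.cluster_autoscaling, var.network_tags, and var.add_cluster_firewall_rules (illustrative, not the exact module code):

```hcl
node_pool_auto_config {
  dynamic "network_tags" {
    # Emit the block only when NAP is enabled and there are tags to manage,
    # so a cgroup-only change no longer forces a network_tags block into the plan.
    for_each = var.cluster_autoscaling.enabled && (length(var.network_tags) > 0 || var.add_cluster_firewall_rules) ? [true] : []
    content {
      tags = var.network_tags
    }
  }
}
```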

apeabody avatar Jun 09 '25 17:06 apeabody

Hi @apeabody, apologies for the delay - thanks for the suggestion; yes, that makes sense to me. I've rebased to include the latest v37 / main. If you could kick off the tests again, that would be great. Thanks, David

davidholsgrove avatar Jun 18 '25 23:06 davidholsgrove

/gcbrun

apeabody avatar Jun 18 '25 23:06 apeabody

Great, that got further - this is from the test itself:

Step #55 - "verify node-pool-local":         	Error:      	Not equal: 
Step #55 - "verify node-pool-local":         	            	expected: "EFFECTIVE_CGROUP_MODE_V1"
Step #55 - "verify node-pool-local":         	            	actual  : "EFFECTIVE_CGROUP_MODE_V2"
Step #55 - "verify node-pool-local":         	            	
Step #55 - "verify node-pool-local":         	            	Diff:
Step #55 - "verify node-pool-local":         	            	--- Expected
Step #55 - "verify node-pool-local":         	            	+++ Actual
Step #55 - "verify node-pool-local":         	            	@@ -1 +1 @@
Step #55 - "verify node-pool-local":         	            	-EFFECTIVE_CGROUP_MODE_V1
Step #55 - "verify node-pool-local":         	            	+EFFECTIVE_CGROUP_MODE_V2
Step #55 - "verify node-pool-local":         	Test:       	TestNodePool
Step #55 - "verify node-pool-local":         	Messages:   	For node "pool-01" path "config.effectiveCgroupMode" expected "EFFECTIVE_CGROUP_MODE_V2" to match fixture "EFFECTIVE_CGROUP_MODE_V1"

apeabody avatar Jun 19 '25 00:06 apeabody

Thanks again @apeabody - it's hard to guess the test logic here without being able to trigger the tests or see the logs.

I've updated node_pools_cgroup_mode in examples/node_pool/main.tf to explicitly set pool-01 to CGROUP_MODE_V1 to maintain that test coverage.

Could we try again?

davidholsgrove avatar Jun 19 '25 03:06 davidholsgrove

/gcbrun

apeabody avatar Jun 20 '25 17:06 apeabody

Yes, this indicates that the observed actual value does not match the test data here: https://github.com/davidholsgrove/terraform-google-kubernetes-engine/blob/nap-cgroup-mode/test/integration/node_pool/testdata/TestNodePool.json

apeabody avatar Jun 20 '25 17:06 apeabody

OK, this error is from the cluster, rather than a specific nodepool:

Step #55 - "verify node-pool-local":         	Error:      	Not equal: 
Step #55 - "verify node-pool-local":         	            	expected: "EFFECTIVE_CGROUP_MODE_V1"
Step #55 - "verify node-pool-local":         	            	actual  : "EFFECTIVE_CGROUP_MODE_V2"
Step #55 - "verify node-pool-local":         	            	
Step #55 - "verify node-pool-local":         	            	Diff:
Step #55 - "verify node-pool-local":         	            	--- Expected
Step #55 - "verify node-pool-local":         	            	+++ Actual
Step #55 - "verify node-pool-local":         	            	@@ -1 +1 @@
Step #55 - "verify node-pool-local":         	            	-EFFECTIVE_CGROUP_MODE_V1
Step #55 - "verify node-pool-local":         	            	+EFFECTIVE_CGROUP_MODE_V2
Step #55 - "verify node-pool-local":         	Test:       	TestNodePool
Step #55 - "verify node-pool-local":         	Messages:   	For path "nodeConfig.effectiveCgroupMode" expected "EFFECTIVE_CGROUP_MODE_V2" to match fixture "EFFECTIVE_CGROUP_MODE_V1"

apeabody avatar Jun 20 '25 21:06 apeabody

Thanks @apeabody - returning to this PR, as we still see the same perma-diff with the latest releases of the module.

I think I've updated the expected result for the cluster nodepool effectiveCgroupMode. Could you rerun the tests?

davidholsgrove avatar Sep 05 '25 04:09 davidholsgrove

/gcbrun

apeabody avatar Sep 05 '25 16:09 apeabody