terraform-google-kubernetes-engine

fix: Add the nodepool cgroup mode to the NAP config

Open · davidholsgrove opened this pull request 6 months ago · 24 comments

Closes #2321

Add the nodepool cgroup mode to the NAP config
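
For context, a minimal sketch of the shape this change exposes through the module's NAP config, assuming a google provider version that supports node_pool_auto_config.linux_node_config (illustrative only, not the actual diff):

```hcl
resource "google_container_cluster" "primary" {
  # ... existing cluster configuration ...

  # Defaults applied to node pools created by Node Auto-Provisioning (NAP)
  node_pool_auto_config {
    linux_node_config {
      # One of CGROUP_MODE_UNSPECIFIED, CGROUP_MODE_V1, CGROUP_MODE_V2
      cgroup_mode = "CGROUP_MODE_V2"
    }
  }
}
```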

davidholsgrove avatar May 22 '25 23:05 davidholsgrove

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

google-cla[bot] avatar May 22 '25 23:05 google-cla[bot]

Hi @davidholsgrove - Can you please check the CLA? Thanks!

apeabody avatar May 29 '25 18:05 apeabody

CLA updated. Thanks @apeabody

davidholsgrove avatar May 30 '25 05:05 davidholsgrove

/gcbrun

apeabody avatar May 30 '25 22:05 apeabody

Hi @davidholsgrove - From the CI Tests:

        	Error:      	Received unexpected error:
        	            	FatalError{Underlying: error while running command: exit status 1; 
        	            	Error: Error updating LinuxNodeConfig: googleapi: Error 400: Setting cgroup_mode to v1 is not allowed for clusters created with version 1.32.3-gke.1927009.
        	            	Details:
        	            	[
        	            	  {
        	            	    "@type": "type.googleapis.com/google.rpc.RequestInfo",
        	            	    "requestId": "0x261042fcc5e9d554"
        	            	  }
        	            	]
        	            	, badRequest
        	            	
        	            	  with module.example.module.gke.google_container_cluster.primary,
        	            	  on ../../../modules/beta-public-cluster/cluster.tf line 22, in resource "google_container_cluster" "primary":
        	            	  22: resource "google_container_cluster" "primary" {
        	            	}
        	Test:       	TestNodePool

apeabody avatar May 30 '25 22:05 apeabody

Hi @apeabody

I think the testcase was updated in https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/pull/2224 to downgrade from CGROUP_MODE_V2 to CGROUP_MODE_V1.

I'm not sure from https://cloud.google.com/kubernetes-engine/docs/how-to/migrate-cgroupv2#transition-plan what the default cgroup mode is for clusters created with 1.32, but the error message in the testcases suggests it's now V2, and the API is rejecting an attempt to set it to V1?

I don't have access to the Cloud Build logs to see further than the quote in your reply, apologies.

If the testcase needs to revert this diff, I can include it in my PR?

davidholsgrove avatar Jun 02 '25 01:06 davidholsgrove

Hi @davidholsgrove!

Yes, the API is rejecting it when specifically added to a new NAP; full removal of CGROUP_MODE_V1 isn't expected until v1.35.

I suggest swapping the values of pool-01 and all (and their test assets). That should fix this test while maintaining test coverage. https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/main/examples/node_pool/main.tf#L165
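
For illustration, the suggested swap in examples/node_pool/main.tf might look like this (values assumed from the surrounding discussion, not the actual diff):

```hcl
node_pools_cgroup_mode = {
  all     = "CGROUP_MODE_V2" # was CGROUP_MODE_V1; matches the 1.32 default
  pool-01 = "CGROUP_MODE_V1" # was CGROUP_MODE_V2; keeps V1 test coverage
}
```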

apeabody avatar Jun 02 '25 17:06 apeabody

/gcbrun

davidholsgrove avatar Jun 03 '25 03:06 davidholsgrove

Thanks @apeabody - testcases updated to flip the default to V2, with the exception of pool-01, to continue coverage. I don't have permissions to trigger a retest; could you run that for me and let me know if you need anything else?

davidholsgrove avatar Jun 03 '25 03:06 davidholsgrove

/gcbrun

apeabody avatar Jun 03 '25 16:06 apeabody

I'll re-run the test:

        	Error:      	Received unexpected error:
        	            	FatalError{Underlying: error while running command: exit status 1; 
        	            	Error: Error updating LinuxNodeConfig: googleapi: Error 400: Cluster is running incompatible operation operation-1748969188467-eb201d3b-0103-4996-89d9-a6cf6fa5ed15.
        	            	Details:
        	            	[
        	            	  {
        	            	    "@type": "type.googleapis.com/google.rpc.RequestInfo",
        	            	    "requestId": "0x7d2092479dc8c224"
        	            	  },
        	            	  {
        	            	    "@type": "type.googleapis.com/google.rpc.ErrorInfo",
        	            	    "domain": "container.googleapis.com",
        	            	    "reason": "CLUSTER_ALREADY_HAS_OPERATION"
        	            	  }
        	            	]
        	            	, failedPrecondition
        	            	
        	            	  with module.example.module.gke.google_container_cluster.primary,
        	            	  on ../../../modules/beta-public-cluster/cluster.tf line 22, in resource "google_container_cluster" "primary":
        	            	  22: resource "google_container_cluster" "primary" {
        	            	}
        	Test:       	TestNodePool

apeabody avatar Jun 03 '25 17:06 apeabody

Thanks @apeabody - I'm not sure how to resolve that failure without access to run the tests / dig deeper. Looks unrelated to the change in question though?

davidholsgrove avatar Jun 04 '25 03:06 davidholsgrove

Looks like we are still seeing it in the re-test; let me check a few PRs that don't include this change.

apeabody avatar Jun 04 '25 16:06 apeabody

Hi @davidholsgrove - I've checked, and this error seems to be specific to this change. I'd suggest (temporarily) removing pool-01 = "CGROUP_MODE_V1", as I'm wondering whether, once linux_node_config is first set to "CGROUP_MODE_V1" by all, pool-01 can't later be set to "CGROUP_MODE_V1"?
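
For illustration, the temporary shape of the map might be (assumed, not the actual diff):

```hcl
node_pools_cgroup_mode = {
  all = "CGROUP_MODE_V2"
  # pool-01 = "CGROUP_MODE_V1"  # temporarily removed while isolating the failure
}
```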

apeabody avatar Jun 04 '25 18:06 apeabody

Thanks @apeabody - rebased and removed the explicit setting of pool-01 = "CGROUP_MODE_V1" from the testcase. Could you trigger again?

davidholsgrove avatar Jun 09 '25 00:06 davidholsgrove

/gcbrun

apeabody avatar Jun 09 '25 16:06 apeabody

Hi @davidholsgrove - Yeah, it's still having the issue:

Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: Error: Error updating LinuxNodeConfig: googleapi: Error 400: Cluster is running incompatible operation operation-1749488222583-0c77b4da-5a67-4e0b-94ab-ad5f6a673820.
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: Details:
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: [
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:   {
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:     "@type": "type.googleapis.com/google.rpc.RequestInfo",
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:     "requestId": "0x663f464965d3f2fb"
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:   },
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:   {
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:     "@type": "type.googleapis.com/google.rpc.ErrorInfo",
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:     "domain": "container.googleapis.com",
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:     "reason": "CLUSTER_ALREADY_HAS_OPERATION"
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:   }
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: ]
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: , failedPrecondition
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: 
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:   with module.example.module.gke.google_container_cluster.primary,
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:   on ../../../modules/beta-public-cluster/cluster.tf line 22, in resource "google_container_cluster" "primary":
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:   22: resource "google_container_cluster" "primary" {
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: 
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z retry.go:99: Returning due to fatal error: FatalError{Underlying: error while running command: exit status 1; 

Looking at your change, another possibility is that the addition of local.node_pools_cgroup_mode != null to the node_pool_auto_config gating condition is also resulting in the network_tags blocks being created. I would suggest making network_tags a dynamic block as well, gated on var.cluster_autoscaling.enabled && (length(var.network_tags) > 0 || var.add_cluster_firewall_rules), to eliminate that possibility.
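
A sketch of that suggestion, assuming the module's existing var.cluster_autoscaling, var.network_tags, and var.add_cluster_firewall_rules (illustrative, not the exact module code):

```hcl
node_pool_auto_config {
  dynamic "network_tags" {
    # Emit the block only when NAP is enabled and there are tags to manage,
    # so a cgroup-only change no longer forces a network_tags block into the plan.
    for_each = var.cluster_autoscaling.enabled && (length(var.network_tags) > 0 || var.add_cluster_firewall_rules) ? [true] : []
    content {
      tags = var.network_tags
    }
  }
}
```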

apeabody avatar Jun 09 '25 17:06 apeabody

Hi @apeabody, apologies for the delay - thanks for the suggestion; yes, that makes sense to me. I've rebased to include the latest v37 / main. If you could kick off the tests again, that would be great. Thanks, David

davidholsgrove avatar Jun 18 '25 23:06 davidholsgrove

/gcbrun

apeabody avatar Jun 18 '25 23:06 apeabody

Great, that got further - this is from the test itself:

Step #55 - "verify node-pool-local":         	Error:      	Not equal: 
Step #55 - "verify node-pool-local":         	            	expected: "EFFECTIVE_CGROUP_MODE_V1"
Step #55 - "verify node-pool-local":         	            	actual  : "EFFECTIVE_CGROUP_MODE_V2"
Step #55 - "verify node-pool-local":         	            	
Step #55 - "verify node-pool-local":         	            	Diff:
Step #55 - "verify node-pool-local":         	            	--- Expected
Step #55 - "verify node-pool-local":         	            	+++ Actual
Step #55 - "verify node-pool-local":         	            	@@ -1 +1 @@
Step #55 - "verify node-pool-local":         	            	-EFFECTIVE_CGROUP_MODE_V1
Step #55 - "verify node-pool-local":         	            	+EFFECTIVE_CGROUP_MODE_V2
Step #55 - "verify node-pool-local":         	Test:       	TestNodePool
Step #55 - "verify node-pool-local":         	Messages:   	For node "pool-01" path "config.effectiveCgroupMode" expected "EFFECTIVE_CGROUP_MODE_V2" to match fixture "EFFECTIVE_CGROUP_MODE_V1"

apeabody avatar Jun 19 '25 00:06 apeabody

Thanks again @apeabody - it's hard to guess the test logic here without being able to trigger the tests or see the logs.

I've updated node_pools_cgroup_mode in examples/node_pool/main.tf to explicitly set pool-01 to CGROUP_MODE_V1 to maintain that test coverage.

Could we try again?

davidholsgrove avatar Jun 19 '25 03:06 davidholsgrove

/gcbrun

apeabody avatar Jun 20 '25 17:06 apeabody

Yes, this indicates that the observed actual value does not match the test data here: https://github.com/davidholsgrove/terraform-google-kubernetes-engine/blob/nap-cgroup-mode/test/integration/node_pool/testdata/TestNodePool.json

apeabody avatar Jun 20 '25 17:06 apeabody

OK, this error is from the cluster, rather than a specific nodepool:

Step #55 - "verify node-pool-local":         	Error:      	Not equal: 
Step #55 - "verify node-pool-local":         	            	expected: "EFFECTIVE_CGROUP_MODE_V1"
Step #55 - "verify node-pool-local":         	            	actual  : "EFFECTIVE_CGROUP_MODE_V2"
Step #55 - "verify node-pool-local":         	            	
Step #55 - "verify node-pool-local":         	            	Diff:
Step #55 - "verify node-pool-local":         	            	--- Expected
Step #55 - "verify node-pool-local":         	            	+++ Actual
Step #55 - "verify node-pool-local":         	            	@@ -1 +1 @@
Step #55 - "verify node-pool-local":         	            	-EFFECTIVE_CGROUP_MODE_V1
Step #55 - "verify node-pool-local":         	            	+EFFECTIVE_CGROUP_MODE_V2
Step #55 - "verify node-pool-local":         	Test:       	TestNodePool
Step #55 - "verify node-pool-local":         	Messages:   	For path "nodeConfig.effectiveCgroupMode" expected "EFFECTIVE_CGROUP_MODE_V2" to match fixture "EFFECTIVE_CGROUP_MODE_V1"

apeabody avatar Jun 20 '25 21:06 apeabody

Thanks @apeabody - returning to this PR, as we still see the same perma-diff with the latest releases of the module.

I think I've updated the expected result for the cluster nodepool effectiveCgroupMode. Could you rerun the tests?

davidholsgrove avatar Sep 05 '25 04:09 davidholsgrove

/gcbrun

apeabody avatar Sep 05 '25 16:09 apeabody