Windows configuration in aws-auth ConfigMap gets corrupted
Describe the bug
When the aws-auth ConfigMap gets updated, the Windows-specific eks:kube-proxy-windows group mapping may be removed and existing Windows node groups end up in an unhealthy/degraded state.
Expected Behavior
If a cluster is configured to run Windows nodes and the eks:kube-proxy-windows group mapping exists, any update to the aws-auth ConfigMap must not overwrite the existing group mapping.
Current Behavior
A CDK deployment may overwrite the aws-auth configuration, and so may any other AddOn.
Reproduction Steps
It's pretty simple to create a cluster with an unhealthy Windows node group:
- Use version 1.13.1 or 1.14.0 of eks-blueprints
- Create a new EKS cluster using version 1.28 or 1.29
- Use the most recent add-on/Helm chart versions (see details)
- Configure a Linux and a Windows node group
- Configure one platform team
- Deploy all at once (a sketch of such a setup follows below)
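For reference, a minimal sketch of such a setup (untested; the Kubernetes version, AMI types, instance types, stack name, and role ARN are placeholders, and the add-on list is trimmed to the VPC CNI entry that matters here):

import * as cdk from 'aws-cdk-lib';
import * as blueprints from '@aws-quickstart/eks-blueprints';
import { KubernetesVersion, NodegroupAmiType } from 'aws-cdk-lib/aws-eks';
import { InstanceType } from 'aws-cdk-lib/aws-ec2';

const app = new cdk.App();

// One Linux and one Windows managed node group, deployed together with
// a single platform team.
const clusterProvider = new blueprints.GenericClusterProvider({
  managedNodeGroups: [
    {
      id: 'linux',
      amiType: NodegroupAmiType.AL2_X86_64,
      instanceTypes: [new InstanceType('m5.large')],
    },
    {
      id: 'windows',
      amiType: NodegroupAmiType.WINDOWS_CORE_2022_X86_64,
      instanceTypes: [new InstanceType('m5.large')],
    },
  ],
});

blueprints.EksBlueprint.builder()
  .version(KubernetesVersion.V1_29)
  .clusterProvider(clusterProvider)
  .addOns(new blueprints.addons.VpcCniAddOn({ enableWindowsIpam: true }))
  .teams(new blueprints.PlatformTeam({
    name: 'platform',
    userRoleArn: 'arn:aws:iam::111122223333:role/PlatformAdmin', // placeholder
  }))
  .build(app, 'windows-aws-auth-repro');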
Luckily, it's pretty simple to fix it again:
- Deploy a new Windows node group (the aws-auth configuration is updated with the eks:kube-proxy-windows mapping)
Sadly, it's pretty simple to break it again:
- Deploy a new Team configuration (anything that updates the aws-auth configuration will remove/overwrite the eks:kube-proxy-windows mapping)
Luckily, it's pretty simple to fix it again:
- Deploy a new Windows node group
… you can continue this forever …
Possible Solution
I don't know. It's a nasty problem: one AddOn knows about the Windows-specific configuration, while other AddOns naively overwrite the aws-auth configuration.
Using CloudWatch Logs Insights, you can identify the API request that "fixes" aws-auth and re-adds the needed eks:kube-proxy-windows mapping:
fields @timestamp, @message, @logStream, @log
| filter @message like /aws-auth/
| filter @message like /kube-proxy-windows/
| sort @timestamp desc
| limit 1000
I don't know the EKS/Kubernetes internals, but maybe it's somehow possible to "trigger" the fixing update without having to create a new Windows node group.
Additional Information/Context
AddOn Configuration
import * as blueprints from '@aws-quickstart/eks-blueprints';
import { ManagedPolicy } from 'aws-cdk-lib/aws-iam';

const addOns: Array<blueprints.ClusterAddOn> = [
new blueprints.addons.KubeProxyAddOn('v1.29.1-eksbuild.2'),
new blueprints.addons.VpcCniAddOn({
version: 'v1.16.4-eksbuild.2',
enablePodEni: true,
enableWindowsIpam: true,
serviceAccountPolicies: [ManagedPolicy.fromAwsManagedPolicyName('AmazonEKS_CNI_Policy')],
}),
new blueprints.addons.CoreDnsAddOn('v1.11.1-eksbuild.6'),
new blueprints.addons.AwsLoadBalancerControllerAddOn({
version: '1.7.1',
enableWafv2: true,
}),
new blueprints.addons.ExternalDnsAddOn({
version: '1.14.3',
hostedZoneResources: [blueprints.GlobalResources.HostedZone],
values: {
policy: 'sync',
},
}),
new blueprints.addons.CertManagerAddOn({
version: '1.14.4',
createNamespace: false,
}),
new blueprints.addons.EbsCsiDriverAddOn(),
new blueprints.addons.SecretsStoreAddOn(),
];
The "broken" aws-auth results in an inline JSON string:
Creating a new node group recreates aws-auth in nicely formatted YAML:
As soon as the node is ready, the aws-auth config is fixed:
After a few minutes, both Windows node groups are healthy again …
CDK CLI Version
2.115.0
EKS Blueprints Version
1.14.0
Node.js Version
v18.17.1
Environment details (OS name and version, etc.)
macOS
Other information
Thanks to EKS/CloudFormation update durations, this is horrible to debug.
@sbstjn your CDK version is set to 2.115 - is that correct? You should have received an error/warning when upgrading your version of the blueprints, as we pin the peerDependency to the exact version.
Please upgrade to 2.132.0.
I believe the issue is caused by the enable-windows-ipam setting of the VPC CNI, which handles the modification of the aws-auth config map internally. That change bypasses the CDK; in other words, CDK is not aware of it.
CDK's current behavior for the aws-auth config map is to accumulate all modifications to the ConfigMap and apply them as a single document (as there is no patch command for it).
Potentially, we can look for an option to add this mapping in the blueprint with, let's say, a no-op team that only does the Windows mapping part.
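A rough, untested sketch of what such a no-op team could look like (it assumes the Team interface's setup(clusterInfo) hook and access to the underlying eks.Cluster via clusterInfo.cluster; the role ARN is a placeholder for the Windows node group's instance role):

import * as blueprints from '@aws-quickstart/eks-blueprints';
import * as iam from 'aws-cdk-lib/aws-iam';

// "No-op" team whose only job is to re-add the Windows group mapping
// to aws-auth on every deployment.
class WindowsAuthTeam implements blueprints.Team {
  readonly name = 'windows-auth';

  setup(clusterInfo: blueprints.ClusterInfo): void {
    const nodeRole = iam.Role.fromRoleArn(
      clusterInfo.cluster.stack,
      'WindowsNodeRole',
      'arn:aws:iam::111122223333:role/WindowsNodeInstanceRole', // placeholder
    );

    clusterInfo.cluster.awsAuth.addRoleMapping(nodeRole, {
      username: 'system:node:{{EC2PrivateDNSName}}',
      groups: ['system:bootstrappers', 'system:nodes', 'eks:kube-proxy-windows'],
    });
  }
}

Registered via .teams(new WindowsAuthTeam()) on the builder, the mapping would then be part of the single aws-auth document CDK applies on every deploy.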
I guess it's just a mismatch of "cdk version" and "yarn cdk version." It's been chaos with this problem. I will check later, but it should not affect this.
Having a generic "appendWindowsMapping: true/false" option for the aws-auth generation could also do the trick. I'd rather have an explicit configuration than the current "magic update in the background."
In the worst case, I'd write a custom Lambda function and hook it to an event in EventBridge to overwrite the config after every potential change/deployment.
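Roughly, the wiring for that could look like this (only a sketch; the handler that talks to the Kubernetes API and re-applies the mapping is not shown, and the asset path is a placeholder):

import { Duration } from 'aws-cdk-lib';
import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { Construct } from 'constructs';

// React to CloudFormation stack status changes and run a function that
// re-applies the eks:kube-proxy-windows mapping to aws-auth.
export function addAwsAuthGuard(scope: Construct): void {
  const fixAwsAuth = new lambda.Function(scope, 'FixAwsAuth', {
    runtime: lambda.Runtime.NODEJS_18_X,
    handler: 'index.handler',
    timeout: Duration.minutes(1),
    code: lambda.Code.fromAsset('lambda/fix-aws-auth'), // placeholder path
  });

  new events.Rule(scope, 'OnStackStatusChange', {
    eventPattern: {
      source: ['aws.cloudformation'],
      detailType: ['CloudFormation Stack Status Change'],
    },
    targets: [new targets.LambdaFunction(fixAwsAuth)],
  });
}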
Currently, every CDK deployment carries the risk of a corrupted group mapping and a degraded node group. Of course, this could also happen with plain kubectl commands …
It's not you, it's Kubernetes. I know ☺️
@renukakrishnan Please check this.
This messed up my EKS cluster again. A change that was totally unrelated to the Windows nodes somehow updated the values for aws-auth and removed the eks:kube-proxy-windows item.
Deleting/Recreating the node group triggered the "magic automated fix in the background."
Is there a way to enforce eks:kube-proxy-windows in the config map?
@renukakrishnan Can you help with this if you have bandwidth?