Windows configuration in aws-auth ConfigMap gets corrupted
Describe the bug
When the aws-auth ConfigMap gets updated, the Windows-specific eks:kube-proxy-windows group mapping may be removed and existing Windows node groups end up in an unhealthy/degraded state.
Expected Behavior
If a cluster is configured to run Windows nodes and the eks:kube-proxy-windows group mapping exists, any update to the aws-auth ConfigMap must not overwrite the existing group mapping.
Current Behavior
A CDK deployment may overwrite the aws-auth configuration, and so may any other AddOn.
Reproduction Steps
It's pretty simple to create a cluster with an unhealthy Windows node group:
- Use version 1.13.1 or 1.14.0 of eks-blueprints
- Create a new EKS cluster using version 1.28 or 1.29
- Use the most recent add-on/Helm chart versions (see details)
- Configure a Linux and a Windows node group
- Configure one platform team
- Deploy all at once (a sketch of such a setup follows below)
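For reference, a minimal sketch of such a setup (untested; the Kubernetes version, AMI types, instance types, stack name, and role ARN are placeholders, and the add-on list is trimmed to the VPC CNI entry that matters here):

import * as cdk from 'aws-cdk-lib';
import * as blueprints from '@aws-quickstart/eks-blueprints';
import { KubernetesVersion, NodegroupAmiType } from 'aws-cdk-lib/aws-eks';
import { InstanceType } from 'aws-cdk-lib/aws-ec2';

const app = new cdk.App();

// One Linux and one Windows managed node group, deployed together with
// a single platform team.
const clusterProvider = new blueprints.GenericClusterProvider({
  managedNodeGroups: [
    {
      id: 'linux',
      amiType: NodegroupAmiType.AL2_X86_64,
      instanceTypes: [new InstanceType('m5.large')],
    },
    {
      id: 'windows',
      amiType: NodegroupAmiType.WINDOWS_CORE_2022_X86_64,
      instanceTypes: [new InstanceType('m5.large')],
    },
  ],
});

blueprints.EksBlueprint.builder()
  .version(KubernetesVersion.V1_29)
  .clusterProvider(clusterProvider)
  .addOns(new blueprints.addons.VpcCniAddOn({ enableWindowsIpam: true }))
  .teams(new blueprints.PlatformTeam({
    name: 'platform',
    userRoleArn: 'arn:aws:iam::111122223333:role/PlatformAdmin', // placeholder
  }))
  .build(app, 'windows-aws-auth-repro');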
Luckily, it's pretty simple to fix it again:
- Deploy a new Windows node group (the aws-auth configuration is updated with the eks:kube-proxy-windows mapping)
Sadly, it's pretty simple to break it again:
- Deploy a new Team configuration (anything that updates the aws-auth configuration will remove/overwrite the eks:kube-proxy-windows mapping)
Luckily, it's pretty simple to fix it again:
- Deploy a new Windows node group
… you can continue this forever …
Possible Solution
I don't know. It's a nasty problem: one AddOn knows about the Windows-specific configuration, while other AddOns naively overwrite the aws-auth configuration.
Using CloudWatch Logs Insights, you can identify the API request that "fixes" aws-auth and re-adds the needed eks:kube-proxy-windows mapping:
fields @timestamp, @message, @logStream, @log
| filter @message like /aws-auth/
| filter @message like /kube-proxy-windows/
| sort @timestamp desc
| limit 1000
I don't know the EKS/Kubernetes internals, but maybe it's somehow possible to "trigger" the fixing update without having to create a new Windows node group.
Additional Information/Context
AddOn Configuration
import * as blueprints from '@aws-quickstart/eks-blueprints';
import { ManagedPolicy } from 'aws-cdk-lib/aws-iam';

const addOns: Array<blueprints.ClusterAddOn> = [
new blueprints.addons.KubeProxyAddOn('v1.29.1-eksbuild.2'),
new blueprints.addons.VpcCniAddOn({
version: 'v1.16.4-eksbuild.2',
enablePodEni: true,
enableWindowsIpam: true,
serviceAccountPolicies: [ManagedPolicy.fromAwsManagedPolicyName('AmazonEKS_CNI_Policy')],
}),
new blueprints.addons.CoreDnsAddOn('v1.11.1-eksbuild.6'),
new blueprints.addons.AwsLoadBalancerControllerAddOn({
version: '1.7.1',
enableWafv2: true,
}),
new blueprints.addons.ExternalDnsAddOn({
version: '1.14.3',
hostedZoneResources: [blueprints.GlobalResources.HostedZone],
values: {
policy: 'sync',
},
}),
new blueprints.addons.CertManagerAddOn({
version: '1.14.4',
createNamespace: false,
}),
new blueprints.addons.EbsCsiDriverAddOn(),
new blueprints.addons.SecretsStoreAddOn(),
];
The "broken" aws-auth results in an inline JSON string:
Creating a new node group recreates aws-auth in nicely formatted YAML:
As soon as the node is ready, the aws-auth config is fixed:
After a few minutes, both Windows node groups are healthy again …
CDK CLI Version
2.115.0
EKS Blueprints Version
1.14.0
Node.js Version
v18.17.1
Environment details (OS name and version, etc.)
macOS
Other information
Thanks to EKS/CloudFormation update durations, this is horrible to debug.
@sbstjn your CDK version is set to 2.115 - is that correct? You should have received an error/warning when upgrading your version of the blueprints, as we pin the peerDependency to the exact version.
Please upgrade to 2.132.0.
I believe the issue is caused by the enable-windows-ipam setting of the VPC CNI, which handles the modification of the aws-auth config map internally. That change bypasses the CDK; in other words, CDK is not aware of it.
CDK's current behavior for the aws-auth config map is to accumulate all modifications to the ConfigMap and apply them as a single document (as there is no patch command for it).
Potentially, we can look for an option to add this mapping in the blueprint with, let's say, a no-op team that only does the Windows mapping part.
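A rough, untested sketch of what such a no-op team could look like (it assumes the Team interface's setup(clusterInfo) hook and access to the underlying eks.Cluster via clusterInfo.cluster; the role ARN is a placeholder for the Windows node group's instance role):

import * as blueprints from '@aws-quickstart/eks-blueprints';
import * as iam from 'aws-cdk-lib/aws-iam';

// "No-op" team whose only job is to re-add the Windows group mapping
// to aws-auth on every deployment.
class WindowsAuthTeam implements blueprints.Team {
  readonly name = 'windows-auth';

  setup(clusterInfo: blueprints.ClusterInfo): void {
    const nodeRole = iam.Role.fromRoleArn(
      clusterInfo.cluster.stack,
      'WindowsNodeRole',
      'arn:aws:iam::111122223333:role/WindowsNodeInstanceRole', // placeholder
    );

    clusterInfo.cluster.awsAuth.addRoleMapping(nodeRole, {
      username: 'system:node:{{EC2PrivateDNSName}}',
      groups: ['system:bootstrappers', 'system:nodes', 'eks:kube-proxy-windows'],
    });
  }
}

Registered via .teams(new WindowsAuthTeam()) on the builder, the mapping would then be part of the single aws-auth document CDK applies on every deploy.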
I guess it's just a mismatch of "cdk version" and "yarn cdk version." It's been chaos with this problem. I will check later, but it should not affect this.
Having a generic "appendWindowsMapping: true/false" option for the aws-auth generation could also do the trick. I'd rather have an explicit configuration than the current "magic update in the background."
In the worst case, I'd write a custom Lambda function and hook it to an event in EventBridge to overwrite the config after every potential change/deployment.
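Roughly, the wiring for that could look like this (only a sketch; the handler that talks to the Kubernetes API and re-applies the mapping is not shown, and the asset path is a placeholder):

import { Duration } from 'aws-cdk-lib';
import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { Construct } from 'constructs';

// React to CloudFormation stack status changes and run a function that
// re-applies the eks:kube-proxy-windows mapping to aws-auth.
export function addAwsAuthGuard(scope: Construct): void {
  const fixAwsAuth = new lambda.Function(scope, 'FixAwsAuth', {
    runtime: lambda.Runtime.NODEJS_18_X,
    handler: 'index.handler',
    timeout: Duration.minutes(1),
    code: lambda.Code.fromAsset('lambda/fix-aws-auth'), // placeholder path
  });

  new events.Rule(scope, 'OnStackStatusChange', {
    eventPattern: {
      source: ['aws.cloudformation'],
      detailType: ['CloudFormation Stack Status Change'],
    },
    targets: [new targets.LambdaFunction(fixAwsAuth)],
  });
}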
Currently, every CDK deployment carries the risk of a corrupted group mapping and a degraded node group. Of course, this could also happen with plain kubectl commands …
It's not you, it's Kubernetes. I know ☺️
@renukakrishnan Please check this.
This messed up my EKS cluster again. A change that was totally unrelated to the Windows nodes somehow updated the values for aws-auth and removed the eks:kube-proxy-windows item.
Deleting/Recreating the node group triggered the "magic automated fix in the background."
Is there a way to enforce eks:kube-proxy-windows in the config map?
@renukakrishnan Can you help with this if you have bandwidth?