aws-eks: Constructing an `AwsAuth` can completely break a cluster
### Describe the bug

We recently added a new `AwsAuth` construct to our CDK app. We were shocked to find that this broke all our node groups: the new `AwsAuth` construct overwrote the `aws-auth` ConfigMap managed by CDK's internal `AwsAuth` construct.
### Expected Behavior

I expect CDK to enforce that there is exactly one `AwsAuth` construct per cluster.
### Current Behavior

As mentioned above, the new `AwsAuth` construct clobbered the `aws-auth` ConfigMap managed by the cluster's internal `AwsAuth` construct.
### Reproduction Steps

Here's a simple CDK app that declares a cluster, a nodegroup for that cluster, and an explicit `AwsAuth` for that cluster:
```python
# demo.py
import pprint

import aws_cdk as cdk
from aws_cdk import aws_ec2
from aws_cdk import aws_eks
from aws_cdk.assertions import Template


class DemoStack(cdk.Stack):
    def __init__(self, scope):
        super().__init__(scope, "DemoStack")
        cluster = aws_eks.FargateCluster(
            scope=self,
            id="demo-cluster",
            cluster_name="demo-cluster",
            version=aws_eks.KubernetesVersion.V1_24,
        )
        cluster.add_nodegroup_capacity(
            'demo-nodegroup',
            nodegroup_name='demo-nodegroup',
            instance_types=[aws_ec2.InstanceType('c6a.xlarge')],
        )
        aws_eks.AwsAuth(self, "problematic-aws-auth", cluster=cluster)


app = cdk.App()
demo_stack = DemoStack(app)
template = Template.from_stack(demo_stack)
custom_resources = template.find_resources("Custom::AWSCDK-EKS-KubernetesResource")
pprint.pprint(custom_resources)
```
Note how this produces two `Custom::AWSCDK-EKS-KubernetesResource` resources. These two resources will happily step on each other's toes (because of the (unfortunately necessary) `overwrite: true`):
```console
$ python demo.py
{'democlusterAwsAuthmanifest59F406F8': {'DeletionPolicy': 'Delete',
'DependsOn': ['democlusterKubectlReadyBarrier409E0356'],
'Properties': {'ClusterName': {'Ref': 'democlusterE73BD733'},
'Manifest': {'Fn::Join': ['',
['[{"apiVersion":"v1","kind":"ConfigMap","metadata":{"name":"aws-auth","namespace":"kube-system","labels":{"aws.cdk.eks/prune-c8e20f52ee1502bfdc03c85de68fda3a9cbb05b58d":""}},"data":{"mapRoles":"[{\\"rolearn\\":\\"',
{'Fn::GetAtt': ['democlusterfargateprofiledefaultPodExecutionRoleB482677B',
'Arn']},
'\\",\\"username\\":\\"system:node:{{SessionName}}\\",\\"groups\\":[\\"system:bootstrappers\\",\\"system:nodes\\",\\"system:node-proxier\\"]},{\\"rolearn\\":\\"',
{'Fn::GetAtt': ['democlusterNodegroupdemonodegroupNodeGroupRole6A5CDA1C',
'Arn']},
'\\",\\"username\\":\\"system:node:{{EC2PrivateDNSName}}\\",\\"groups\\":[\\"system:bootstrappers\\",\\"system:nodes\\"]}]","mapUsers":"[]","mapAccounts":"[]"}}]']]},
'Overwrite': True,
'PruneLabel': 'aws.cdk.eks/prune-c8e20f52ee1502bfdc03c85de68fda3a9cbb05b58d',
'RoleArn': {'Fn::GetAtt': ['democlusterCreationRole733E3675',
'Arn']},
'ServiceToken': {'Fn::GetAtt': ['awscdkawseksKubectlProviderNestedStackawscdkawseksKubectlProviderNestedStackResourceA7AEBA6B',
'Outputs.DemoStackawscdkawseksKubectlProviderframeworkonEvent705EBCF2Arn']}},
'Type': 'Custom::AWSCDK-EKS-KubernetesResource',
'UpdateReplacePolicy': 'Delete'},
'problematicawsauthmanifestEB215835': {'DeletionPolicy': 'Delete',
'DependsOn': ['democlusterKubectlReadyBarrier409E0356'],
'Properties': {'ClusterName': {'Ref': 'democlusterE73BD733'},
'Manifest': '[{"apiVersion":"v1","kind":"ConfigMap","metadata":{"name":"aws-auth","namespace":"kube-system","labels":{"aws.cdk.eks/prune-c8993b9de23d4ff03256c14b261e46b123d26fe727":""}},"data":{"mapRoles":"[]","mapUsers":"[]","mapAccounts":"[]"}}]',
'Overwrite': True,
'PruneLabel': 'aws.cdk.eks/prune-c8993b9de23d4ff03256c14b261e46b123d26fe727',
'RoleArn': {'Fn::GetAtt': ['democlusterCreationRole733E3675',
'Arn']},
'ServiceToken': {'Fn::GetAtt': ['awscdkawseksKubectlProviderNestedStackawscdkawseksKubectlProviderNestedStackResourceA7AEBA6B',
'Outputs.DemoStackawscdkawseksKubectlProviderframeworkonEvent705EBCF2Arn']}},
'Type': 'Custom::AWSCDK-EKS-KubernetesResource',
'UpdateReplacePolicy': 'Delete'}}
```
### Possible Solution

I have read through https://github.com/aws/aws-cdk/issues/19218, which describes a similar issue (the `aws-auth` ConfigMap getting unexpectedly clobbered), but that issue is about using multiple tools to manage a cluster. In the scenario described here, we're using exactly one CDK app to manage our cluster, and IMO CDK should protect us against this.
Some ideas for how to fix this:

- When constructing an `AwsAuth` for a cluster, check if there's already an `AwsAuth` instantiated for that cluster. If there is one already, crash.
- Don't expose the `AwsAuth` construct at all; just keep it as an internal implementation detail of `Cluster`. As I understand things, it's never something you'd want to use directly when you've declared your cluster in CDK (instead you'd want to interact with the underlying `cluster.awsAuth` attribute as documented here).
- (What we're going to do until this is addressed in CDK itself:) add a post-synthesis step that loops over all `AwsAuth` constructs in the tree and asserts that they're for different clusters.
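To illustrate the shape of that post-synthesis check, here's a hedged sketch that operates on the synthesized template rather than the construct tree: it takes the dict returned by `Template.find_resources("Custom::AWSCDK-EKS-KubernetesResource")` (the same shape as the `pprint` output above) and reports any cluster whose `aws-auth` ConfigMap is managed by more than one custom resource. The function name and the substring match on the serialized manifest are my own choices for illustration, not CDK API.

```python
import json


def find_duplicate_aws_auth(custom_resources: dict) -> dict:
    """Return {cluster_key: [logical_ids]} for clusters whose aws-auth
    ConfigMap is managed by more than one custom resource.

    `custom_resources` has the shape returned by
    Template.find_resources("Custom::AWSCDK-EKS-KubernetesResource").
    """
    by_cluster: dict = {}
    for logical_id, resource in custom_resources.items():
        props = resource.get("Properties", {})
        # The Manifest may be a plain JSON string or an Fn::Join intrinsic;
        # serializing the whole value lets one substring check cover both.
        manifest = json.dumps(props.get("Manifest", ""))
        if '\\"name\\":\\"aws-auth\\"' in manifest:
            # Key on the (possibly intrinsic) ClusterName value.
            cluster_key = json.dumps(props.get("ClusterName"), sort_keys=True)
            by_cluster.setdefault(cluster_key, []).append(logical_id)
    return {k: ids for k, ids in by_cluster.items() if len(ids) > 1}
```

Run against the synth output shown above, this would flag the two logical IDs that both target the `aws-auth` ConfigMap of `democlusterE73BD733`.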
### Additional Information/Context

No response

### CDK CLI Version

2.109.0 (build 941dc16)

### Framework Version

No response

### Node.js Version

v18.15.0

### OS

N/A (this happens both on macOS and Linux)

### Language

Python

### Language Version

Python (3.10.6)

### Other information

No response
Thanks for the report.
Actually, every `Cluster` comes with a built-in `AwsAuth`, so you don't need to create a new one. Instead, update the existing one as described in the docs:

```ts
cluster.awsAuth.addMastersRole(role);
```
Is there any reason you need to create another one for the existing cluster?
> Is there any reason you need to create another one for the existing cluster?
@pahud, no. For more context: we recently had a complete outage of one of our clusters because we deployed a diff adding a new `AwsAuth` construct to our CDK app. That definitely wasn't what we meant to do (as you said, we should have used `cluster.awsAuth.addMastersRole`), but mistakes happen, and IMO this is a very dangerous footgun for CDK to have.
I totally understand that it's always going to be possible for people to do dangerous things with an IaC tool (we could delete our cluster, for instance), but we are diligent about reviewing all code changes to our CDK app, and nobody at our company recognized that this diff would cause a cluster outage:
```diff
--- before
+++ after
@@ -14,3 +14,5 @@
             nodegroup_name='demo-nodegroup',
             instance_types=[aws_ec2.InstanceType('c6a.xlarge')],
         )
+
+        aws_eks.AwsAuth(self, "problematic-aws-auth", cluster=cluster)
```
Running `cdk diff` doesn't help either; you just see the addition of a custom resource, with no clue that it's going to do something very dangerous:
```console
$ cdk diff
Stack DemoStack
Resources
[+] Custom::AWSCDK-EKS-KubernetesResource problematic-aws-auth/manifest/Resource problematicawsauthmanifestEB215835
```
In other words: it's totally OK for CDK to enable us to do dangerous things, but (IMO) those dangerous things need to at least *look* dangerous. For example, deleting a cluster very obviously looks dangerous, both in the code diff and in the `cdk diff`. If the `AwsAuth` class were instead renamed to something like `DangerouslyClobberAwsAuthInYourClusterYouProbablyDoNotWantToConstructMeYourself`, then we probably wouldn't have had our outage =)
That all said, I still think it would be better for CDK to ensure that there is at most one `AwsAuth` construct per cluster. I think this would be a pretty straightforward change that would remove this footgun for other people. I'd be happy to send in such a PR if that sounds acceptable to you?
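To show the shape of that guardrail (the real fix would of course live inside the `AwsAuth` constructor in the aws-eks module's TypeScript source), here is a hedged Python sketch: a module-level registry keyed on the cluster object, so that a second `AwsAuth` for the same cluster fails fast at synth time. The function name and registry are hypothetical, not CDK API.

```python
# Hypothetical sketch of the proposed guardrail; the actual change would be
# a check inside the AwsAuth constructor itself.
_clusters_with_aws_auth: set = set()


def register_aws_auth(cluster: object) -> None:
    """Fail fast if an AwsAuth has already been created for `cluster`."""
    key = id(cluster)  # one entry per live cluster object
    if key in _clusters_with_aws_auth:
        raise RuntimeError(
            "An AwsAuth construct already exists for this cluster; "
            "mutate the cluster's built-in awsAuth attribute instead."
        )
    _clusters_with_aws_auth.add(key)
```

The first registration for a cluster succeeds and records the cluster; any later registration for the same cluster raises instead of silently synthesizing a second, clobbering `aws-auth` manifest.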
@pahud, any update here?
@jfly Totally agreed with you.
Yes, it sounds good to have a PR as the guardrail. Feel free to submit your PR for that. Thanks.
I just learned that this `aws-auth` ConfigMap is being replaced with the new access entries and access policies mechanism: https://aws.amazon.com/blogs/containers/a-deep-dive-into-simplified-amazon-eks-access-management-controls/. We're going to put our effort into migrating to that, which should avoid the issue described here.
Yes, it's related to https://github.com/aws/aws-cdk/issues/28588
I just got hit by this issue and lost a full day of engineering for an entire team: we had to repair all of the test clusters this was deployed to before we realised it had broken everything.
Minimum viable fix: please add a warning in the documentation saying "Do not instantiate this class directly. Use the `ICluster.awsAuth` property." until the deprecation of the public constructor can be prioritized.
Of note: I was using CDK Blueprints for EKS to build the cluster in one stack, and my Google search for how to add `aws-auth` permissions for my Lambda in a related stack led me to the `AwsAuth` construct, rather than to the instance of the class on `ICluster`.