aws_eks: Error creating `FargateCluster` in `cn-north-1` due to `CoreDnsComputeTypePatch` creation error
Describe the bug
Towards the end of a FargateCluster deployment, several resources fail to create, resulting in a rollback/delete.
Expected Behavior
I expect the cluster to be created smoothly, as I believe it is supported in this region, and the same configuration has deployed successfully in us-east-1.
Current Behavior
When creating a resource with a logical ID k8sclusterCoreDnsComputeTypePatch2EEF5C89, it fails with the following status reason:
CloudFormation did not receive a response from your Custom Resource. Please check your logs for requestId [4af278ec-eb20-4abc-8d38-4e76661d6112]. If you are using the Python cfn-response module, you may need to update your Lambda function code so that CloudFormation can attach the updated version.
Creation of the last remaining resources then fails as well because this one fails.
Reproduction Steps
Cluster creation code:
cluster = eks.FargateCluster(
    self,
    "k8s-cluster",
    cluster_name=f"k8s-{stage_name}",
    version=eks.KubernetesVersion.V1_26,
    vpc=self.vpc,
    vpc_subnets=[subnet_selection],
    cluster_logging=[
        eks.ClusterLoggingTypes.API,
        eks.ClusterLoggingTypes.AUTHENTICATOR,
        eks.ClusterLoggingTypes.SCHEDULER,
    ],
    kubectl_layer=lambda_layer_kubectl_v26.KubectlV26Layer(
        self, "kubectl-v26-layer"
    ),
    masters_role=masters_role,
)
Possible Solution
Possibly the deployment is trying to apply a patch that requires global internet access and would need to use a mirror in China? Other than that, I'm not sure why this would fail only in China.
Additional Information/Context
No response
CDK CLI Version
2.86.0
Framework Version
No response
Node.js Version
18.30
OS
Mac OS X
Language
Python
Language Version
3.9.15
Other information
No response
Yes, I can reproduce this in cn-north-1, but I can't figure out the root cause off the top of my head.
Making this a p1. Will update here if I find anything.
Looks like it's failing to install this in cn-north-1
https://github.com/aws/aws-cdk/blob/300989a675bd9fc9c2829c5115efe34e753e0976/packages/aws-cdk-lib/aws-eks/lib/cluster.ts#L2019C20-L2025
ref: https://docs.aws.amazon.com/eks/latest/userguide/fargate-getting-started.html#fargate-gs-coredns
I think I figured out the reason, or at least the solution. When I create this cluster, I use a subnet selection that includes the availability zones that have EKS capacity, cn-north-1a and cn-north-1b. I had a couple of completely private subnets (no VPC endpoints) in those AZs. To create the cluster, CDK creates an ENI that uses the fully private subnets, and I'm guessing the ClusterAwsAuthmanifest and ClusterCoreDNSComputeTypePatch need internet access, as they seem to time out.
When I change the subnet selection to explicitly be subnets that have VPC endpoints, I am able to finish creating the cluster without a problem.
If these patches are indeed requiring internet access, would it be possible to give a warning at synth time stating that the selected subnets may not be suitable? I believe I've seen warnings like this before when creating another resource that I passed subnets into. Or otherwise, documenting that somewhere would be helpful.
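To make the request above concrete, here is a rough sketch of the kind of check I have in mind. It is plain Python rather than CDK code, and everything in it is hypothetical: `SubnetInfo`, `subnet_warnings`, and especially `ASSUMED_REQUIRED_ENDPOINTS` are names I made up for illustration. I don't know which endpoints the patches actually need, so the set below is only a placeholder.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Iterable, List

# ASSUMPTION: placeholder set of endpoint services. The real requirements of
# the CoreDNS patch / aws-auth manifest handlers are not established here.
ASSUMED_REQUIRED_ENDPOINTS: FrozenSet[str] = frozenset({"eks", "sts", "s3"})


@dataclass(frozen=True)
class SubnetInfo:
    """Hypothetical summary of a subnet selected for the cluster."""

    subnet_id: str
    # True if the subnet has a default route via a NAT or internet gateway.
    has_internet_route: bool
    # Short names of the VPC endpoint services reachable from the subnet.
    endpoint_services: FrozenSet[str] = field(default_factory=frozenset)


def subnet_warnings(
    subnets: Iterable[SubnetInfo],
    required: FrozenSet[str] = ASSUMED_REQUIRED_ENDPOINTS,
) -> List[str]:
    """Return one warning per subnet that has neither an internet route
    nor VPC endpoints for every service in `required`."""
    warnings: List[str] = []
    for subnet in subnets:
        if subnet.has_internet_route:
            continue  # outbound access exists, nothing to warn about
        missing = sorted(required - subnet.endpoint_services)
        if missing:
            warnings.append(
                f"{subnet.subnet_id}: no internet route and no VPC endpoint "
                f"for {', '.join(missing)}; cluster patches may time out"
            )
    return warnings


if __name__ == "__main__":
    selected = [
        SubnetInfo("subnet-isolated", has_internet_route=False),
        SubnetInfo("subnet-endpoints", False, frozenset({"eks", "sts", "s3"})),
        SubnetInfo("subnet-nat", has_internet_route=True),
    ]
    for message in subnet_warnings(selected):
        print(message)
```

With the sample data, only the fully isolated subnet is flagged. In CDK itself the messages could presumably be surfaced as synth-time warnings, but how that would be wired in is for the maintainers to decide.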
+1, I am having this issue with private subnets in us-east-1 and us-west-2, so it is likely not region-specific. We use private subnets with VPC endpoints for most services, and it still does not work.
I agree this is not region-specific. I think CoreDNS on EKS Fargate currently needs a public subnet for the patch to be applied. This is a bug that should be fixed.
I agree that it is not region-specific, and I support the idea from @Howlla. I created a NAT gateway in the public subnet of the VPC and adjusted my route table to point to the NAT gateway ID, after which I was able to avoid this error. This might help someone else too.
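For anyone who wants to verify the route table part before redeploying, a small helper along these lines may be useful. It is only a sketch: `default_route_target` is a name made up here, and it assumes its input is one route table dictionary in the shape returned by the EC2 DescribeRouteTables API (a `Routes` list whose entries carry `DestinationCidrBlock`, `GatewayId`, `NatGatewayId`, and `State`). It only looks at the IPv4 default route.

```python
from typing import Any, Dict

DEFAULT_IPV4_ROUTE = "0.0.0.0/0"


def default_route_target(route_table: Dict[str, Any]) -> str:
    """Classify the active IPv4 default route of a route table.

    Returns "nat" for a NAT gateway, "igw" for an internet gateway,
    and "none" if there is no usable default route.
    """
    for route in route_table.get("Routes", []):
        if route.get("DestinationCidrBlock") != DEFAULT_IPV4_ROUTE:
            continue  # local routes and more specific routes are ignored
        if route.get("State") == "blackhole":
            continue  # the target was deleted, so the route is unusable
        if route.get("NatGatewayId"):
            return "nat"
        if route.get("GatewayId", "").startswith("igw-"):
            return "igw"
    return "none"


if __name__ == "__main__":
    # Hypothetical route table of a private subnet after adding the NAT route.
    private_table = {
        "Routes": [
            {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
            {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0123456789abcdef0", "State": "active"},
        ]
    }
    print(default_route_target(private_table))
```

A result of "none" for the subnets passed to the cluster would match the fully private setup described earlier in this thread, where the patch resources timed out.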