
aws_eks: Error creating `FargateCluster` in `cn-north-1` due to `CoreDnsComputeTypePatch` creation error

Open jlubins opened this issue 2 years ago • 6 comments

Describe the bug

Towards the end of a FargateCluster deployment, several resources fail to create, resulting in a rollback/delete.

Expected Behavior

I expect the cluster to be created without errors. I believe Fargate clusters are supported in this region, and the same configuration deployed successfully in us-east-1.

Current Behavior

When creating a resource with a logical ID k8sclusterCoreDnsComputeTypePatch2EEF5C89, it fails with the following status reason:

CloudFormation did not receive a response from your Custom Resource. Please check your logs for requestId [4af278ec-eb20-4abc-8d38-4e76661d6112]. If you are using the Python cfn-response module, you may need to update your Lambda function code so that CloudFormation can attach the updated version.

Creation of the remaining resources that depend on it then fails as well.

Reproduction Steps

Cluster creation code:

cluster = eks.FargateCluster(
    self,
    "k8s-cluster",
    cluster_name=f"k8s-{stage_name}",
    version=eks.KubernetesVersion.V1_26,
    vpc=self.vpc,
    vpc_subnets=[subnet_selection],
    cluster_logging=[
        eks.ClusterLoggingTypes.API,
        eks.ClusterLoggingTypes.AUTHENTICATOR,
        eks.ClusterLoggingTypes.SCHEDULER,
    ],
    kubectl_layer=lambda_layer_kubectl_v26.KubectlV26Layer(
        self, "kubectl-v26-layer"
    ),
    masters_role=masters_role,
)

Possible Solution

Possibly the patch step requires access to the global internet, where in China it would need to use a mirror instead? Other than that, I'm not sure why this would fail specifically in a China region.

Additional Information/Context

No response

CDK CLI Version

2.86.0

Framework Version

No response

Node.js Version

18.30

OS

Mac OS X

Language

Python

Language Version

3.9.15

Other information

No response

jlubins avatar Aug 02 '23 21:08 jlubins

Yes, I can reproduce this in cn-north-1, but I can't figure out the root cause off the top of my head.

Making this a p1. Will update here if I find anything.


pahud avatar Aug 03 '23 00:08 pahud

Looks like it's failing to install this in cn-north-1

https://github.com/aws/aws-cdk/blob/300989a675bd9fc9c2829c5115efe34e753e0976/packages/aws-cdk-lib/aws-eks/lib/cluster.ts#L2019C20-L2025

ref: https://docs.aws.amazon.com/eks/latest/userguide/fargate-getting-started.html#fargate-gs-coredns

pahud avatar Aug 03 '23 00:08 pahud

I think I figured out the reason, or at least the solution. When I create this cluster, I use a subnet selection that includes the availability zones that have EKS capacity, cn-north-1a and cn-north-1b. I had a couple of completely private subnets (no VPC endpoints) in those AZs. To create the cluster, CDK creates an ENI that uses the fully private subnets, and I'm guessing the ClusterAwsAuthmanifest and ClusterCoreDNSComputeTypePatch need internet access, as they seem to time out.

When I change the subnet selection to explicitly be subnets that have VPC endpoints, I am able to finish creating the cluster without a problem.

If these patches are indeed requiring internet access, would it be possible to give a warning at synth time stating that the selected subnets may not be suitable? I believe I've seen warnings like this before when creating another resource that I passed subnets into. Or otherwise, documenting that somewhere would be helpful.
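As a rough illustration of the kind of check being asked for, here is a minimal sketch in plain Python. It is not CDK code: `SubnetInfo`, `egress_warnings`, and the `required_services` set are hypothetical names invented for this example, and which VPC endpoints the patch resources actually need is an assumption the caller would have to supply.

```python
from dataclasses import dataclass, field


@dataclass
class SubnetInfo:
    """Hypothetical description of a subnet (not a CDK type)."""
    subnet_id: str
    has_default_route: bool = False  # True if routed to a NAT or internet gateway
    endpoint_services: set = field(default_factory=set)  # e.g. {"sts", "eks"}


def egress_warnings(subnets, required_services):
    """Return one warning per subnet that has neither an outbound default
    route nor all of the required VPC endpoints."""
    warnings = []
    for subnet in subnets:
        if subnet.has_default_route:
            continue  # outbound access available, nothing to warn about
        missing = sorted(set(required_services) - subnet.endpoint_services)
        if missing:
            warnings.append(
                f"{subnet.subnet_id}: no outbound route and missing VPC endpoints: "
                + ", ".join(missing)
            )
    return warnings
```

In a real CDK app, messages like these could presumably be surfaced at synth time through `Annotations.of(scope).add_warning(...)`, but how the subnet facts would be gathered is left open here.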

jlubins avatar Aug 05 '23 20:08 jlubins

+1 I am having this issue with private subnets in us-east-1/us-west-2. Likely not a region specific issue, but because we use private subnets with VPC Endpoints for most services, this is still not working...

caretak3r avatar Nov 14 '23 14:11 caretak3r

I agree this is not region specific. I think CoreDNS on EKS Fargate currently needs a public subnet for the patch to be applied. This is a bug that should be fixed.

Howlla avatar Nov 14 '23 15:11 Howlla

I agree that it is not region specific, and I support the idea from @Howlla. I created a NAT gateway in the public subnet of the VPC and adjusted my route table to point to the NAT gateway ID, after which I was able to avoid this error. This might help someone else too.
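To check whether a subnet's route table already has the kind of route described above, something like the following sketch might help. It assumes the input is a list of dicts shaped like the `Routes` entries in an EC2 DescribeRouteTables response; `has_nat_default_route` is a hypothetical helper name, and fetching the route table is left out.

```python
def has_nat_default_route(routes):
    """Return True if an active route sends 0.0.0.0/0 to a NAT gateway.

    `routes` is assumed to look like the `Routes` list of an EC2
    DescribeRouteTables response (keys such as DestinationCidrBlock,
    NatGatewayId, GatewayId, State).
    """
    return any(
        route.get("DestinationCidrBlock") == "0.0.0.0/0"
        and bool(route.get("NatGatewayId"))
        and route.get("State", "active") == "active"
        for route in routes
    )
```

A route table containing only the local VPC route returns `False`, which matches the fully private subnets described earlier in the thread.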

Temmy-dev avatar May 07 '24 07:05 Temmy-dev