
eks: fail to create eks nodegroup in cn-north-1

Open BruceLuX opened this issue 2 years ago • 17 comments

Describe the bug

Hi, folks

I ran into a problem when using the AWS CDK for Python to create an EKS cluster. Please find the details below:

My local env:

(.venv) [ec2-user@ip-10-0-1-73 python-cdk]$ cdk --version
2.67.0 (build b6f7f39)
(.venv) [ec2-user@ip-10-0-1-73 python-cdk]$ python3 --version
Python 3.7.10
(.venv) [ec2-user@ip-10-0-1-73 python-cdk]$ cat /proc/version
Linux version 5.10.144-127.601.amzn2.x86_64 (mockbuild@ip-10-0-44-229) (gcc10-gcc (GCC) 10.3.1 20210422 (Red Hat 10.3.1-1), GNU ld version 2.35-21.amzn2.0.1) #1 SMP Thu Sep 29 01:11:59 UTC 2022

Here is the core code:

node_role = iam.Role.from_role_arn(self, 'eks-node-role-arn-lookup', 'arn:aws-cn:iam::xxxxxxxxxxx:role/eks-node-role')

cluster.add_nodegroup_capacity(
    nodegroup_name,
    nodegroup_name=nodegroup_name,
    instance_types=[ec2.InstanceType(instance_type)],
    min_size=1,
    max_size=3,
    capacity_type=capacity_type,
    disk_size=disk_size,
    ami_type=ami_type,
    node_role=node_role
)

I created the node role manually, and with it the CDK deploys successfully. But when I remove the node_role parameter, like this:

cluster.add_nodegroup_capacity(
    nodegroup_name,
    nodegroup_name=nodegroup_name,
    instance_types=[ec2.InstanceType(instance_type)],
    min_size=1,
    max_size=2,
    capacity_type=capacity_type,
    disk_size=disk_size,
    ami_type=ami_type
)

the following error is thrown:

Resource handler returned message: "Following required service principals [ec2.amazonaws.com.cn] were not found in the trust relationships of nodeRole arn:aws-cn:iam::4123xxxxxxx:role/eks-cluster-stack-eksgitlabrunnerclusterNodegroupg-1EPH8PW36YZ3A (Service: Eks, Status Code: 400, Request ID: 6f4cc1b1-4fd2-4072-887c-abc6ddf60d58)" (RequestToken: 7c7be61d-a2a5-3e36-1a34-e6a54c71d72a, HandlerErrorCode: InvalidRequest)
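
The check this error describes can be approximated as follows. This is only a minimal sketch, not the EKS service's actual validation: `missing_service_principals` and the sample trust policy are hypothetical, written to show why a role that trusts only `ec2.amazonaws.com` would be rejected when `ec2.amazonaws.com.cn` is required.

```python
from typing import Any, Dict, Iterable, List


def missing_service_principals(
    trust_policy: Dict[str, Any], required: Iterable[str]
) -> List[str]:
    """Return the required service principals that no Allow statement trusts."""
    trusted = set()
    for statement in trust_policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal", {})
        # "Principal" may also be the string "*"; this sketch ignores that case.
        if not isinstance(principal, dict):
            continue
        service = principal.get("Service", [])
        # IAM accepts either a single string or a list of strings here.
        if isinstance(service, str):
            service = [service]
        trusted.update(service)
    return [name for name in required if name not in trusted]


# Hypothetical trust policy resembling the one on the failing node role.
generated_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

print(missing_service_principals(generated_policy, ["ec2.amazonaws.com.cn"]))
```

Running the same helper against the trust policy of a manually created role is a quick way to compare it with the synthesized one.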

But I think the principal [ec2.amazonaws.com.cn] is correct for the cn-north-1 region.

Could you please help look into this problem?

Expected Behavior

When I do not specify the node role in the method, I expect CDK to create the node role automatically.

Method doc : https://docs.aws.amazon.com/cdk/api/v1/python/aws_cdk.aws_eks/Cluster.html#aws_cdk.aws_eks.Cluster.add_nodegroup_capacity

Current Behavior

In the cn-north-1 region, CDK fails to create the node role.

I checked the principal on another EC2 role of mine, and [ec2.amazonaws.com.cn] is configured correctly there.

It seems that CDK does not recognize this principal.

Reproduction Steps

Using the CDK code above, remove node_role and deploy to the cn-north-1 region; the nodegroup creation fails.

Possible Solution

Create the node role manually and hard-code its ARN in the CDK code.

Additional Information/Context

No response

CDK CLI Version

2.67.0

Framework Version

No response

Node.js Version

v16.18.0

OS

Amazon Linux2

Language

Python

Language Version

3.7.10

Other information

No response

BruceLuX avatar Mar 20 '23 09:03 BruceLuX

Hi,

Let me clarify this first.

  1. Does it happen only when you update your EKS deployment by removing your custom nodeRole?
  2. Are you having this error in cn-north-1?

pahud avatar Mar 20 '23 15:03 pahud

I found the root cause here:

https://github.com/aws/aws-cdk/blob/3b7431b6ac27f8557c22a8959ae1ce431f6d2167/packages/%40aws-cdk/aws-eks/lib/managed-nodegroup.ts#L380

In China, this should be ec2.amazonaws.com.cn instead.
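
As an illustration of the region-dependent choice described here, the selection could look roughly like the sketch below. It is not the CDK's implementation; `ec2_service_principal` is a hypothetical helper, and treating every region whose name starts with `cn-` as a China region is an assumption made for the example.

```python
def ec2_service_principal(region: str) -> str:
    """Pick the EC2 service principal by region name.

    Assumption: regions prefixed with "cn-" belong to the China partition
    and use the ".com.cn" suffix, as reported in this issue.
    """
    if region.startswith("cn-"):
        return "ec2.amazonaws.com.cn"
    return "ec2.amazonaws.com"


print(ec2_service_principal("cn-north-1"))
print(ec2_service_principal("us-east-1"))
```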

pahud avatar Mar 20 '23 18:03 pahud

OK I guess https://github.com/aws/aws-cdk/pull/22589 broke this.

image

This has been removed in https://github.com/aws/aws-cdk/pull/22589 but actually required for AWS China region.

pahud avatar Mar 20 '23 22:03 pahud

@pahud Hi Pahud, many thanks for your troubleshooting. I also ran into a similar problem when creating the EKS cluster with CDK.

Here I only create the EKS cluster, without a nodegroup or nodegroup role. This is my CDK code:

class EksClusterStack(Stack):
    def __init__(self, scope: Construct, identifier, **kwargs):
        super().__init__(scope, identifier, **kwargs)

        vpc = ec2.Vpc.from_lookup(
            self, "my-vpc", vpc_id=vpc_id
        )

        # eks cluster
        cluster = self.create_eks_cluster(vpc)
        
        """
        CfnOutput(self, "eks-cluster-arn-export", value=cluster.cluster_name, export_name="eks-cluster-name")
        """

    def create_eks_cluster(self, vpc):
        cluster = eks.Cluster(
            self,
            "eks-gitlab-runner-cluster",
            cluster_name=cluster_name,
            vpc=vpc,
            version=eks.KubernetesVersion.V1_24,
            default_capacity=0,
        )
        return cluster

I deployed the stack in cn-north-1, but it eventually rolled back.

The CloudFormation stack reported that a nested stack failed to create, so I checked the error message from the nested stack and found the following log: Policy arn:aws-cn:iam::aws:policy/AmazonElasticContainerRegistryPublicReadOnly does not exist or is not attachable. (Service: AmazonIdentityManagement; Status Code: 404; Error Code: NoSuchEntity; Request ID: 99a1c2a5-992a-4e47-a0d0-357d9c73c70d; Proxy: null)

I am sure that 'AmazonElasticContainerRegistryPublicReadOnly' is an AWS managed policy that is only available in the global regions; I cannot find this IAM policy in the China regions.
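
To make the failing ARN concrete, here is a rough sketch of how a partition-aware managed policy ARN is assembled and how a policy missing from the China partition could be skipped. All helper names are hypothetical, the "unavailable" set contains only the one policy reported in this thread, and `AmazonEKSWorkerNodePolicy` appears purely as a second sample name.

```python
from typing import Iterable, List

# Assumption: taken from this thread only, not an exhaustive list.
UNAVAILABLE_IN_AWS_CN = {"AmazonElasticContainerRegistryPublicReadOnly"}


def partition_for_region(region: str) -> str:
    """Map a region name to its partition (assumption: "cn-" means aws-cn)."""
    return "aws-cn" if region.startswith("cn-") else "aws"


def managed_policy_arns(region: str, policy_names: Iterable[str]) -> List[str]:
    """Build AWS managed policy ARNs, skipping ones known to be absent in aws-cn."""
    partition = partition_for_region(region)
    arns = []
    for name in policy_names:
        if partition == "aws-cn" and name in UNAVAILABLE_IN_AWS_CN:
            continue  # attaching this one fails with NoSuchEntity
        arns.append(f"arn:{partition}:iam::aws:policy/{name}")
    return arns


names = ["AmazonEKSWorkerNodePolicy", "AmazonElasticContainerRegistryPublicReadOnly"]
print(managed_policy_arns("cn-north-1", names))
print(managed_policy_arns("us-east-1", names))
```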

Could you please check whether this has the same root cause?

BruceLuX avatar Mar 21 '23 05:03 BruceLuX

Yes I can't deploy even this to cn-north-1

import { App, Stack, StackProps,
  aws_eks as eks,
  aws_ec2 as ec2, 
} from 'aws-cdk-lib';
import { KubectlV25Layer as KubectlLayer } from '@aws-cdk/lambda-layer-kubectl-v25';

const vpc = ec2.Vpc.fromLookup(this, 'Vpc', { isDefault: true });
const cluster = new eks.Cluster(this, 'Cluster', {
  vpc,
  version: eks.KubernetesVersion.V1_25,
  kubectlLayer: new KubectlLayer(this, 'KubectlLayer'),
})

The error message is just as you described above:

Resource handler returned message: "Following required service principals [ec2.amazonaws.com.cn] were not found in the trust relationships of nodeRole arn:aws-cn:iam::4123xxxxxxx:role/eks-cluster-stack-eksgitlabrunnerclusterNodegroupg-1EPH8PW36YZ3A (Service: Eks, Status Code: 400, Request ID: 6f4cc1b1-4fd2-4072-887c-abc6ddf60d58)" (RequestToken: 7c7be61d-a2a5-3e36-1a34-e6a54c71d72a, HandlerErrorCode: InvalidRequest)

Looks like the EKS is expecting ec2 service principal name as ec2.amazonaws.com.cn but CDK is giving ec2.amazonaws.com. I am still working on this to get it sorted with internal teams.

pahud avatar Mar 21 '23 14:03 pahud

@Bruce-Lu674 I created https://github.com/aws/aws-cdk/issues/24743 for the missing AmazonElasticContainerRegistryPublicReadOnly bug FYR.

pahud avatar Mar 22 '23 14:03 pahud

@pahud Many thanks for your help. By the way, is there an expected resolution time for this issue? I can use the old version (2.65) to create the EKS cluster, but I don't think staying on the old version is a long-term solution.

BruceLuX avatar Mar 23 '23 08:03 BruceLuX

@Bruce-Lu674 The relevant team is working on it. I don't have an ETA at this moment, but I will update here when I see the issue is fixed (hopefully very soon).

btw, are you able to successfully deploy eks with cdk 2.65 in cn-north-1?

pahud avatar Mar 24 '23 19:03 pahud

@pahud Yes, I can deploy the EKS Cluster via cdk v2.65 in cn-north-1.

BruceLuX avatar Mar 25 '23 03:03 BruceLuX

Hi @Bruce-Lu674

Are you able to deploy the cluster AND a nodegroup with cdk v2.65.0 in cn-north-1 like this?


const cluster = new eks.Cluster(this, 'Cluster', {
  vpc,
  version: eks.KubernetesVersion.V1_24,
  defaultCapacity: 0,
  kubectlLayer,
});
const ng = cluster.addNodegroupCapacity('NG', {
  desiredSize: 2,
});  

pahud avatar Mar 28 '23 20:03 pahud

Hi Pahud @pahud, yes, I can create the EKS cluster with v2.65 and v2.66, but without the Nodegroup resource, like this:

const cluster = new eks.Cluster(this, 'Cluster', {
  vpc,
  version: eks.KubernetesVersion.V1_24,
  defaultCapacity: 0,
  kubectlLayer,
});

Here is my python code:

vpc = ec2.Vpc.from_lookup(
    self, "my-vpc", vpc_id=vpc_id
)
# eks cluster
cluster = self.create_eks_cluster(vpc)

def create_eks_cluster(self, vpc):
    cluster = eks.Cluster(
        self,
        "eks-cluster",
        cluster_name=cluster_name,
        vpc=vpc,
        default_capacity=0,
        version=eks.KubernetesVersion.V1_24
    )
    return cluster

BruceLuX avatar Mar 31 '23 09:03 BruceLuX

@Bruce-Lu674

Unfortunately I can't even successfully deploy the cluster. I'll keep diving deep for the root cause.

btw, do you have account on cdk.dev slack? Can you ping me on the slack so we can directly discuss more details?

pahud avatar Mar 31 '23 16:03 pahud

Hi

I am on CDK version 2.74.0, and this is still an issue. Any updates / ETA on a fix?

Thanks

ItielOlenick avatar Apr 16 '23 11:04 ItielOlenick

@ItielOlenick

Looks like the EKS is expecting ec2 service principal name as ec2.amazonaws.com.cn but CDK is giving ec2.amazonaws.com. I am still working on this to get it sorted with internal teams.

We are still working with internal teams to fix this but unfortunately no ETA at this moment. I'll share the update if any.

EKS in CN is having 2 additional issues as well and we probably need to fix them before we are allowed to deploy with the latest CDK.

  • https://github.com/aws/aws-cdk/pull/25215
  • https://github.com/aws/aws-cdk/issues/24358

pahud avatar Apr 20 '23 14:04 pahud

I can confirm we can successfully deploy EKS cluster in China regions with escape hatches as below:

import { KubectlV26Layer as KubectlLayer } from '@aws-cdk/lambda-layer-kubectl-v26';

const cluster = new eks.Cluster(scope, 'EksCluster', {
        vpc,
        version: eks.KubernetesVersion.V1_26,
        kubectlLayer: new KubectlLayer(scope, 'KubectlLayer'),
        defaultCapacity: 2,
    });

// override the service principal for the default nodegroup
overrideServicePrincipal(cluster.defaultNodegroup?.role.node.defaultChild as iam.CfnRole)

const ng = cluster.addNodegroupCapacity('NG', {
  desiredSize: 2,
});

// override the service principal for the additional nodegroup
overrideServicePrincipal(ng.role.node.defaultChild as iam.CfnRole)


function overrideServicePrincipal(role: iam.CfnRole) {
  role.addPropertyOverride('AssumeRolePolicyDocument.Statement.0.Principal.Service', ['ec2.amazonaws.com', 'ec2.amazonaws.com.cn'])
}
% kubectl get no
NAME                                          STATUS   ROLES    AGE     VERSION
ip-10-0-140-206.cn-north-1.compute.internal   Ready    <none>   2m34s   v1.26.2-eks-a59e1f0
ip-10-0-141-57.cn-north-1.compute.internal    Ready    <none>   2m20s   v1.26.2-eks-a59e1f0
ip-10-0-174-210.cn-north-1.compute.internal   Ready    <none>   2m34s   v1.26.2-eks-a59e1f0

This is a temporary fix for this issue from CDK.
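
For readers unfamiliar with escape hatches, the effect of that property override on the synthesized role can be emulated on a plain dictionary. The sketch below is a simplified stand-in, not CDK code: `apply_property_override` and the sample `role_properties` are hypothetical, and unlike the real override this version requires every intermediate key to exist already. In the Python bindings the corresponding call should be `add_property_override` on the `CfnRole`, though that is not exercised here.

```python
import copy
from typing import Any, Dict


def apply_property_override(
    properties: Dict[str, Any], path: str, value: Any
) -> Dict[str, Any]:
    """Return a copy of `properties` with `value` set at the dotted `path`.

    Numeric path segments index into lists, e.g. "Statement.0".
    Simplification: missing intermediate keys raise instead of being created.
    """
    result = copy.deepcopy(properties)
    node: Any = result
    keys = path.split(".")
    for key in keys[:-1]:
        node = node[int(key)] if isinstance(node, list) else node[key]
    last = keys[-1]
    if isinstance(node, list):
        node[int(last)] = value
    else:
        node[last] = value
    return result


# Hypothetical properties of the synthesized node role.
role_properties = {
    "AssumeRolePolicyDocument": {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }
}

patched = apply_property_override(
    role_properties,
    "AssumeRolePolicyDocument.Statement.0.Principal.Service",
    ["ec2.amazonaws.com", "ec2.amazonaws.com.cn"],
)
print(patched["AssumeRolePolicyDocument"]["Statement"][0]["Principal"]["Service"])
```

Listing both principals, as the workaround does, keeps the same stack deployable in the global and the China partitions.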

pahud avatar May 03 '23 17:05 pahud

Hello @pahud ,

We are still encountering the error below when using the latest CDK version to create EKS and related resources such as Helm charts. I tested cdk 2.65.0 and it works, but it is hard for us to stay on that version for other reasons. Is there an ETA or a workaround for this issue?


2023-05-19 14:12:02 UTC+0800 HandlerServiceRoleFCDC14AE CREATE_FAILED Policy arn:aws-cn:iam::aws:policy/AmazonElasticContainerRegistryPublicReadOnly does not exist or is not attachable. (Service: AmazonIdentityManagement; Status Code: 404; Error Code: NoSuchEntity; Request ID: 8a2723e1-3330-40e4-af9c-d45b6e6aa3b3; Proxy: null)


justin007755 avatar May 22 '23 05:05 justin007755

@justin007755 This bug should have been fixed in https://github.com/aws/aws-cdk/pull/25215

Please install the latest AWS CDK and let me know if it works for you.

pahud avatar Aug 02 '23 22:08 pahud

I am able to deploy this to cn-north-1 with the nodegroup up and running. Hence resolving this issue.

Image
import * as cdk from 'aws-cdk-lib/core';
import * as eks from 'aws-cdk-lib/aws-eks';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import { Construct } from 'constructs';
import { KubectlV31Layer } from '@aws-cdk/lambda-layer-kubectl-v31';

export class BjsEksStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Create VPC for EKS cluster
    const vpc = new ec2.Vpc(this, 'EksVpc', {
      maxAzs: 2, // Minimal setup with 2 AZs
      natGateways: 1, // Cost-effective single NAT gateway
    });

    // Create EKS cluster with minimal configuration
    const cluster = new eks.Cluster(this, 'EksCluster', {
      version: eks.KubernetesVersion.V1_31,
      vpc,
      defaultCapacity: 0, // We'll add managed node group separately
      endpointAccess: eks.EndpointAccess.PUBLIC_AND_PRIVATE,
      kubectlLayer: new KubectlV31Layer(this, 'kubectl'),
    });

    // Add managed node group
    cluster.addNodegroupCapacity('DefaultNodeGroup', {
      instanceTypes: [new ec2.InstanceType('t3.medium')],
      minSize: 1,
      maxSize: 3,
      desiredSize: 1,
      diskSize: 20, // GB
      amiType: eks.NodegroupAmiType.AL2_X86_64,
    });

    // Output cluster endpoint
    new cdk.CfnOutput(this, 'ClusterEndpoint', {
      value: cluster.clusterEndpoint,
      description: 'EKS Cluster Endpoint',
    });

    // Output cluster name
    new cdk.CfnOutput(this, 'ClusterName', {
      value: cluster.clusterName,
      description: 'EKS Cluster Name',
    });
  }
}

pahud avatar Dec 16 '25 15:12 pahud

Comments on closed issues and PRs are hard for our team to see. If you need help, please open a new issue that references this one.

github-actions[bot] avatar Dec 16 '25 15:12 github-actions[bot]

Confirmed that aws-eks-v2-alpha deploys in cn-north-1 as well.

pahud avatar Dec 16 '25 15:12 pahud