
Support CNI Custom Networking

Open lilley2412 opened this issue 5 years ago • 19 comments

Why do you want this feature?

Proposed feature request to support AWS CNI custom networking. There is an existing WIP PR 786; I'm opening a feature request for discussion because I think it's a big lift and the implementation approach could vary greatly. I'm using custom CNI extensively to increase VPC IP space for pods on corporate-connected VPCs and would like to start using eksctl for cluster provisioning.

CNI custom networking supports 3 main use cases:

  1. Save IP space in a VPC. VPCs connected to a corporate network with VPN or Direct Connect often have limited CIDR sizes, and with each pod consuming an IP they run out fast.
  2. Enable nodes in public subnets with pods in private subnets by externalizing SNAT (disabling aws-node SNAT).
  3. Run pods in security groups different from the nodes'.

Custom CNI can be enabled outside of eksctl today, but requires manual intervention after eksctl builds the cluster.

Important Concepts:

  • Each node's primary ENI is created in a non-pod subnet (i.e. a public or private subnet)
  • Node secondary ENIs are created in pod subnets
  • The node's primary ENI and secondary ENI subnets must be in the same AZ
  • Pod ENIs require a security group assignment (it can be the same as the nodes' or different, but it must be specified)
  • Behavior is driven by an ENIConfig CRD that must exist before the node registers
  • The aws-node daemonset must be patched before the node registers to enable the custom networking feature
  • Using CNI custom networking effectively removes the IP capacity of one ENI from the node, because no secondary IPs of the primary ENI are used for pods. As of today the EKS-optimized AMI uses a static file to set max-pods; max-pods must be overridden when custom networking is used to prevent the scheduler from placing pods that cannot get an IP. See AWS CNI issue 527; a short sketch of the formula follows this list.
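
To make the max-pods math concrete, here is a minimal Go sketch of the formula discussed in the linked CNI issues (the maxPods helper name is mine; the m5.large limits of 3 ENIs with 10 IPv4 addresses each are taken from the EC2 ENI limits table):

    package main

    import "fmt"

    // maxPods estimates the kubelet --max-pods value from an instance type's
    // ENI limits. The +2 accounts for aws-node and kube-proxy, which use host
    // networking and do not consume VPC IPs.
    func maxPods(enis, ipsPerENI int, customNetworking bool) int {
        if customNetworking {
            // the primary ENI's secondary IPs are not used for pods
            enis--
        }
        // one address on each ENI is the ENI's own primary IP
        return enis*(ipsPerENI-1) + 2
    }

    func main() {
        // m5.large: 3 ENIs, 10 IPv4 addresses per ENI
        fmt.Println(maxPods(3, 10, false)) // 29 (matches eni-max-pods.txt)
        fmt.Println(maxPods(3, 10, true))  // 20 with custom networking
    }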

What feature/behavior/change do you want?

  • Support CNI custom networking for pre-existing VPCs by adding new ClusterSubnet types "pod-private" and "pod-public":
    vpc:
      #############################################################
      # existing functionality to specify existing vpc and subnets
      id: "vpc-1"
      cidr: "10.0.0.0/16"
      subnets:
        private:
          us-east-1a:
            id: "subnet-1"
            cidr: "10.0.0.0/24"
          us-east-1b:
            id: "subnet-2"
            cidr: "10.0.1.0/24"

        #############################################################
        # new functionality to specify existing pod subnets
        # import a public pod subnet
        pod-public:
          us-east-1a:
            id: "subnet-1"
            cidr: "100.64.0.0/16"

        # import a private pod subnet
        pod-private:
          us-east-1b:
            id: "subnet-2"
            cidr: "100.65.0.0/16"
    
  • Add a new optional property to nodegroup to specify a pod subnet by ID
  • Extend the concept of subnet IDs to worker subnets as well while making this change?
  • Custom CNI networking is enabled either by specifying pod subnets for a pre-existing VPC or by specifying customCniPodCidrs for a new VPC (see the new-VPC use case below).
  • Config validations:
    • A pod subnet must exist in each AZ that has a public or private subnet (this is a requirement of ENI allocation: the CNI creates the primary ENI in the node subnet and secondary ENIs in pod subnets on the same EC2 node, so the AZs must match).
  • Use ClusterSharedNodeSecurityGroup as the pod SG; I think this is a reasonable first-version approach. Adding a dedicated pod SG falls under use case 3, but that would require additional cluster-level config for attaching pre-existing SGs or creating new pod SGs (I'm not sure there is value in a flag to create a new pod SG; how would its rules differ from the shared node SG? Maybe allowing attachment of an existing SG is enough).
  • Label nodegroups based on the pod subnet ID selection and update ENI_CONFIG_LABEL_DEF with the label name
  • Implement formula to set --max-pods based on instance type. See https://github.com/aws/amazon-vpc-cni-k8s/issues/331. May require maintaining a file like amazon-eks-ami/eni-max-pods.txt.
  • Support CNI custom networking for new VPCs with a list that defines custom CNI pod CIDRs. The CIDRs can be the same as the VPC's or different (if different, they can't overlap). Ex:
    vpc:
      cidr: "10.0.0.0/16"
      customCniPodCidrs:
      - "100.64.0.0/16"
      - "100.65.0.0/16"
      - "100.66.0.0/16"
      # could also be the same as the VPC as a way to enable custom
      # CNI without adding an additional boolean flag (not sure this
      # would be a very common use case). Ex:
      # customCniPodCidrs:
      # - "10.0.0.0/16"
    
    • For new VPC creation:
      • Need to identify the correct algorithm to determine the number and size of pod subnets; at a minimum it must provide one pod subnet per AZ used by public/private subnets (see the sketch after this list).
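
As a starting point for that algorithm, here is a minimal Go sketch of splitting one pod CIDR into equally sized per-AZ subnets (the function name and the power-of-two rounding are assumptions, not existing eksctl code):

    package main

    import (
        "encoding/binary"
        "fmt"
        "net"
    )

    // splitIPv4CIDR divides an IPv4 CIDR into count equally sized child
    // subnets, rounding count up to the next power of two, e.g. a /16 into
    // four /18s so each AZ gets its own pod subnet.
    func splitIPv4CIDR(parent string, count int) ([]string, error) {
        _, network, err := net.ParseCIDR(parent)
        if err != nil {
            return nil, err
        }
        extraBits := 0
        for 1<<extraBits < count {
            extraBits++
        }
        ones, _ := network.Mask.Size()
        childOnes := ones + extraBits
        if childOnes > 28 {
            return nil, fmt.Errorf("%s is too small to split into %d subnets", parent, count)
        }
        base := binary.BigEndian.Uint32(network.IP.To4())
        blockSize := uint32(1) << uint(32-childOnes)
        n := 1 << extraBits
        children := make([]string, 0, n)
        for i := 0; i < n; i++ {
            ip := make(net.IP, 4)
            binary.BigEndian.PutUint32(ip, base+uint32(i)*blockSize)
            children = append(children, fmt.Sprintf("%s/%d", ip, childOnes))
        }
        return children, nil
    }

    func main() {
        // one pod subnet per AZ for a three-AZ cluster (rounded up to four)
        subnets, _ := splitIPv4CIDR("100.64.0.0/16", 3)
        fmt.Println(subnets) // [100.64.0.0/18 100.64.64.0/18 100.64.128.0/18 100.64.192.0/18]
    }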

Implementation Notes

Resources should be created in this order; if nodes register before custom CNI networking is fully configured, they will immediately get secondary ENIs assigned from non-pod subnets, requiring the nodes (or the kubelet process) to be restarted after the CNI is configured (or the nodegroup re-created).

  • Create / validate VPC
  • Create cluster (includes ClusterSharedNodeSecurityGroup)
  • patch the aws-node daemonset:
        spec:
          containers:
          - env:
            - name: AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG
              value: "true"
            - name: ENI_CONFIG_LABEL_DEF
              value: TODO - label name that will be used for nodes
    
  • Create ENIConfig CRD
    apiVersion: apiextensions.k8s.io/v1beta1
    kind: CustomResourceDefinition
    metadata:
      name: eniconfigs.crd.k8s.amazonaws.com
    spec:
      scope: Cluster
      group: crd.k8s.amazonaws.com
      version: v1alpha1
      names:
        plural: eniconfigs
        singular: eniconfig
        kind: ENIConfig
    
  • Create 1 ENIConfig resource per AZ, using ClusterSharedNodeSecurityGroup, named by the AZ to support auto-apply via ENI_CONFIG_LABEL_DEF (a short generation sketch follows these notes):
    apiVersion: crd.k8s.amazonaws.com/v1alpha1
    kind: ENIConfig
    metadata:
      name: <az-name>
    spec:
      securityGroups:
      - <ClusterSharedNodeSecurityGroup ID>
      subnet: <pod subnet ID>
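
For the per-AZ ENIConfig step, here is a minimal Go sketch of rendering one manifest per AZ (the helper name, its inputs, and the example IDs are hypothetical, not an existing eksctl API):

    package main

    import "fmt"

    // eniConfigManifest renders the ENIConfig for a single AZ, named after the
    // AZ so ENI_CONFIG_LABEL_DEF can auto-select it for nodes in that zone.
    func eniConfigManifest(az, podSubnetID, sharedNodeSG string) string {
        return "apiVersion: crd.k8s.amazonaws.com/v1alpha1\n" +
            "kind: ENIConfig\n" +
            "metadata:\n" +
            "  name: " + az + "\n" +
            "spec:\n" +
            "  securityGroups:\n" +
            "  - " + sharedNodeSG + "\n" +
            "  subnet: " + podSubnetID + "\n"
    }

    func main() {
        // hypothetical pod subnet IDs per AZ and the ClusterSharedNodeSecurityGroup ID
        podSubnets := map[string]string{"us-east-1a": "subnet-pod-a", "us-east-1b": "subnet-pod-b"}
        for az, subnetID := range podSubnets {
            fmt.Println("---")
            fmt.Print(eniConfigManifest(az, subnetID, "sg-0123456789abcdef0"))
        }
    }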
    

Proposed Future Enhancements

  • Allow pod SGs that are different from node SGs (requires new config at the cluster level).
  • Support external SNAT (nodes in public subnets with pods in private subnets); requires a new config setting and further patching of aws-node (see the AWS docs on external SNAT).

lilley2412 avatar Aug 05 '19 18:08 lilley2412

@lilley2412 Great writeup!

Do you think there's any scenario in which someone might want to use public subnets for the pods? I always used private, but when I previously started coding this I was wondering whether it should have the option of both or not.

    # new functionality to specify existing pod subnets
    pod:
      us-east-1a:
        id: "subnet-1"
        cidr: "100.64.0.0/16"

      us-east-1b:
        id: "subnet-2"
        cidr: "100.65.0.0/16"

For customCniPodCidr do you think there'd only be one? @errordeveloper and I had discussed having an array of CIDRs. I think we need to think about it for eksctl creating subnets too (thus far my code creates them). It would also be more of a future thing, but this way, with multiple CIDRs and a group inside the pod subnets for each CIDR, different groups can be assigned to different nodegroups. And obviously we only need some things at first, but it does affect the layout of the YAML, and it'd be better if it doesn't drastically change.

Use ClusterSharedNodeSecurityGroup as the pod SG

Agreed

tiffanyfay avatar Aug 05 '19 20:08 tiffanyfay

@tiffanyfay thanks for the feedback

Do you think there's any scenario in which someone might want to use public subnets for the pod?

I can't think of a good reason to use custom CNI with public pod subnets; all the use cases for custom CNI lend themselves to security or to saving public or private "routable" IP space, so public pod subnets just don't seem to make sense. That doesn't mean someone won't want it one day, so I also see no strong reason to prohibit it. Not sure of the cleanest way; maybe simply pod-public and pod-private for subnets? Ex:

    # import a public pod subnet
    pod-public:
      us-east-1a:
        id: "subnet-1"
        cidr: "100.64.0.0/16"

    # import a private pod subnet
    pod-private:
      us-east-1b:
        id: "subnet-2"
        cidr: "100.65.0.0/16"

For customCniPodCidr do you think there'd only be one? @errordeveloper and I had discussed having an array of cidrs

I think this makes sense; it's more flexible and possibly even less work than I was thinking. If one wanted to use more than a /10 from the 100.64 range they could just add multiple CIDRs as /16s or smaller (not that it would be restricted to 100.64, that's just a common use case for non-routable pods). I will edit the request to reflect this.

lilley2412 avatar Aug 05 '19 21:08 lilley2412

@lilley2412 Yah, I can't think of a case, but who knows...we can always handle public later if it's asked for. What do you think in terms of how to specify which set is for which nodegroup?

What we had so far in terms of specifying, in the same naming scheme as yours, is something like this...but I need to look at the code again to see whether it grabs the main CIDR by default throughout, because the way I originally had it the pod subnets were in line with the cluster subnets, and I need to change that to nest them under it.

pod-private:
    group1:
        cidr: (optional)
        us-east-1a:
            id: "subnet-1"
            cidr: "100.64.0.0/16"
    group2:
        cidr: (optional)
        us-east-1a:
            id: "subnet-1"
            cidr: "100.65.0.0/16"

This way ng1 can use group1 and ng2 can use group2. Then nodegroups would need some sort of designation as to which to use. I also need to think on it more after moving pod subnets under cluster subnets, to see where it grabs everything and what makes the most sense then.

tiffanyfay avatar Aug 05 '19 21:08 tiffanyfay

@lilley2412 For the case when eksctl creates all the subnets and the cluster, there's also the case that we will need to be able to set a worker node CIDR, in case they want it to be non-RFC1918.

ATM I am having trouble getting the autogeneration to work for the deepcopy file.

tiffanyfay avatar Aug 06 '19 23:08 tiffanyfay

@tiffanyfay

What do you think in terms of how to specify which set is for which nodegroup?

At first I thought this wasn't required because ENIConfig just has to match the AZ of the node.

Poking around I found https://github.com/weaveworks/eksctl/issues/806 and https://github.com/weaveworks/eksctl/issues/475#issuecomment-498995479 (@errordeveloper indicated this suggestion would be implemented). If it's implemented this way, I think it makes sense to extend that config by nesting the pod subnets under the respective node subnets, because a pod subnet is meaningless without a corresponding node subnet. Something like the following: it would also take care of private/public and any supported combination of IP ranges, and doesn't require re-specifying the AZ to match the node:

vpc:
  subnets:
    private:
      data-access-subnet:
         cidr: 10.1.0.0/16
         az: us-east-1a
         pod-subnet:
            cidr: 100.64.0.0/16
      high-level-subnet:
         cidr: 10.2.0.0/16
         az: us-east-1a
         pod-subnet:
            cidr: 100.65.0.0/16

This would change how ENIConfig is resolved, since nodes don't already have a subnet label; something like:

  • ENI_CONFIG_LABEL_DEF=cni-networking/pod-subnet-id on aws-node daemonset
  • name ENIConfig by pod subnet id
  • set label cni-networking/pod-subnet-id on nodeGroup creation

This may also need to allow the same pod subnet to be used for two different node subnets; not sure if that is a useful use case though.

lilley2412 avatar Aug 07 '19 15:08 lilley2412

@lilley2412 Also, if you have two node groups and two custom subnet groups then you would need to specify which you want.

Hmm...This is what I had shared with @errordeveloper and he said something like that and to discuss naming later.

pod-private: #which could just be custom
    group1:
        cidr: (optional)
        us-east-1a:
            id: "subnet-1"
            cidr: "100.64.0.0/16"
    group2:
        cidr: (optional)
        us-east-1a:
            id: "subnet-1"
            cidr: "100.65.0.0/16"

And had figured for the nodegroup it would have a field for subnet groups. But then again, this doesn't cover the case of using these for worker node subnets...

It sounds like we need to check with Ilya as to whether he wants it specifically like https://github.com/weaveworks/eksctl/issues/475#issuecomment-498995479, but he is pretty swamped with IAM for pods.

The way you have it seems different. Your way seems to imply each named group is a single subnet. But that has the issue you covered: if you want to make a group and have the subnets created based on a CIDR block, would it create one for every AZ if only the CIDR is listed? It seems really nested though.

Also, are you in any of the slack groups?

tiffanyfay avatar Aug 07 '19 21:08 tiffanyfay

Also, if you have two node groups and two custom subnet groups then you would need to specify which you want

For current-state eksctl I only see the ability to specify an AZ list on a nodeGroup; I don't see any way to specify a subnet. That's why my first reaction was that it doesn't matter: if you can only specify an AZ on the nodeGroup, then the node -> AZ -> ENIConfig -> subnet association is enough, because there is no way to control the nodeGroup -> subnet association (without doing it in the ASG outside of eksctl). So I was thinking this is only a concern if some of the other feature requests that add a subnet property to nodeGroup get implemented.

However it ends up, I personally don't care too much so long as it's clear, so the approach you show is fine I think; but like I said, unless I'm missing something I don't see the ng -> subnet association yet.

Also, are you in any of the slack groups?

I am not, but will take a look to catch up on the discussions.

lilley2412 avatar Aug 08 '19 20:08 lilley2412

@tiffanyfay thinking about this more, I guess the point of what you and @errordeveloper were proposing is: even if nodeGroup -> worker subnet is not implemented yet, let's add support for pod subnets with this feature? It probably makes sense to keep them decoupled even if it means two subnet properties on nodeGroup (or refactoring when worker subnets are implemented?)

So I think it makes sense:

pod-private: #which could just be custom
    group1:
        cidr: (optional)
        us-east-1a:
            id: "subnet-1"
            cidr: "100.64.0.0/16"
    group2:
        cidr: (optional)
        us-east-1a:
            id: "subnet-1"
            cidr: "100.65.0.0/16"

And some way to specify ng -> subnet

So I guess this also means we don't use the failure-domain zone label for ENIConfig; it would just have to be a node label based on the subnet specified on the ng.

lilley2412 avatar Aug 08 '19 20:08 lilley2412

@lilley2412 Sounds good to me.

tiffanyfay avatar Aug 08 '19 20:08 tiffanyfay

Custom networking will also affect max-pods, https://github.com/aws/amazon-vpc-cni-k8s/issues/331. You'll need to update the map or use the formula in the issue I referenced.

jicowan avatar Aug 30 '19 13:08 jicowan

@jicowan at first I was thinking that it could be handled with --max-pods as a kubelet arg manually, assuming there would be a solution elsewhere. That doesn't look easily achievable though, since at node creation time it cannot be known what may get configured with ENIConfig, so I agree eksctl should handle this during the nodegroup build. I can't see any good alternative to maintaining a file similar to amazon-eks-ami/eni-max-pods.txt to support the formula logic, though.

lilley2412 avatar Sep 14 '19 18:09 lilley2412

Just noticed eksctl already has https://github.com/weaveworks/eksctl/blob/master/pkg/nodebootstrap/maxpods_generate.go, based on https://github.com/awslabs/amazon-eks-ami/blob/master/files/eni-max-pods.txt; unfortunately the source file does not have the IPs per interface, so something further is still needed to have that available.

lilley2412 avatar Sep 23 '19 11:09 lilley2412

This issue has been open for almost 3 years. @Himangini @martina-if @Callisto13 is there any ETA? I want to avoid the manual workaround - https://www.arhea.net/posts/2020-06-19-amazon-eks-secondary-cidr.html

yanivpaz avatar Jun 30 '22 14:06 yanivpaz

This issue has been open for almost 3 years. @Himangini @martina-if @Callisto13 is there any ETA? I want to avoid the manual workaround - https://www.arhea.net/posts/2020-06-19-amazon-eks-secondary-cidr.html

@yanivpaz, we're occupied with higher-priority tasks at the moment but we'll review the proposal soon.

cPu1 avatar Jun 30 '22 14:06 cPu1

This issue has been open for almost 3 years. @Himangini @martina-if @Callisto13 is there any ETA? I want to avoid the manual workaround - https://www.arhea.net/posts/2020-06-19-amazon-eks-secondary-cidr.html

@yanivpaz thanks for showing interest in this issue. We are working towards prioritizing this. We're happy to accept PRs if you have a solution in mind 💡

@lilley2412 I know it's been a long time but are you still keen on this feature for eksctl?

Himangini avatar Jun 30 '22 14:06 Himangini

I can think of one use case where this is desirable:

EKS as a service consumes a lot of IPs.

If we're deploying into a VPC that wasn't designed with that in mind, we may be faced with a situation where the subnets don't have enough IPs to handle all the pod IPs.

adelwin avatar Aug 12 '22 10:08 adelwin

This is a very important case: we need alternate CIDR support without the manual ENIConfigs and hassle. eksctl should be able to support that, and it also helps us overcome these problems:

  1. We don't want to pollute and waste lots of IPs in the main subnet, which we want to keep small rather than a huge chunk like /18 or /19 just for EKS.
  2. We want to get rid of the manual configuration needed to support 100.64.x.x subnets.
  3. It makes it a lot easier to handle the available pod CIDRs and extend the maximum available pods per ENI.

Not sure if I have missed anything. Please provide additional input on this.

defyjoy avatar Aug 13 '22 12:08 defyjoy

Some key differences in eksctl from when the feature was initially discussed and now when I'm writing this comment:

  • we can have multiple node subnets in the same AZ
  • we have a subnets field under nodegroup which allows specifying in which subnets the worker nodes will be created. This can be done either by subnet ID or alias.

With that in mind, I'm splitting the implementation in two cases:

1. Pod subnets are user defined: It falls under the user's responsibility to define at least one pod subnet per used AZ. Moreover, it is also the user's responsibility to correctly assign subnet & pod subnet combinations to each nodegroup. This can / should be ensured by adding appropriate eksctl validations. Considering that the previously discussed use cases assume pod subnets shall be private, I propose the following high-level config file structure:

vpc:
  ...
  subnets:
    private:
    ...
    public:
    ...
    pod-subnets:
    ...

2. Pod subnets are not user defined, hence eksctl needs to create them: As mentioned above, this will be achieved by not providing any pod-subnets and instead specifying customCniPodCidrs. eksctl will be responsible for creating one private pod subnet, per pod cidr, per each AZ previously configured (either user defined or eksctl determined). For splitting up the CIDR ranges we already have a utility function. Additionally, during nodegroup creation, after assigning subnets, we shall track in which AZs the nodegroup will reside based on the set of subnets assigned. Thus, we shall also assign to the nodegroup, all pod subnets that are part of those AZs.

The other implementation steps (notes) presented in the opening comment are pretty much still relevant. All ENI config related steps shall be covered as part of a post cluster creation task.

TiberiuGC avatar Sep 25 '23 08:09 TiberiuGC

@TiberiuGC Any idea when this could be implemented? For example, I want to use Submariner, but pods and nodes cannot be within the same range; of course I can try to customize the AWS CNI, but it would be nice to have that within eksctl.

shinji62 avatar Feb 22 '24 05:02 shinji62