
Multi-AZ subnet to be selected only if there are available IPs

Open zakariais opened this issue 3 years ago • 28 comments

Version

Karpenter Version: v0.18.1
Kubernetes Version: v1.23

Expected Behavior

Karpenter should select a subnet with available IPs from all subnets available to EKS. We are facing an issue where the subnet in one AZ is running out of IPs; we have one subnet per AZ. I understand the subnet is chosen randomly, which is OK, but a given subnet might run out of IPs, and it would be better if Karpenter selected the subnet with (the most?) available IPs from all subnets available in the VPC.
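A minimal sketch of the selection rule being requested (Python; the function name and data shapes are illustrative, modeled on EC2 DescribeSubnets output, and are not Karpenter's actual internals):

```python
# Sketch of the requested behavior: among eligible subnets, skip
# exhausted ones and prefer the one with the most free IPs.

def pick_subnet(subnets):
    """subnets: list of dicts with 'SubnetId' and
    'AvailableIpAddressCount', as in DescribeSubnets output."""
    usable = [s for s in subnets if s["AvailableIpAddressCount"] > 0]
    if not usable:
        raise RuntimeError("no subnet has free IPs")
    return max(usable, key=lambda s: s["AvailableIpAddressCount"])

subnets = [
    {"SubnetId": "subnet-a", "AvailableIpAddressCount": 0},    # exhausted AZ
    {"SubnetId": "subnet-b", "AvailableIpAddressCount": 120},
    {"SubnetId": "subnet-c", "AvailableIpAddressCount": 45},
]
print(pick_subnet(subnets)["SubnetId"])  # subnet-b
```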

Actual Behavior

If multiple subnets from different AZs are available, Karpenter chooses one randomly, without considering whether the subnet has IPs available. Unfortunately, we are running out of IPs in specific availability zones, and Karpenter keeps creating more instances in those zones without even considering subnets in other AZs.

Steps to Reproduce the Problem

  1. Have multiple subnets that match the subnetSelector from your provisioner, in different AZs. One of those subnets must have no free IPs.
  2. Scale up a deployment so that Karpenter needs to create instances (I use the inflate deployment from the Getting Started tutorial). After a few tries, you will most probably end up with the provisioner selecting that subnet.

Resource Specs and Logs

Provisioner spec:

spec:
  kubeletConfiguration: {}
  labels:
    group: default
  limits: {}
  provider:
    apiVersion: extensions.karpenter.sh/v1alpha1
    instanceProfile: <our_instance_profile_name>
    kind: AWS
    launchTemplate: <our_launch_template_name>
    securityGroupSelector:
      karpenter.sh/cluster/clusterName: "owned"
    subnetSelector:
      karpenter.sh/cluster/clusterName: "owned"
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - spot
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - m5.4xlarge
  - key: topology.kubernetes.io/zone
    operator: In
    values:
    - us-east-1a
    - us-east-1b
    - us-east-1c
  - key: kubernetes.io/arch
    operator: In
    values:
    - amd64
  ttlSecondsAfterEmpty: 30
  ttlSecondsUntilExpired: 86400

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

zakariais avatar Nov 23 '22 07:11 zakariais

Part of your problem comes from using spot, which will choose the cheapest AZ. AZ prices can differ dramatically between them.

the other part of the problem comes from https://github.com/aws/karpenter/issues/2572

you can consider using the CGNAT range to extend your IP space. Using a 100.64/19 per AZ would increase your IP pool enough that you wouldn't run into this issue anytime soon

but yes, Karpenter could do a better job of avoiding AZs without IPs. In the meantime you could use different provisioners, each with different AZs in them, and weight the provisioners to avoid the more heavily used AZs.

FernandoMiguel avatar Nov 23 '22 10:11 FernandoMiguel

We wouldn't want to override Karpenter choosing the cheapest AZ. We think that Karpenter should choose the cheapest AZ, but if the subnet is full -> fall back to the next AZ.

liorfranko avatar Nov 23 '22 15:11 liorfranko

I was just about to create a similar issue. Agreed with @liorfranko on picking the cheapest but falling back to the next available. Also using spot and running out of IPs. I have 3 subnets, each in its own AZ, and it mostly only allocates IPs in the first AZ.

(screenshot of subnet IP allocation)

tjhiggins avatar Nov 23 '22 16:11 tjhiggins

Class C-sized CIDRs are too small for EKS, given that each pod consumes one IP. Take a look at https://aws.amazon.com/blogs/containers/addressing-ipv4-address-exhaustion-in-amazon-eks-clusters-using-private-nat-gateways/
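To put rough numbers on this (AWS reserves 5 addresses in every subnet; with the VPC CNI each pod then consumes one of the remaining IPs):

```python
# Usable IPs per subnet: AWS reserves 5 addresses in each subnet
# (network, VPC router, DNS, one reserved for future use, broadcast).
def usable_ips(prefix_len: int) -> int:
    return 2 ** (32 - prefix_len) - 5

print(usable_ips(24))  # 251  -- a "Class C"-sized subnet
print(usable_ips(19))  # 8187 -- e.g. a 100.64.x.0/19 per AZ
```

A /24 can therefore be drained by just a handful of large nodes running the VPC CNI without prefix mode.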

FernandoMiguel avatar Nov 23 '22 17:11 FernandoMiguel

Class C-sized CIDRs are too small for EKS, given that each pod consumes one IP. Take a look at https://aws.amazon.com/blogs/containers/addressing-ipv4-address-exhaustion-in-amazon-eks-clusters-using-private-nat-gateways/

Can you still expose services publicly if you use a private NAT gateway? We use Traefik behind an NLB on EKS.

In that case, the Karpenter Terraform install tutorial should probably be updated to use larger subnets or a private NAT gateway.

tjhiggins avatar Nov 23 '22 17:11 tjhiggins

@tjhiggins "carrier-grade NAT" is an industry term. It has nothing to do with AWS VPC NAT Gateways; nothing changes in that part of the infrastructure.

Karpenter is agnostic; it's up to practitioners to architect their infrastructure. AWS best practices do recommend using extended subnets so you don't run out of IPs.

FernandoMiguel avatar Nov 23 '22 17:11 FernandoMiguel

Given that Traefik is just another Pod running in the cluster, and the ALB/NLB will be talking to those Pods' IPs, everything is perfectly reachable. Nothing changes except the addressing: your EC2 ENIs keep their IPs in the primary 10.x/16 CIDR, while your Pods will be in 100.64/19.

FernandoMiguel avatar Nov 23 '22 17:11 FernandoMiguel

@tjhiggins "carrier-grade NAT" is an industry term. It has nothing to do with AWS VPC NAT Gateways; nothing changes in that part of the infrastructure.

Karpenter is agnostic; it's up to practitioners to architect their infrastructure. AWS best practices do recommend using extended subnets so you don't run out of IPs.

I understand that Karpenter eventually wants to be agnostic, but it currently only supports AWS and has documentation for creating your VPC, which could be updated to use a private NAT: https://karpenter.sh/v0.18.1/getting-started/getting-started-with-terraform/#create-a-cluster

Thank you for the suggestion on a private nat and I will give that a go.

tjhiggins avatar Nov 23 '22 17:11 tjhiggins

Karpenter does sort IPs for subnets that are in the same AZ. But when a request is made for capacity that doesn't constrain the AZ, we defer to EC2 Fleet to make that decision. Since Fleet is unaware of pods requiring IP addresses, it could make a decision where there's enough IPs for the node but not for pods. We could assume that pods will need IP addresses and limit subnets if they don't have enough IPs for max-pods +1 (probably through instance type offerings). This is probably safe in most cases, although doesn't make sense for cases where you are using an overlay network CNI.

If you are able to add more subnets in the same AZ, Karpenter should select the one with the most IPs available.
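The max-pods + 1 heuristic described above could be sketched like this (illustrative only; the filter, names, and data shapes are not Karpenter's actual code):

```python
def viable_subnets(subnets, max_pods: int):
    # Keep only subnets with room for the node's primary IP plus
    # max-pods pod IPs. As noted above, this over-filters when an
    # overlay-network CNI is used, since pods then don't consume VPC IPs.
    needed = max_pods + 1
    return [s for s in subnets if s["AvailableIpAddressCount"] >= needed]

subnets = [
    {"SubnetId": "subnet-a", "AvailableIpAddressCount": 12},
    {"SubnetId": "subnet-b", "AvailableIpAddressCount": 300},
]
# An m5.4xlarge defaults to max-pods 234 with the VPC CNI, so only
# subnet-b has enough headroom here.
print([s["SubnetId"] for s in viable_subnets(subnets, 234)])  # ['subnet-b']
```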

bwagner5 avatar Nov 24 '22 17:11 bwagner5

You can utilize Pod Topology Spread Constraints to help evenly distribute your workloads across AZs

James-Quigley avatar Jan 06 '23 14:01 James-Quigley

We have experienced the same issue with uneven distribution of nodes across AZs. You can see here Karpenter launching 10 nodes in the same 10.171.236.x subnet in the same AZ:

$ kc get nodes -l karpenter.sh/provisioner-name=default -o wide
NAME                                           STATUS   ROLES    AGE     VERSION                INTERNAL-IP      EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-171-228-10.eu-west-1.compute.internal    Ready    <none>   7d3h    v1.22.15-eks-fb459a0   10.171.228.10    <none>        Amazon Linux 2   5.4.219-126.411.amzn2.x86_64   containerd://1.6.6
ip-10-171-230-123.eu-west-1.compute.internal   Ready    <none>   7h22m   v1.22.15-eks-fb459a0   10.171.230.123   <none>        Amazon Linux 2   5.4.219-126.411.amzn2.x86_64   containerd://1.6.6
ip-10-171-233-121.eu-west-1.compute.internal   Ready    <none>   31m     v1.22.15-eks-fb459a0   10.171.233.121   <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-233-203.eu-west-1.compute.internal   Ready    <none>   31m     v1.22.15-eks-fb459a0   10.171.233.203   <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-233-205.eu-west-1.compute.internal   Ready    <none>   31m     v1.22.15-eks-fb459a0   10.171.233.205   <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-233-30.eu-west-1.compute.internal    Ready    <none>   31m     v1.22.15-eks-fb459a0   10.171.233.30    <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-233-61.eu-west-1.compute.internal    Ready    <none>   4h12m   v1.22.15-eks-fb459a0   10.171.233.61    <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-236-109.eu-west-1.compute.internal   Ready    <none>   30m     v1.22.15-eks-fb459a0   10.171.236.109   <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-236-12.eu-west-1.compute.internal    Ready    <none>   30m     v1.22.15-eks-fb459a0   10.171.236.12    <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-236-126.eu-west-1.compute.internal   Ready    <none>   30m     v1.22.15-eks-fb459a0   10.171.236.126   <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-236-157.eu-west-1.compute.internal   Ready    <none>   30m     v1.22.15-eks-fb459a0   10.171.236.157   <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-236-191.eu-west-1.compute.internal   Ready    <none>   30m     v1.22.15-eks-fb459a0   10.171.236.191   <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-236-200.eu-west-1.compute.internal   Ready    <none>   30m     v1.22.15-eks-fb459a0   10.171.236.200   <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-236-243.eu-west-1.compute.internal   Ready    <none>   30m     v1.22.15-eks-fb459a0   10.171.236.243   <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-236-245.eu-west-1.compute.internal   Ready    <none>   30m     v1.22.15-eks-fb459a0   10.171.236.245   <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-236-64.eu-west-1.compute.internal    Ready    <none>   30m     v1.22.15-eks-fb459a0   10.171.236.64    <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-236-67.eu-west-1.compute.internal    Ready    <none>   30m     v1.22.15-eks-fb459a0   10.171.236.67    <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-239-179.eu-west-1.compute.internal   Ready    <none>   32m     v1.22.15-eks-fb459a0   10.171.239.179   <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-239-196.eu-west-1.compute.internal   Ready    <none>   32m     v1.22.15-eks-fb459a0   10.171.239.196   <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-239-30.eu-west-1.compute.internal    Ready    <none>   33m     v1.22.15-eks-fb459a0   10.171.239.30    <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-239-65.eu-west-1.compute.internal    Ready    <none>   32m     v1.22.15-eks-fb459a0   10.171.239.65    <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6

It seems like Karpenter decided 10 nodes were needed to allocate the pending pods, chose an AZ, and (I assume) randomly chose a subnet from that AZ, then launched all 10 of them in it. While I agree a topology spread constraint can help here, I would still expect Karpenter to implement better randomization logic when choosing the subnet, e.g. select 5 subnets and launch 2 nodes in each.

igoratencompass avatar Jan 09 '23 00:01 igoratencompass

Right now, the algorithm optimizes for cost.

We've heard this feedback a fair bit. One option is to inject an implicit zonal topology rule into each pod as part of scheduling, unless of course the user has defined a different topology rule. This will yield default spread behavior across workloads, resulting in rough capacity balance.
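Until something like that exists, the same spread can be requested explicitly per workload with a standard topologySpreadConstraint (values here are illustrative):

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway   # soft spread; use DoNotSchedule for a hard rule
    labelSelector:
      matchLabels:
        app: my-app   # illustrative label
```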

ellistarn avatar Jan 09 '23 16:01 ellistarn

Wouldn't custom networking (secondary subnets) solve this?

stevehipwell avatar Mar 31 '23 10:03 stevehipwell

Faced the same issue; a workaround I've implemented was to add a stage in my CD that queries the private subnets, retrieves the one with the fewest available IPs, and injects it into a "NotIn" affinity stanza in my Helm chart.
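The core of such a CD stage could be sketched as a pure function over DescribeSubnets-shaped data (fetching the data via the AWS CLI/SDK and templating the affinity are omitted; names are illustrative):

```python
def most_exhausted_az(subnets):
    # Return the AZ whose subnets have the fewest free IPs in total;
    # the result would then be injected into a NotIn requirement on
    # topology.kubernetes.io/zone at deploy time.
    totals = {}
    for s in subnets:
        az = s["AvailabilityZone"]
        totals[az] = totals.get(az, 0) + s["AvailableIpAddressCount"]
    return min(totals, key=totals.get)

subnets = [
    {"AvailabilityZone": "us-east-1a", "AvailableIpAddressCount": 3},
    {"AvailabilityZone": "us-east-1b", "AvailableIpAddressCount": 200},
]
print(most_exhausted_az(subnets))  # us-east-1a
```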

CorianderCake avatar May 03 '23 15:05 CorianderCake

Hi, we had the same issue in our EKS cluster: a subnet with no IP addresses left, and pods taking hours to start because the CNI was waiting for an available IP. We found some workloads not using spread constraints, so fixing that might help. However, I agree that having such a feature would be a nice improvement for Karpenter reliability!

dixneuf19 avatar May 11 '23 15:05 dixneuf19

The main issue that we've run into here is that we have no idea exactly how many pods will schedule to a given node, so when an AZ is close to exhausting its available IPs, it's unclear whether we should exclude that AZ or allow it to pass through.

In the other case, where we just try to prioritize the AZs that have more available IPs, we run into issues with cost optimization because the AZ with more IPs may actually be more expensive.

In a way, this issue is very similar to #1292 where there is an ask to create an implicit provisioner-wide topologySpread on AZs.

jonathan-innis avatar May 17 '23 07:05 jonathan-innis

Wouldn't custom networking (secondary subnets) solve this?

It can help, but you can still run out of IPs. A price or availability difference between AZs can result in all nodes being launched in one AZ.

dougbyrne avatar Aug 07 '23 15:08 dougbyrne

Wouldn't custom networking (secondary subnets) solve this?

It can help, but you can still run out of IPs. A price or availability difference between AZs can result in all nodes being launched in one AZ.

@dougbyrne the custom networking guidance is to split a /16 CIDR block over your zones, so I'm not sure how you would run out of IPs while also not being able to satisfy a topology spread constraint or significantly impacting the instance pricing balance back to equilibrium?

stevehipwell avatar Aug 07 '23 15:08 stevehipwell

I might be thinking of a different feature. I've added additional subnets, but each subnet is still associated with a specific zone. The example given in the AWS docs does the same: https://docs.aws.amazon.com/eks/latest/userguide/cni-custom-network.html#custom-networking-configure-vpc

If I'm missing something I'd love to know because what you're describing is what I want.

dougbyrne avatar Aug 07 '23 16:08 dougbyrne

@dougbyrne you'd be creating a subnet per zone, but based on the recommended /16 CIDR from the CG-NAT space that's over 21k pods per AZ. I'd suggest looking at the EKS best practice guides and configuring both custom networking and IP prefix mode together; IMHO this should be the default configuration.

https://aws.github.io/aws-eks-best-practices/networking/custom-networking/ https://aws.github.io/aws-eks-best-practices/networking/prefix-mode/index_linux/
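For reference, the arithmetic behind "over 21k pods per AZ" (assuming the 100.64.0.0/16 CG-NAT block is split evenly across three zones):

```python
cgnat_block = 2 ** (32 - 16)   # 65536 addresses in 100.64.0.0/16
per_az = cgnat_block // 3      # split evenly across three AZs
print(per_az)  # 21845
```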

stevehipwell avatar Aug 07 '23 16:08 stevehipwell

Currently the issue is: I have, say, 2-3 subnets in each AZ (/24 each), and Spot says az-1 has the cheapest price. Karpenter sees all the subnets for all AZs (including multiple for az-1) but ends up exhausting one subnet completely. It is not distributing the nodes across all subnets in that particular AZ.

cdenneen avatar May 23 '24 18:05 cdenneen

@cdenneen it sounds like you need to configure topology spread constraints if you have more specific requirements than just the cheapest compute.

stevehipwell avatar May 30 '24 11:05 stevehipwell

@stevehipwell there is supposedly a backend issue the Karpenter team knows about: when spot is used, it doesn't take IP exhaustion in the subnet into account. When 1f, for example, is deemed cheapest for spot but there are 2-3 subnets in 1f, it exhausts one and not the others. The first 1f subnet is used for the node and its pods by default, instead of nodes being split across the multiple 1f subnets. Support is working with the Karpenter team on this.

cdenneen avatar May 30 '24 12:05 cdenneen

@cdenneen having Karpenter understand IP exhaustion seems like a sensible idea. But I'd expect it to be largely unnecessary for clusters with topology constraints configured as per recommendations. Unless maybe the AZs have widely different IP sizes.

I guess my point is are you down to your last couple of IPs on all AZs where Karpenter knowing the limits might help, or is your cluster heavily biased towards the AZ with the cheapest instances? Also is there a reason you can't use secondary networking?

Even if Karpenter were to understand IP limits, wouldn't the whole system break down once you had provisioned nodes with more availability for pods than there were free IPs? Karpenter doesn't control scheduling for pods onto existing nodes so this would be a K8s scheduler responsibility.

stevehipwell avatar May 30 '24 13:05 stevehipwell

@stevehipwell balanced usage of AZs does not ensure that the multiple subnets within a single AZ are used in a balanced way. Enhanced subnet discovery in the VPC ENI might help here.

dougbyrne avatar May 30 '24 14:05 dougbyrne

@dougbyrne I agree it could be useful. My point was that secondary networking would make it a non-issue. If secondary networking wasn't possible for some reason (I'm not sure I can think of one), then topology spread would likely take you as far as you'd get within the operating parameters of the K8s scheduler, even if Karpenter were aware.

stevehipwell avatar May 30 '24 20:05 stevehipwell

In our case, we hit this problem because our teams create dozens of namespaces with single-replica pods for development. Topology spread doesn't work because A) the pods are distributed across many namespaces, and spread only applies within a namespace, and B) each pod is a single replica, and spread is really geared toward multi-replica workloads. I guess we could try spreading based on a shared label across all our service pods and see if that helps distribute them across zones, but it feels like that doesn't fit the intended use case for spread and might not work.

While we are trying to switch to a secondary CIDR and larger subnets, we are still concerned about the scenario of a zone failing and taking out the majority of our pods at once. Ideally, Karpenter would have a knob to force zone spread, even when spot pricing is better in one zone, to avoid this scenario.

jessebye avatar Jul 25 '24 17:07 jessebye

@jessebye for your scenario it sounds like you could use a soft pod anti-affinity backed by a label.

If each of your singletons is given the singleton: true label then the following code would spread them out fairly evenly across your zones.

podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
            - key: singleton
              operator: Exists
        topologyKey: topology.kubernetes.io/zone

stevehipwell avatar Jul 26 '24 09:07 stevehipwell

It feels like there are two (maybe three) related but distinct issues being conflated in this discussion -- let me see if I can express this coherently:

  • Sometimes Karpenter provisions nodes in subnets that are close to IP exhaustion because it's choosing the strictly cheapest AZ and doesn't try to balance across subnets in that AZ.
    • There is some indication to me that some folks in this thread would like to spread across AZs within a region in response to aggregate IP availability across all of each zone's subnets, rather than just across subnets within a chosen AZ.
  • Karpenter doesn't in general try to balance new nodes across subnets/AZs.

I tend to agree that the native Kubernetes spread mechanisms are where code written in response to the second main point should land -- overall workload spread itself feels out of scope for Karpenter to me; it should react to it but shouldn't try to enforce it.

In response to the first point and its sub-point, some sort of topologyKey-enabled IP availability term for subnet selection would be nice. I'm imagining something like this (extending the example in the documentation):

subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: "${CLUSTER_NAME}"
      environment: test
    ip-availability:
      minimumAvailable: [integer number of IPs, or percentage of total usable IPs]
      topologyKey: [zone/subnet]
  - id: subnet-09fa4a0a8f233a921

If the topology key is the zone, then zones are first filtered down to those that meet the minimum IP availability specified, and any subnet that has any IPs available within a chosen zone may be selected. If the topology key is the subnet, then any subnet that itself meets the minimum IP availability, in any zone, may be selected. Separate terms could each specify a different topology key, so you could, for example, limit selection to subnets that have at least 30 IPs available, in zones that have at least 10% of their available IPs unused.
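The described semantics could be sketched like this (entirely hypothetical: the `ip-availability` field does not exist today, and `minimum_available` is treated as an absolute IP count for simplicity):

```python
def filter_subnets(subnets, minimum_available, topology_key):
    """subnets: dicts with 'SubnetId', 'AvailabilityZone' and
    'AvailableIpAddressCount'. topology_key: 'zone' or 'subnet'."""
    if topology_key == "subnet":
        # Each subnet must individually meet the minimum.
        return [s for s in subnets
                if s["AvailableIpAddressCount"] >= minimum_available]
    # 'zone': zones are filtered on aggregate availability, then any
    # subnet with at least one free IP in a surviving zone qualifies.
    totals = {}
    for s in subnets:
        az = s["AvailabilityZone"]
        totals[az] = totals.get(az, 0) + s["AvailableIpAddressCount"]
    ok_zones = {az for az, t in totals.items() if t >= minimum_available}
    return [s for s in subnets
            if s["AvailabilityZone"] in ok_zones
            and s["AvailableIpAddressCount"] > 0]

subnets = [
    {"SubnetId": "s-1a-a", "AvailabilityZone": "us-east-1a", "AvailableIpAddressCount": 5},
    {"SubnetId": "s-1a-b", "AvailabilityZone": "us-east-1a", "AvailableIpAddressCount": 40},
    {"SubnetId": "s-1b-a", "AvailabilityZone": "us-east-1b", "AvailableIpAddressCount": 10},
]
print([s["SubnetId"] for s in filter_subnets(subnets, 30, "subnet")])  # ['s-1a-b']
print([s["SubnetId"] for s in filter_subnets(subnets, 30, "zone")])    # ['s-1a-a', 's-1a-b']
```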

omkensey avatar Dec 05 '24 16:12 omkensey