karpenter-provider-aws

Karpenter doesn't factor in existing node types when scaling up, causing production outages due to spot interruptions

Open jtnz opened this issue 6 months ago • 7 comments

Description

Observed Behavior: Karpenter is configured to be able to choose from 100+ instance types, but several times in production Karpenter has ended up scaling onto the same spot instance type, and we then lose ~90%+ of the cluster due to spot interruptions.

Tried to use topologySpreadConstraints like so:

...
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: node.kubernetes.io/instance-type
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
          ...

The hope was that this would cause Karpenter to use different instance types. Unfortunately this just causes Kubernetes to select from ALL the different instance types, resulting in pods that cannot be scheduled because the NodePool doesn't allow those types; the scheduler is not aware of the NodePool and therefore of which instance types can actually be provisioned.
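
For completeness, the best-effort variant of the same constraint (assuming ScheduleAnyway semantics are acceptable for the workload) at least avoids pods getting stuck pending, but it no longer guarantees any instance-type diversity:

...
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: node.kubernetes.io/instance-type
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
          ...

In other words, it trades the hard diversity guarantee for schedulability, so it doesn't address the underlying problem.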

Expected Behavior: Karpenter will use diverse spot instance types to prevent production outages when another customer requests all of one type of node (the same type Karpenter is already using).

Reproduction Steps (Please include YAML): This is a simplified version of our production setup that reproduces the issue quickly.

One NodePool that will have all m5*.large instance-types available to use:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  limits:
    cpu: 100
  template:
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: [ 'amd64', 'arm64' ]
        - key: karpenter.sh/capacity-type
          operator: In
          values: [ 'spot', 'on-demand' ]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: [ 'm' ]
        - key: karpenter.k8s.aws/instance-generation
          operator: In
          values: [ '5' ]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: [ 'large' ]
        - key: topology.kubernetes.io/zone
          operator: In
          values: [ 'region-1a', 'region-1b', 'region-1c' ]
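
The NodePool references an EC2NodeClass named default; a minimal sketch of one (the role name and discovery tags below are placeholders for whatever the cluster actually uses) would look something like:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  role: KarpenterNodeRole-my-cluster        # placeholder
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster  # placeholder
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster  # placeholder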

A simple Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: default
  labels:
    app: nginx
spec:
  replicas: 50
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: public.ecr.aws/nginx/nginx:stable
          resources:
            requests:
              cpu: 1000m
              memory: 50Mi
      nodeSelector:
        karpenter.sh/capacity-type: spot
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: node.kubernetes.io/instance-type
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: nginx

Key parts of the deployment are:

  • Each pod will require its own machine (2 vCPU large machines vs. cpu: 1000m requests plus daemonset overhead, so only one pod fits per node)
  • We ensure it's on spot (just to keep cost down for the experiment)
  • And most importantly, we use topologyKey: node.kubernetes.io/instance-type for the topologySpreadConstraints.

Steps:

  1. Scale up the deployment to 50 replicas.
  2. Observe that only a few pods get scheduled, and the rest have Karpenter messages highlighting the issue: node.kubernetes.io/instance-type In [x2iedn.2xlarge] <- a different instance-type is requested for every pod, as per the topologySpreadConstraints, regardless of what the NodePool allows. More examples: m6g.8xlarge, r5n.8xlarge, m7i.metal-48xl, r7g.metal, g3.8xlarge, and so on. Nodes in the cluster (with pending, unschedulable pods):
    kubectl get nodes -L node.kubernetes.io/instance-type --no-headers | awk '{print $6}' | sort | uniq -c | sort
      1
      1 m5ad.large
      1 m5a.large
      1 m5d.large
      1 m5.large
      1 m5zn.large
    
  3. If we remove the topologySpreadConstraints then Karpenter is able to bring up nodes, but they might all end up being the same type:
    kubectl get nodes -L node.kubernetes.io/instance-type --no-headers | awk '{print $6}' | sort | uniq -c | sort
      1
     50 m5a.large
    

Example Karpenter log message:

{
  "level": "ERROR",
  "time": "2024-08-07T00:01:37.468Z",
  "logger": "controller.provisioner",
  "message": "Could not schedule pod, incompatible with nodepool \"default\", daemonset overhead={\"cpu\":\"150m\",\"memory\":\"345Mi\",\"pods\":\"4\"}, no instance type satisfied resources {\"cpu\":\"1150m\",\"memory\":\"395Mi\",\"pods\":\"5\"} and requirements karpenter.k8s.aws/instance-category In [m], karpenter.k8s.aws/instance-generation In [5], karpenter.k8s.aws/instance-size In [large], karpenter.sh/capacity-type In [spot], karpenter.sh/nodepool In [default], kubernetes.io/arch In [amd64 arm64], node.kubernetes.io/instance-type In [x2iedn.2xlarge], topology.kubernetes.io/zone In [region-1a region-1b region-1c] (no instance type met all requirements)",
  "commit": "e719109",
  "pod": "default/nginx-57db5c4db8-6k9sr"
}

Can't think of any way to configure Karpenter/workloads to stop this situation, other than running some on-demand, but even that's not a perfect solution ($$$ aside).
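
For what it's worth, the "some on-demand" workaround could be sketched as a spread constraint over capacity type (assuming a roughly even spot/on-demand split is tolerable, which cost-wise it often isn't):

...
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: karpenter.sh/capacity-type
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: nginx

(In the repro above the spot nodeSelector would have to be removed for this to have any effect, since with it the spot domain is the only one considered.)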

Versions:

  • Chart Version:
    0.36.2
    
  • Kubernetes Version (kubectl version):
    Client Version: v1.30.3
    Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
    Server Version: v1.29.6-eks-db838b0
    
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

jtnz · Aug 07 '24 01:08