karpenter-provider-aws
Karpenter doesn't factor in existing node types when scaling up, causing production outages due to spot interruptions
Description
Observed Behavior: Karpenter is configured to choose from 100+ instance types, but several times in production Karpenter has ended up scaling on the same spot instance type, and we then lost ~90%+ of the cluster to spot interruptions.
We tried to use `topologySpreadConstraints` like so:
```yaml
...
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: node.kubernetes.io/instance-type
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      ...
```
The hope was that this would cause Karpenter to use different instance types. Unfortunately, this just causes Kubernetes to select from ALL the different instance types, resulting in pods that cannot be scheduled because the NodePool does not offer those types: Kubernetes is not aware of the NodePool and therefore of which node types it can actually use.
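(A quick way to observe this in a live cluster, using only standard kubectl, nothing Karpenter-specific:)

```
# pods that can never be placed once the instance-type spread constraint is added
kubectl get pods --field-selector=status.phase=Pending

# the scheduling events on such a pod show the single instance type Karpenter was asked for
kubectl describe pod <pending-pod>
```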
Expected Behavior: Karpenter will use diverse spot instance types, to prevent production outages when another customer requests all of one node type (the same type Karpenter is already using).
Reproduction Steps (Please include YAML): This is a simplified version of our production setup that reproduces the issue quickly. One `NodePool` that will have all `m5*.large` instance types available to use:
```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  limits:
    cpu: 100
  template:
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: [ 'amd64', 'arm64' ]
        - key: karpenter.sh/capacity-type
          operator: In
          values: [ 'spot', 'on-demand' ]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: [ 'm' ]
        - key: karpenter.k8s.aws/instance-generation
          operator: In
          values: [ '5' ]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: [ 'large' ]
        - key: topology.kubernetes.io/zone
          operator: In
          values: [ 'region-1a', 'region-1b', 'region-1c' ]
```
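As a sanity check (not part of the repro), the set of instance types those requirements resolve to can be listed with the AWS CLI; this assumes the CLI is configured for the cluster's region, and the exact list varies by region:

```
# list every m5-family .large instance type offered in the region
aws ec2 describe-instance-types \
  --filters "Name=instance-type,Values=m5*.large" \
  --query "InstanceTypes[].InstanceType" \
  --output text
```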
A simple `Deployment`:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: default
  labels:
    app: nginx
spec:
  replicas: 50
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: public.ecr.aws/nginx/nginx:stable
          resources:
            requests:
              cpu: 1000m
              memory: 50Mi
      nodeSelector:
        karpenter.sh/capacity-type: spot
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: node.kubernetes.io/instance-type
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: nginx
```
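The two manifests can be applied with plain kubectl; the file names below are hypothetical, and the `default` EC2NodeClass referenced by the NodePool is assumed to already exist:

```
# file names are hypothetical
kubectl apply -f nodepool.yaml
kubectl apply -f nginx-deployment.yaml

# the manifest already asks for 50 replicas; to scale explicitly:
kubectl scale deployment nginx -n default --replicas=50

# watch pods (most will stay Pending) and node provisioning
kubectl get pods -n default -w
```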
Key parts of the deployment are:
- Each pod will require its own machine (2 vCPU `large` machines vs. `cpu: 1000m` requests).
- We ensure it runs on spot (just to keep cost down for the experiment).
- Most importantly, we use `topologyKey: node.kubernetes.io/instance-type` for the `topologySpreadConstraints`.
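To spell out the arithmetic behind the first point: a `large` m5-family instance has 2 vCPU (2000m). With the ~150m of daemonset overhead shown in the Karpenter log below, roughly 1850m is left (before any kube-reserved deductions), which fits one 1000m nginx pod but never two, so every additional replica forces a new node.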
Steps:
- Scale up the deployment to 50 replicas (the kubectl commands after the Deployment manifest above will do this).
- Observe that only a few pods get scheduled, and the rest have karpenter messages highlighting the issue: `node.kubernetes.io/instance-type In [x2iedn.2xlarge]` <- Kubernetes will request pods for every different instance-type, as per the `topologySpreadConstraints`. More examples: `m6g.8xlarge`, `r5n.8xlarge`, `m7i.metal-48xl`, `r7g.metal`, `g3.8xlarge`, and so on. Nodes in cluster (with pending, unschedulable pods):

  ```
  kubectl get nodes -L node.kubernetes.io/instance-type --no-headers | awk '{print $6}' | sort | uniq -c | sort
     1
     1 m5ad.large
     1 m5a.large
     1 m5d.large
     1 m5.large
     1 m5zn.large
  ```

- If we remove the `topologySpreadConstraints` then Karpenter will be able to bring up nodes, but they might all end up the same type:

  ```
  kubectl get nodes -L node.kubernetes.io/instance-type --no-headers | awk '{print $6}' | sort | uniq -c | sort
    50 m5a.large
  ```
Example Karpenter log message:

```json
{
  "level": "ERROR",
  "time": "2024-08-07T00:01:37.468Z",
  "logger": "controller.provisioner",
  "message": "Could not schedule pod, incompatible with nodepool \"default\", daemonset overhead={\"cpu\":\"150m\",\"memory\":\"345Mi\",\"pods\":\"4\"}, no instance type satisfied resources {\"cpu\":\"1150m\",\"memory\":\"395Mi\",\"pods\":\"5\"} and requirements karpenter.k8s.aws/instance-category In [m], karpenter.k8s.aws/instance-generation In [5], karpenter.k8s.aws/instance-size In [large], karpenter.sh/capacity-type In [spot], karpenter.sh/nodepool In [default], kubernetes.io/arch In [amd64 arm64], node.kubernetes.io/instance-type In [x2iedn.2xlarge], topology.kubernetes.io/zone In [region-1a region-1b region-1c] (no instance type met all requirements)",
  "commit": "e719109",
  "pod": "default/nginx-57db5c4db8-6k9sr"
}
```
We can't think of any way to configure Karpenter/workloads to stop this situation, other than running some capacity on-demand, but even that's not a perfect solution ($$$ aside).
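For reference, the "run some on-demand" idea above can be expressed as a spread over `karpenter.sh/capacity-type`; unlike the instance-type spread, there are only two domains and the NodePool offers both, so pods stay schedulable. This is a hedged sketch, not a real fix: it only limits the blast radius of a spot reclaim, does nothing for instance-type diversity, and requires dropping the `nodeSelector` that pins the pods to spot (a larger `maxSkew` shifts more of the replicas onto spot):

```yaml
# hedged sketch: spread replicas across spot and on-demand rather than across instance types
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: karpenter.sh/capacity-type
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: nginx
```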
Versions:
- Chart Version: 0.36.2
- Kubernetes Version (`kubectl version`):

  ```
  Client Version: v1.30.3
  Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
  Server Version: v1.29.6-eks-db838b0
  ```