Karpenter consolidates nodes that are required by podAntiAffinity, then recreates them
Description
Observed Behavior:
Context
Cluster running:
- EKS v1.29.3
- Karpenter 0.36.1 on Fargate nodes
- CoreDNS on Fargate
- Kube-proxy as daemonset
- Cilium Agent as daemonset
- Cilium Operator running on Karpenter spot/on-demand nodes
Nothing else is running on the cluster.
NodePool for arm64 with spot and on-demand capacity, consolidating WhenUnderutilized:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
annotations:
karpenter.sh/nodepool-hash: "12393960163388511505"
karpenter.sh/nodepool-hash-version: v2
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"karpenter.sh/v1beta1","kind":"NodePool","metadata":{"annotations":{},"name":"arm64"},"spec":{"disruption":{"consolidationPolicy":"WhenUnderutilized"},"limits":{"cpu":100},"startupTaints":[{"effect":"NoExecute","key":"node.cilium.io/agent-not-ready","value":"true"}],"template":{"spec":{"nodeClassRef":{"name":"default"},"requirements":[{"key":"kubernetes.io/arch","operator":"In","values":["arm64"]},{"key":"kubernetes.io/os","operator":"In","values":["linux"]},{"key":"karpenter.sh/capacity-type","operator":"In","values":["spot","on-demand"]},{"key":"karpenter.k8s.aws/instance-category","operator":"In","values":["c","m","r"]},{"key":"karpenter.k8s.aws/instance-generation","operator":"Gt","values":["2"]},{"key":"karpenter.k8s.aws/instance-hypervisor","operator":"In","values":["nitro"]}]}},"weight":20}}
creationTimestamp: "2024-05-13T10:27:32Z"
generation: 1
name: arm64
resourceVersion: "2778670"
uid: 4c5ac6ca-981d-42aa-a1b3-897328d4c4ae
spec:
disruption:
budgets:
- nodes: 10%
consolidationPolicy: WhenUnderutilized
expireAfter: 720h
limits:
cpu: 100
template:
spec:
nodeClassRef:
name: default
requirements:
- key: kubernetes.io/arch
operator: In
values:
- arm64
- key: kubernetes.io/os
operator: In
values:
- linux
- key: karpenter.sh/capacity-type
operator: In
values:
- spot
- on-demand
- key: karpenter.k8s.aws/instance-category
operator: In
values:
- c
- m
- r
- key: karpenter.k8s.aws/instance-generation
operator: Gt
values:
- "2"
- key: karpenter.k8s.aws/instance-hypervisor
operator: In
values:
- nitro
weight: 20
status:
resources:
cpu: "2"
ephemeral-storage: 40828Mi
memory: 3747112Ki
pods: "16"
EC2NodeClass
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
annotations:
karpenter.k8s.aws/ec2nodeclass-hash: "12202273397587251031"
karpenter.k8s.aws/ec2nodeclass-hash-version: v2
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"karpenter.k8s.aws/v1beta1","kind":"EC2NodeClass","metadata":{"annotations":{},"name":"default"},"spec":{"amiFamily":"Bottlerocket","role":"Karpenter-my-cluster-20240505215019067700000012","securityGroupSelectorTerms":[{"tags":{"karpenter.sh/discovery":"my-cluster"}}],"subnetSelectorTerms":[{"tags":{"used_by":"workload"}}],"tags":{"karpenter.sh/discovery":"my-cluster"}}}
creationTimestamp: "2024-05-05T22:02:24Z"
finalizers:
- karpenter.k8s.aws/termination
generation: 3
name: default
resourceVersion: "2778684"
uid: 3ec49688-0dc8-4304-a98a-ab2487164114
spec:
amiFamily: Bottlerocket
metadataOptions:
httpEndpoint: enabled
httpProtocolIPv6: disabled
httpPutResponseHopLimit: 2
httpTokens: required
role: Karpenter-my-cluster-20240505215019067700000012
securityGroupSelectorTerms:
- tags:
karpenter.sh/discovery: my-cluster
subnetSelectorTerms:
- tags:
used_by: workload
tags:
karpenter.sh/discovery: my-cluster
status:
amis:
- id: ami-0bd2a487af33a3393
name: bottlerocket-aws-k8s-1.29-nvidia-x86_64-v1.19.5-64049ba8
requirements:
- key: kubernetes.io/arch
operator: In
values:
- amd64
- key: karpenter.k8s.aws/instance-gpu-count
operator: Exists
- id: ami-0bd2a487af33a3393
name: bottlerocket-aws-k8s-1.29-nvidia-x86_64-v1.19.5-64049ba8
requirements:
- key: kubernetes.io/arch
operator: In
values:
- amd64
- key: karpenter.k8s.aws/instance-accelerator-count
operator: Exists
- id: ami-073d119934a97d4af
name: bottlerocket-aws-k8s-1.29-nvidia-aarch64-v1.19.5-64049ba8
requirements:
- key: kubernetes.io/arch
operator: In
values:
- arm64
- key: karpenter.k8s.aws/instance-gpu-count
operator: Exists
- id: ami-073d119934a97d4af
name: bottlerocket-aws-k8s-1.29-nvidia-aarch64-v1.19.5-64049ba8
requirements:
- key: kubernetes.io/arch
operator: In
values:
- arm64
- key: karpenter.k8s.aws/instance-accelerator-count
operator: Exists
- id: ami-0e0aad57ffa366ea6
name: bottlerocket-aws-k8s-1.29-aarch64-v1.19.5-64049ba8
requirements:
- key: kubernetes.io/arch
operator: In
values:
- arm64
- key: karpenter.k8s.aws/instance-gpu-count
operator: DoesNotExist
- key: karpenter.k8s.aws/instance-accelerator-count
operator: DoesNotExist
- id: ami-014c64942edcf14c6
name: bottlerocket-aws-k8s-1.29-x86_64-v1.19.5-64049ba8
requirements:
- key: kubernetes.io/arch
operator: In
values:
- amd64
- key: karpenter.k8s.aws/instance-gpu-count
operator: DoesNotExist
- key: karpenter.k8s.aws/instance-accelerator-count
operator: DoesNotExist
instanceProfile: my-cluster_14408108118530556388
securityGroups:
- id: sg-010ff19951c82ae9f
name: my-cluster-node-20240505214129533800000001
- id: sg-0bdb4d959cbfb7951
name: my-cluster-cluster-20240505214129534200000002
- id: sg-0f946be9e2fd46d04
name: eks-cluster-sg-my-cluster-1909614887
subnets:
- id: subnet-05d32764d86645112
zone: eu-central-1b
- id: subnet-08a50cd900b173056
zone: eu-central-1c
- id: subnet-04e892f4c7dc778f6
zone: eu-central-1a
Issue 1: Launching a new deployment overshoots node creation, then consolidates too many nodes
When running a deployment with 2 replicas and a podAntiAffinity rule (requiredDuringSchedulingIgnoredDuringExecution on the kubernetes.io/hostname topology key), each of the two pods is scheduled on a different node. This works great, and 2 nodeclaims are made for 2 medium-sized spot nodes.
NAME TYPE ZONE NODE READY AGE CAPACITY NODEPOOL NODECLASS
arm64-6qln6 c6gn.medium eu-central-1c ip-10-2-170-180.eu-central-1.compute.internal True 5m22s spot arm64 default
arm64-7v8tq c6gn.medium eu-central-1c ip-10-2-162-12.eu-central-1.compute.internal True 3m14s spot arm64 default
The Deployment YAML responsible for the two nodes (the Cilium operator):
apiVersion: apps/v1
kind: Deployment
metadata:
annotations:
deployment.kubernetes.io/revision: "3"
meta.helm.sh/release-name: cilium
meta.helm.sh/release-namespace: cilium
creationTimestamp: "2024-05-13T15:11:40Z"
generation: 3
labels:
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: cilium-operator
app.kubernetes.io/part-of: cilium
io.cilium/app: operator
name: cilium-operator
name: cilium-operator
namespace: cilium
resourceVersion: "2760987"
uid: 7f7a096d-b961-4988-9fc2-ce9f2a2a0e96
spec:
progressDeadlineSeconds: 600
replicas: 2
revisionHistoryLimit: 10
selector:
matchLabels:
io.cilium/app: operator
name: cilium-operator
strategy:
rollingUpdate:
maxSurge: 25%
maxUnavailable: 50%
type: RollingUpdate
template:
metadata:
annotations:
prometheus.io/port: "9963"
prometheus.io/scrape: "true"
creationTimestamp: null
labels:
app.kubernetes.io/name: cilium-operator
app.kubernetes.io/part-of: cilium
io.cilium/app: operator
name: cilium-operator
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
io.cilium/app: operator
topologyKey: kubernetes.io/hostname
automountServiceAccountToken: true
containers:
- args:
- --config-dir=/tmp/cilium/config-map
- --debug=$(CILIUM_DEBUG)
command:
- cilium-operator-aws
env:
- name: K8S_NODE_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: spec.nodeName
- name: CILIUM_K8S_NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
- name: CILIUM_DEBUG
valueFrom:
configMapKeyRef:
key: debug
name: cilium-config
optional: true
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
key: AWS_ACCESS_KEY_ID
name: cilium-aws
optional: true
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
key: AWS_SECRET_ACCESS_KEY
name: cilium-aws
optional: true
- name: AWS_DEFAULT_REGION
valueFrom:
secretKeyRef:
key: AWS_DEFAULT_REGION
name: cilium-aws
optional: true
image: quay.io/cilium/operator-aws:v1.15.4@sha256:8675486ce8938333390c37302af162ebd12aaebc08eeeaf383bfb73128143fa9
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 3
httpGet:
host: 127.0.0.1
path: /healthz
port: 9234
scheme: HTTP
initialDelaySeconds: 60
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 3
name: cilium-operator
ports:
- containerPort: 9963
hostPort: 9963
name: prometheus
protocol: TCP
readinessProbe:
failureThreshold: 5
httpGet:
host: 127.0.0.1
path: /healthz
port: 9234
scheme: HTTP
periodSeconds: 5
successThreshold: 1
timeoutSeconds: 3
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: FallbackToLogsOnError
volumeMounts:
- mountPath: /tmp/cilium/config-map
name: cilium-config-path
readOnly: true
dnsPolicy: ClusterFirst
hostNetwork: true
nodeSelector:
kubernetes.io/os: linux
priorityClassName: system-cluster-critical
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: cilium-operator
serviceAccountName: cilium-operator
terminationGracePeriodSeconds: 30
tolerations:
- operator: Exists
volumes:
- configMap:
defaultMode: 420
name: cilium-config
name: cilium-config-path
status:
availableReplicas: 2
conditions:
- lastTransitionTime: "2024-05-13T15:11:40Z"
lastUpdateTime: "2024-05-13T15:21:13Z"
message: ReplicaSet "cilium-operator-786bd475cf" has successfully progressed.
reason: NewReplicaSetAvailable
status: "True"
type: Progressing
- lastTransitionTime: "2024-05-14T08:18:31Z"
lastUpdateTime: "2024-05-14T08:18:31Z"
message: Deployment has minimum availability.
reason: MinimumReplicasAvailable
status: "True"
type: Available
observedGeneration: 3
readyReplicas: 2
replicas: 2
updatedReplicas: 2
When creating an extra deployment of 1 replica with the node selector karpenter.sh/capacity-type: on-demand,
spinning up an Ubuntu image running /bin/sleep, a new node is created.
The deployment YAML
apiVersion: apps/v1
kind: Deployment
metadata:
annotations:
deployment.kubernetes.io/revision: "14"
creationTimestamp: "2024-05-13T08:39:02Z"
generation: 32
labels:
app: sleeperpods
name: sleeperpods
namespace: default
resourceVersion: "2789532"
uid: a4bccac9-4862-44e1-87ab-48c8f24c2308
spec:
progressDeadlineSeconds: 600
replicas: 1
revisionHistoryLimit: 10
selector:
matchLabels:
app: sleeperpods
strategy:
rollingUpdate:
maxSurge: 25%
maxUnavailable: 25%
type: RollingUpdate
template:
metadata:
creationTimestamp: null
labels:
app: sleeperpods
spec:
containers:
- command:
- /bin/sleep
- infinity
image: ubuntu:latest
imagePullPolicy: Always
name: ubuntu
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
dnsPolicy: ClusterFirst
nodeSelector:
karpenter.sh/capacity-type: on-demand
kubernetes.io/arch: arm64
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
terminationGracePeriodSeconds: 30
status:
conditions:
- lastTransitionTime: "2024-05-13T08:39:02Z"
lastUpdateTime: "2024-05-14T09:17:08Z"
message: ReplicaSet "sleeperpods-5f765c8bb5" has successfully progressed.
reason: NewReplicaSetAvailable
status: "True"
type: Progressing
- lastTransitionTime: "2024-05-14T09:58:47Z"
lastUpdateTime: "2024-05-14T09:58:47Z"
message: Deployment does not have minimum availability.
reason: MinimumReplicasUnavailable
status: "False"
type: Available
observedGeneration: 32
replicas: 1
unavailableReplicas: 1
updatedReplicas: 1
Karpenter creates two nodeclaims for the single pod that is created.
NAME TYPE ZONE NODE READY AGE CAPACITY NODEPOOL NODECLASS
arm64-6qln6 c6gn.medium eu-central-1c ip-10-2-170-180.eu-central-1.compute.internal True 13m spot arm64 default
arm64-7v8tq c6gn.medium eu-central-1c ip-10-2-162-12.eu-central-1.compute.internal True 11m spot arm64 default
arm64-hml7n c6g.medium eu-central-1b ip-10-2-98-10.eu-central-1.compute.internal True 57s on-demand arm64 default
arm64-z6clx c6g.medium eu-central-1a ip-10-2-10-19.eu-central-1.compute.internal False 27s on-demand arm64 default
When the second node is ready, Karpenter removes one of the spot nodes, as it should. However, it also removes the extra on-demand nodeclaim it had created. This is not a large issue, but the extra node was never needed in the first place.
NAME TYPE ZONE NODE READY AGE CAPACITY NODEPOOL NODECLASS
arm64-7v8tq c6gn.medium eu-central-1c ip-10-2-162-12.eu-central-1.compute.internal True 12m spot arm64 default
arm64-hml7n c6g.medium eu-central-1b ip-10-2-98-10.eu-central-1.compute.internal True 2m9s on-demand arm64 default
Then, Karpenter decides to consolidate the spot node into the on-demand node.
NAME TYPE ZONE NODE READY AGE CAPACITY NODEPOOL NODECLASS
arm64-hml7n c6g.medium eu-central-1b ip-10-2-98-10.eu-central-1.compute.internal True 2m45s on-demand arm64 default
Until Karpenter figures out that there was a workload (the second cilium-operator replica, blocked by its podAntiAffinity) that required the second node, and provisions a new spot node.
NAME TYPE ZONE NODE READY AGE CAPACITY NODEPOOL NODECLASS
arm64-hml7n c6g.medium eu-central-1b ip-10-2-98-10.eu-central-1.compute.internal True 10m on-demand arm64 default
arm64-jm7p7 c6gn.medium eu-central-1c ip-10-2-131-28.eu-central-1.compute.internal True 7m31s spot arm64 default
JSON Logs
{"level":"DEBUG","time":"2024-05-14T09:58:48.524Z","logger":"controller.provisioner","message":"63 out of 704 instance types were excluded because they would breach limits","commit":"fb4d75f","nodepool":"arm64"}
{"level":"INFO","time":"2024-05-14T09:58:48.627Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"default/sleeperpods-5f765c8bb5-wlqs9","duration":"181.667496ms"}
{"level":"INFO","time":"2024-05-14T09:58:48.627Z","logger":"controller.provisioner","message":"computed new nodeclaim(s) to fit pod(s)","commit":"fb4d75f","nodeclaims":1,"pods":1}
{"level":"INFO","time":"2024-05-14T09:58:48.736Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"fb4d75f","nodepool":"arm64","nodeclaim":"arm64-hml7n","requests":{"cpu":"200m","memory":"10Mi","pods":"3"},"instance-types":"c6g.2xlarge, c6g.4xlarge, c6g.large, c6g.medium, c6g.xlarge and 55 other(s)"}
{"level":"DEBUG","time":"2024-05-14T09:58:49.271Z","logger":"controller.nodeclaim.lifecycle","message":"created launch template","commit":"fb4d75f","nodeclaim":"arm64-hml7n","launch-template-name":"karpenter.k8s.aws/17069018361932924887","id":"lt-01e0ed1bba5c31746"}
{"level":"DEBUG","time":"2024-05-14T09:58:49.411Z","logger":"controller.nodeclaim.lifecycle","message":"created launch template","commit":"fb4d75f","nodeclaim":"arm64-hml7n","launch-template-name":"karpenter.k8s.aws/1769404244630248950","id":"lt-0f67c6cad010586cb"}
{"level":"DEBUG","time":"2024-05-14T09:58:49.555Z","logger":"controller.nodeclaim.lifecycle","message":"created launch template","commit":"fb4d75f","nodeclaim":"arm64-hml7n","launch-template-name":"karpenter.k8s.aws/901192418858500786","id":"lt-0c6a1fdda9dd9df37"}
{"level":"DEBUG","time":"2024-05-14T09:58:49.693Z","logger":"controller.nodeclaim.lifecycle","message":"created launch template","commit":"fb4d75f","nodeclaim":"arm64-hml7n","launch-template-name":"karpenter.k8s.aws/11959477876247669564","id":"lt-065fec8ac34c325a5"}
{"level":"INFO","time":"2024-05-14T09:58:51.700Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"fb4d75f","nodeclaim":"arm64-hml7n","provider-id":"aws:///eu-central-1b/i-0c33c9484ccbf1b09","instance-type":"c6g.medium","zone":"eu-central-1b","capacity-type":"on-demand","allocatable":{"cpu":"940m","ephemeral-storage":"17Gi","memory":"1392Mi","pods":"8","vpc.amazonaws.com/pod-eni":"4"}}
{"level":"DEBUG","time":"2024-05-14T09:58:55.080Z","logger":"controller.disruption","message":"discovered subnets","commit":"fb4d75f","subnets":["subnet-05d32764d86645112 (eu-central-1b)","subnet-04e892f4c7dc778f6 (eu-central-1a)","subnet-08a50cd900b173056 (eu-central-1c)"]}
{"level":"INFO","time":"2024-05-14T09:58:58.552Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"default/sleeperpods-5f765c8bb5-wlqs9","duration":"106.792627ms"}
{"level":"INFO","time":"2024-05-14T09:59:08.531Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"default/sleeperpods-5f765c8bb5-wlqs9","duration":"84.759512ms"}
{"level":"INFO","time":"2024-05-14T09:59:10.052Z","logger":"controller.nodeclaim.lifecycle","message":"registered nodeclaim","commit":"fb4d75f","nodeclaim":"arm64-hml7n","provider-id":"aws:///eu-central-1b/i-0c33c9484ccbf1b09","node":"ip-10-2-98-10.eu-central-1.compute.internal"}
{"level":"DEBUG","time":"2024-05-14T09:59:18.527Z","logger":"controller.provisioner","message":"63 out of 704 instance types were excluded because they would breach limits","commit":"fb4d75f","nodepool":"arm64"}
{"level":"INFO","time":"2024-05-14T09:59:18.539Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"default/sleeperpods-5f765c8bb5-wlqs9","duration":"92.192936ms"}
{"level":"INFO","time":"2024-05-14T09:59:18.539Z","logger":"controller.provisioner","message":"computed new nodeclaim(s) to fit pod(s)","commit":"fb4d75f","nodeclaims":1,"pods":1}
{"level":"INFO","time":"2024-05-14T09:59:18.550Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"fb4d75f","nodepool":"arm64","nodeclaim":"arm64-z6clx","requests":{"cpu":"200m","memory":"10Mi","pods":"3"},"instance-types":"c6g.2xlarge, c6g.4xlarge, c6g.large, c6g.medium, c6g.xlarge and 55 other(s)"}
{"level":"INFO","time":"2024-05-14T09:59:20.571Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"fb4d75f","nodeclaim":"arm64-z6clx","provider-id":"aws:///eu-central-1a/i-0f5a68d2ae36d681d","instance-type":"c6g.medium","zone":"eu-central-1a","capacity-type":"on-demand","allocatable":{"cpu":"940m","ephemeral-storage":"17Gi","memory":"1392Mi","pods":"8","vpc.amazonaws.com/pod-eni":"4"}}
{"level":"INFO","time":"2024-05-14T09:59:28.527Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"default/sleeperpods-5f765c8bb5-wlqs9","duration":"78.532912ms"}
{"level":"INFO","time":"2024-05-14T09:59:32.089Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"fb4d75f","nodeclaim":"arm64-hml7n","provider-id":"aws:///eu-central-1b/i-0c33c9484ccbf1b09","node":"ip-10-2-98-10.eu-central-1.compute.internal","allocatable":{"cpu":"940m","ephemeral-storage":"18191325562","hugepages-1Gi":"0","hugepages-2Mi":"0","hugepages-32Mi":"0","hugepages-64Ki":"0","memory":"1419924Ki","pods":"8"}}
{"level":"INFO","time":"2024-05-14T09:59:40.360Z","logger":"controller.nodeclaim.lifecycle","message":"registered nodeclaim","commit":"fb4d75f","nodeclaim":"arm64-z6clx","provider-id":"aws:///eu-central-1a/i-0f5a68d2ae36d681d","node":"ip-10-2-10-19.eu-central-1.compute.internal"}
{"level":"DEBUG","time":"2024-05-14T09:59:51.534Z","logger":"controller.disruption","message":"abandoning empty node consolidation attempt due to pod churn, command is no longer valid, delete, terminating 1 nodes (0 pods) ip-10-2-98-10.eu-central-1.compute.internal/c6g.medium/on-demand","commit":"fb4d75f"}
{"level":"DEBUG","time":"2024-05-14T09:59:56.131Z","logger":"controller.nodeclaim.disruption","message":"discovered subnets","commit":"fb4d75f","nodeclaim":"arm64-z6clx","subnets":["subnet-08a50cd900b173056 (eu-central-1c)","subnet-05d32764d86645112 (eu-central-1b)","subnet-04e892f4c7dc778f6 (eu-central-1a)"]}
{"level":"INFO","time":"2024-05-14T10:00:03.905Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"fb4d75f","nodeclaim":"arm64-z6clx","provider-id":"aws:///eu-central-1a/i-0f5a68d2ae36d681d","node":"ip-10-2-10-19.eu-central-1.compute.internal","allocatable":{"cpu":"940m","ephemeral-storage":"18191325562","hugepages-1Gi":"0","hugepages-2Mi":"0","hugepages-32Mi":"0","hugepages-64Ki":"0","memory":"1419924Ki","pods":"8"}}
{"level":"INFO","time":"2024-05-14T10:00:06.937Z","logger":"controller.disruption","message":"disrupting via consolidation delete, terminating 1 nodes (1 pods) ip-10-2-170-180.eu-central-1.compute.internal/c6gn.medium/spot","commit":"fb4d75f","command-id":"0c07b1df-f22e-45bd-a6e6-1a57adadac77"}
{"level":"INFO","time":"2024-05-14T10:00:07.177Z","logger":"controller.disruption.queue","message":"command succeeded","commit":"fb4d75f","command-id":"0c07b1df-f22e-45bd-a6e6-1a57adadac77"}
{"level":"INFO","time":"2024-05-14T10:00:07.231Z","logger":"controller.node.termination","message":"tainted node","commit":"fb4d75f","node":"ip-10-2-170-180.eu-central-1.compute.internal"}
{"level":"INFO","time":"2024-05-14T10:00:07.247Z","logger":"controller.node.termination","message":"deleted node","commit":"fb4d75f","node":"ip-10-2-170-180.eu-central-1.compute.internal"}
{"level":"INFO","time":"2024-05-14T10:00:07.668Z","logger":"controller.nodeclaim.termination","message":"deleted nodeclaim","commit":"fb4d75f","nodeclaim":"arm64-6qln6","node":"ip-10-2-170-180.eu-central-1.compute.internal","provider-id":"aws:///eu-central-1c/i-018ce2089c5040a4e"}
{"level":"INFO","time":"2024-05-14T10:00:32.041Z","logger":"controller.disruption","message":"disrupting via consolidation delete, terminating 1 nodes (0 pods) ip-10-2-10-19.eu-central-1.compute.internal/c6g.medium/on-demand","commit":"fb4d75f","command-id":"e1d993f4-ada1-4a7e-8b21-7e8107aa579a"}
{"level":"INFO","time":"2024-05-14T10:00:32.325Z","logger":"controller.disruption.queue","message":"command succeeded","commit":"fb4d75f","command-id":"e1d993f4-ada1-4a7e-8b21-7e8107aa579a"}
{"level":"INFO","time":"2024-05-14T10:00:32.523Z","logger":"controller.node.termination","message":"tainted node","commit":"fb4d75f","node":"ip-10-2-10-19.eu-central-1.compute.internal"}
{"level":"INFO","time":"2024-05-14T10:00:32.624Z","logger":"controller.node.termination","message":"deleted node","commit":"fb4d75f","node":"ip-10-2-10-19.eu-central-1.compute.internal"}
{"level":"INFO","time":"2024-05-14T10:00:33.048Z","logger":"controller.nodeclaim.termination","message":"deleted nodeclaim","commit":"fb4d75f","nodeclaim":"arm64-z6clx","node":"ip-10-2-10-19.eu-central-1.compute.internal","provider-id":"aws:///eu-central-1a/i-0f5a68d2ae36d681d"}
{"level":"DEBUG","time":"2024-05-14T10:01:00.646Z","logger":"controller.disruption","message":"discovered subnets","commit":"fb4d75f","subnets":["subnet-05d32764d86645112 (eu-central-1b)","subnet-04e892f4c7dc778f6 (eu-central-1a)","subnet-08a50cd900b173056 (eu-central-1c)"]}
{"level":"INFO","time":"2024-05-14T10:01:00.934Z","logger":"controller.disruption","message":"disrupting via consolidation delete, terminating 1 nodes (1 pods) ip-10-2-162-12.eu-central-1.compute.internal/c6gn.medium/spot","commit":"fb4d75f","command-id":"9e2e395d-93ce-4a51-b4ab-3efde9eb2c5d"}
{"level":"INFO","time":"2024-05-14T10:01:01.449Z","logger":"controller.disruption.queue","message":"command succeeded","commit":"fb4d75f","command-id":"9e2e395d-93ce-4a51-b4ab-3efde9eb2c5d"}
{"level":"INFO","time":"2024-05-14T10:01:01.484Z","logger":"controller.node.termination","message":"tainted node","commit":"fb4d75f","node":"ip-10-2-162-12.eu-central-1.compute.internal"}
{"level":"INFO","time":"2024-05-14T10:01:01.510Z","logger":"controller.node.termination","message":"deleted node","commit":"fb4d75f","node":"ip-10-2-162-12.eu-central-1.compute.internal"}
{"level":"INFO","time":"2024-05-14T10:01:01.950Z","logger":"controller.nodeclaim.termination","message":"deleted nodeclaim","commit":"fb4d75f","nodeclaim":"arm64-7v8tq","node":"ip-10-2-162-12.eu-central-1.compute.internal","provider-id":"aws:///eu-central-1c/i-0d78f4210dbc5b344"}
{"level":"DEBUG","time":"2024-05-14T10:01:04.731Z","logger":"controller","message":"deleted launch template","commit":"fb4d75f","id":"lt-01e0ed1bba5c31746","name":"karpenter.k8s.aws/17069018361932924887"}
{"level":"DEBUG","time":"2024-05-14T10:01:04.821Z","logger":"controller","message":"deleted launch template","commit":"fb4d75f","id":"lt-0f67c6cad010586cb","name":"karpenter.k8s.aws/1769404244630248950"}
{"level":"DEBUG","time":"2024-05-14T10:01:04.934Z","logger":"controller","message":"deleted launch template","commit":"fb4d75f","id":"lt-0c6a1fdda9dd9df37","name":"karpenter.k8s.aws/901192418858500786"}
{"level":"DEBUG","time":"2024-05-14T10:01:05.028Z","logger":"controller","message":"deleted launch template","commit":"fb4d75f","id":"lt-065fec8ac34c325a5","name":"karpenter.k8s.aws/11959477876247669564"}
{"level":"DEBUG","time":"2024-05-14T10:01:45.832Z","logger":"controller.provisioner","message":"63 out of 704 instance types were excluded because they would breach limits","commit":"fb4d75f","nodepool":"arm64"}
{"level":"INFO","time":"2024-05-14T10:01:45.846Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"cilium/cilium-operator-786bd475cf-pkffm","duration":"70.542187ms"}
{"level":"INFO","time":"2024-05-14T10:01:45.846Z","logger":"controller.provisioner","message":"computed new nodeclaim(s) to fit pod(s)","commit":"fb4d75f","nodeclaims":1,"pods":1}
{"level":"INFO","time":"2024-05-14T10:01:45.859Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"fb4d75f","nodepool":"arm64","nodeclaim":"arm64-jm7p7","requests":{"cpu":"200m","memory":"10Mi","pods":"3"},"instance-types":"c6g.2xlarge, c6g.4xlarge, c6g.large, c6g.medium, c6g.xlarge and 55 other(s)"}
{"level":"DEBUG","time":"2024-05-14T10:01:46.095Z","logger":"controller.nodeclaim.lifecycle","message":"created launch template","commit":"fb4d75f","nodeclaim":"arm64-jm7p7","launch-template-name":"karpenter.k8s.aws/15606164045452257729","id":"lt-09c8264473b436381"}
{"level":"DEBUG","time":"2024-05-14T10:01:46.384Z","logger":"controller.nodeclaim.lifecycle","message":"created launch template","commit":"fb4d75f","nodeclaim":"arm64-jm7p7","launch-template-name":"karpenter.k8s.aws/17404147073338961046","id":"lt-0aeeacaba51557d60"}
{"level":"INFO","time":"2024-05-14T10:01:48.408Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"fb4d75f","nodeclaim":"arm64-jm7p7","provider-id":"aws:///eu-central-1c/i-045a9ed0b25b46530","instance-type":"c6gn.medium","zone":"eu-central-1c","capacity-type":"spot","allocatable":{"cpu":"940m","ephemeral-storage":"17Gi","memory":"1392Mi","pods":"8","vpc.amazonaws.com/pod-eni":"4"}}
{"level":"INFO","time":"2024-05-14T10:01:55.831Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"cilium/cilium-operator-786bd475cf-pkffm","duration":"54.386691ms"}
{"level":"DEBUG","time":"2024-05-14T10:02:02.391Z","logger":"controller.disruption","message":"discovered subnets","commit":"fb4d75f","subnets":["subnet-05d32764d86645112 (eu-central-1b)","subnet-04e892f4c7dc778f6 (eu-central-1a)","subnet-08a50cd900b173056 (eu-central-1c)"]}
{"level":"INFO","time":"2024-05-14T10:02:05.828Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"cilium/cilium-operator-786bd475cf-pkffm","duration":"50.863908ms"}
{"level":"INFO","time":"2024-05-14T10:02:07.719Z","logger":"controller.nodeclaim.lifecycle","message":"registered nodeclaim","commit":"fb4d75f","nodeclaim":"arm64-jm7p7","provider-id":"aws:///eu-central-1c/i-045a9ed0b25b46530","node":"ip-10-2-131-28.eu-central-1.compute.internal"}
{"level":"INFO","time":"2024-05-14T10:02:31.333Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"fb4d75f","nodeclaim":"arm64-jm7p7","provider-id":"aws:///eu-central-1c/i-045a9ed0b25b46530","node":"ip-10-2-131-28.eu-central-1.compute.internal","allocatable":{"cpu":"940m","ephemeral-storage":"18191325562","hugepages-1Gi":"0","hugepages-2Mi":"0","hugepages-32Mi":"0","hugepages-64Ki":"0","memory":"1419924Ki","pods":"8"}}
{"level":"DEBUG","time":"2024-05-14T10:03:03.139Z","logger":"controller.disruption","message":"discovered subnets","commit":"fb4d75f","subnets":["subnet-05d32764d86645112 (eu-central-1b)","subnet-04e892f4c7dc778f6 (eu-central-1a)","subnet-08a50cd900b173056 (eu-central-1c)"]}
{"level":"DEBUG","time":"2024-05-14T10:03:04.732Z","logger":"controller","message":"deleted launch template","commit":"fb4d75f","id":"lt-09c8264473b436381","name":"karpenter.k8s.aws/15606164045452257729"}
{"level":"DEBUG","time":"2024-05-14T10:03:04.826Z","logger":"controller","message":"deleted launch template","commit":"fb4d75f","id":"lt-0aeeacaba51557d60","name":"karpenter.k8s.aws/17404147073338961046"}
Text Logs
2024-05-14T09:58:48.524Z DEBUG 63 out of 704 instance types were excluded because they would breach limits
2024-05-14T09:58:48.627Z INFO found provisionable pod(s)
2024-05-14T09:58:48.627Z INFO computed new nodeclaim(s) to fit pod(s)
2024-05-14T09:58:48.736Z INFO created nodeclaim
2024-05-14T09:58:49.271Z DEBUG created launch template
2024-05-14T09:58:49.411Z DEBUG created launch template
2024-05-14T09:58:49.555Z DEBUG created launch template
2024-05-14T09:58:49.693Z DEBUG created launch template
2024-05-14T09:58:51.700Z INFO launched nodeclaim
2024-05-14T09:58:55.080Z DEBUG discovered subnets
2024-05-14T09:58:58.552Z INFO found provisionable pod(s)
2024-05-14T09:59:08.531Z INFO found provisionable pod(s)
2024-05-14T09:59:10.052Z INFO registered nodeclaim
2024-05-14T09:59:18.527Z DEBUG 63 out of 704 instance types were excluded because they would breach limits
2024-05-14T09:59:18.539Z INFO found provisionable pod(s)
2024-05-14T09:59:18.539Z INFO computed new nodeclaim(s) to fit pod(s)
2024-05-14T09:59:18.550Z INFO created nodeclaim
2024-05-14T09:59:20.571Z INFO launched nodeclaim
2024-05-14T09:59:28.527Z INFO found provisionable pod(s)
2024-05-14T09:59:32.089Z INFO initialized nodeclaim
2024-05-14T09:59:40.360Z INFO registered nodeclaim
2024-05-14T09:59:51.534Z DEBUG abandoning empty node consolidation attempt due to pod churn, command is no longer valid, delete, terminating 1 nodes (0 pods) ip-10-2-98-10.eu-central-1.compute.internal/c6g.medium/on-demand
2024-05-14T09:59:56.131Z DEBUG discovered subnets
2024-05-14T10:00:03.905Z INFO initialized nodeclaim
2024-05-14T10:00:06.937Z INFO disrupting via consolidation delete, terminating 1 nodes (1 pods) ip-10-2-170-180.eu-central-1.compute.internal/c6gn.medium/spot
2024-05-14T10:00:07.177Z INFO command succeeded
2024-05-14T10:00:07.231Z INFO tainted node
2024-05-14T10:00:07.247Z INFO deleted node
2024-05-14T10:00:07.668Z INFO deleted nodeclaim
2024-05-14T10:00:32.041Z INFO disrupting via consolidation delete, terminating 1 nodes (0 pods) ip-10-2-10-19.eu-central-1.compute.internal/c6g.medium/on-demand
2024-05-14T10:00:32.325Z INFO command succeeded
2024-05-14T10:00:32.523Z INFO tainted node
2024-05-14T10:00:32.624Z INFO deleted node
2024-05-14T10:00:33.048Z INFO deleted nodeclaim
2024-05-14T10:01:00.646Z DEBUG discovered subnets
2024-05-14T10:01:00.934Z INFO disrupting via consolidation delete, terminating 1 nodes (1 pods) ip-10-2-162-12.eu-central-1.compute.internal/c6gn.medium/spot
2024-05-14T10:01:01.449Z INFO command succeeded
2024-05-14T10:01:01.484Z INFO tainted node
2024-05-14T10:01:01.510Z INFO deleted node
2024-05-14T10:01:01.950Z INFO deleted nodeclaim
2024-05-14T10:01:04.731Z DEBUG deleted launch template
2024-05-14T10:01:04.821Z DEBUG deleted launch template
2024-05-14T10:01:04.934Z DEBUG deleted launch template
2024-05-14T10:01:05.028Z DEBUG deleted launch template
2024-05-14T10:01:45.832Z DEBUG 63 out of 704 instance types were excluded because they would breach limits
2024-05-14T10:01:45.846Z INFO found provisionable pod(s)
2024-05-14T10:01:45.846Z INFO computed new nodeclaim(s) to fit pod(s)
2024-05-14T10:01:45.859Z INFO created nodeclaim
2024-05-14T10:01:46.095Z DEBUG created launch template
2024-05-14T10:01:46.384Z DEBUG created launch template
2024-05-14T10:01:48.408Z INFO launched nodeclaim
2024-05-14T10:01:55.831Z INFO found provisionable pod(s)
2024-05-14T10:02:02.391Z DEBUG discovered subnets
2024-05-14T10:02:05.828Z INFO found provisionable pod(s)
2024-05-14T10:02:07.719Z INFO registered nodeclaim
2024-05-14T10:02:31.333Z INFO initialized nodeclaim
2024-05-14T10:03:03.139Z DEBUG discovered subnets
2024-05-14T10:03:04.732Z DEBUG deleted launch template
2024-05-14T10:03:04.826Z DEBUG deleted launch template
Issue 2: Deleting the on-demand workload overshoots node consolidation
When scaling the sleeper pod down, Karpenter wants to replace the on-demand node with a spot node. That is good, but it deletes too many nodes before realizing it still needed one of them.
Starting from the stable scenario of one on-demand node and one spot node.
NAME TYPE ZONE NODE READY AGE CAPACITY NODEPOOL NODECLASS
arm64-hml7n c6g.medium eu-central-1b ip-10-2-98-10.eu-central-1.compute.internal True 19m on-demand arm64 default
arm64-jm7p7 c6gn.medium eu-central-1c ip-10-2-131-28.eu-central-1.compute.internal True 16m spot arm64 default
Karpenter prepares the spot replacement node.
NAME TYPE ZONE NODE READY AGE CAPACITY NODEPOOL NODECLASS
arm64-9p68m c6gn.medium eu-central-1c False 7s spot arm64 default
arm64-hml7n c6g.medium eu-central-1b ip-10-2-98-10.eu-central-1.compute.internal True 20m on-demand arm64 default
arm64-jm7p7 c6gn.medium eu-central-1c ip-10-2-131-28.eu-central-1.compute.internal True 17m spot arm64 default
And then removes the on-demand node.
NAME TYPE ZONE NODE READY AGE CAPACITY NODEPOOL NODECLASS
arm64-9p68m c6gn.medium eu-central-1c ip-10-2-168-124.eu-central-1.compute.internal True 49s spot arm64 default
arm64-jm7p7 c6gn.medium eu-central-1c ip-10-2-131-28.eu-central-1.compute.internal True 18m spot arm64 default
Then Karpenter decides it should consolidate the other spot node into the one it just created.
NAME TYPE ZONE NODE READY AGE CAPACITY NODEPOOL NODECLASS
arm64-9p68m c6gn.medium eu-central-1c ip-10-2-168-124.eu-central-1.compute.internal True 71s spot arm64 default
Until Karpenter figures out that there was a workload that required a second spot node.
NAME TYPE ZONE NODE READY AGE CAPACITY NODEPOOL NODECLASS
arm64-72m72 c6gn.medium eu-central-1c False 13s spot arm64 default
arm64-9p68m c6gn.medium eu-central-1c ip-10-2-168-124.eu-central-1.compute.internal True 2m18s spot arm64 default
Then it finally settles down to two spot nodes.
NAME TYPE ZONE NODE READY AGE CAPACITY NODEPOOL NODECLASS
arm64-72m72 c6gn.medium eu-central-1c ip-10-2-160-121.eu-central-1.compute.internal True 52s spot arm64 default
arm64-9p68m c6gn.medium eu-central-1c ip-10-2-168-124.eu-central-1.compute.internal True 2m57s spot arm64 default
JSON Logs
{"level":"INFO","time":"2024-05-14T10:19:01.751Z","logger":"controller.disruption","message":"disrupting via consolidation replace, terminating 1 nodes (1 pods) ip-10-2-98-10.eu-central-1.compute.internal/c6g.medium/on-demand and replacing with spot node from types c6gn.medium, c6g.medium, c7g.medium, c7gd.medium, c6gd.medium and 11 other(s)","commit":"fb4d75f","command-id":"282d341d-b7b6-49ad-97de-0678d046669f"}
{"level":"INFO","time":"2024-05-14T10:19:01.785Z","logger":"controller.disruption","message":"created nodeclaim","commit":"fb4d75f","nodepool":"arm64","nodeclaim":"arm64-9p68m","requests":{"cpu":"200m","memory":"10Mi","pods":"3"},"instance-types":"c6g.large, c6g.medium, c6gd.medium, c6gn.medium, c7g.medium and 11 other(s)"}
{"level":"DEBUG","time":"2024-05-14T10:19:01.785Z","logger":"controller.disruption","message":"waiting on cluster sync","commit":"fb4d75f"}
{"level":"DEBUG","time":"2024-05-14T10:19:02.116Z","logger":"controller.nodeclaim.lifecycle","message":"created launch template","commit":"fb4d75f","nodeclaim":"arm64-9p68m","launch-template-name":"karpenter.k8s.aws/15606164045452257729","id":"lt-015eca1570b6ad05f"}
{"level":"DEBUG","time":"2024-05-14T10:19:02.310Z","logger":"controller.nodeclaim.lifecycle","message":"created launch template","commit":"fb4d75f","nodeclaim":"arm64-9p68m","launch-template-name":"karpenter.k8s.aws/17404147073338961046","id":"lt-0ad46f50d1805e1a9"}
{"level":"DEBUG","time":"2024-05-14T10:19:02.777Z","logger":"controller.provisioner","message":"waiting on cluster sync","commit":"fb4d75f"}
{"level":"DEBUG","time":"2024-05-14T10:19:02.786Z","logger":"controller.disruption","message":"waiting on cluster sync","commit":"fb4d75f"}
{"level":"DEBUG","time":"2024-05-14T10:19:03.788Z","logger":"controller.disruption","message":"waiting on cluster sync","commit":"fb4d75f"}
{"level":"INFO","time":"2024-05-14T10:19:04.668Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"fb4d75f","nodeclaim":"arm64-9p68m","provider-id":"aws:///eu-central-1c/i-0e148e875e0fcfb79","instance-type":"c6gn.medium","zone":"eu-central-1c","capacity-type":"spot","allocatable":{"cpu":"940m","ephemeral-storage":"17Gi","memory":"1392Mi","pods":"8","vpc.amazonaws.com/pod-eni":"4"}}
{"level":"INFO","time":"2024-05-14T10:19:12.825Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"cilium/cilium-operator-786bd475cf-2q8mn","duration":"49.110371ms"}
{"level":"DEBUG","time":"2024-05-14T10:19:22.324Z","logger":"controller.nodeclaim.garbagecollection","message":"discovered subnets","commit":"fb4d75f","subnets":["subnet-05d32764d86645112 (eu-central-1b)","subnet-04e892f4c7dc778f6 (eu-central-1a)","subnet-08a50cd900b173056 (eu-central-1c)"]}
{"level":"INFO","time":"2024-05-14T10:19:22.746Z","logger":"controller.nodeclaim.lifecycle","message":"registered nodeclaim","commit":"fb4d75f","nodeclaim":"arm64-9p68m","provider-id":"aws:///eu-central-1c/i-0e148e875e0fcfb79","node":"ip-10-2-168-124.eu-central-1.compute.internal"}
{"level":"INFO","time":"2024-05-14T10:19:22.926Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"cilium/cilium-operator-786bd475cf-2q8mn","duration":"150.300724ms"}
{"level":"INFO","time":"2024-05-14T10:19:32.825Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"cilium/cilium-operator-786bd475cf-2q8mn","duration":"48.333448ms"}
{"level":"INFO","time":"2024-05-14T10:19:42.829Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"cilium/cilium-operator-786bd475cf-2q8mn","duration":"50.973737ms"}
{"level":"INFO","time":"2024-05-14T10:19:44.546Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"fb4d75f","nodeclaim":"arm64-9p68m","provider-id":"aws:///eu-central-1c/i-0e148e875e0fcfb79","node":"ip-10-2-168-124.eu-central-1.compute.internal","allocatable":{"cpu":"940m","ephemeral-storage":"18191325562","hugepages-1Gi":"0","hugepages-2Mi":"0","hugepages-32Mi":"0","hugepages-64Ki":"0","memory":"1419924Ki","pods":"8"}}
{"level":"INFO","time":"2024-05-14T10:19:48.167Z","logger":"controller.disruption.queue","message":"command succeeded","commit":"fb4d75f","command-id":"282d341d-b7b6-49ad-97de-0678d046669f"}
{"level":"INFO","time":"2024-05-14T10:19:48.204Z","logger":"controller.node.termination","message":"tainted node","commit":"fb4d75f","node":"ip-10-2-98-10.eu-central-1.compute.internal"}
{"level":"INFO","time":"2024-05-14T10:19:48.230Z","logger":"controller.node.termination","message":"deleted node","commit":"fb4d75f","node":"ip-10-2-98-10.eu-central-1.compute.internal"}
{"level":"INFO","time":"2024-05-14T10:19:48.716Z","logger":"controller.nodeclaim.termination","message":"deleted nodeclaim","commit":"fb4d75f","nodeclaim":"arm64-hml7n","node":"ip-10-2-98-10.eu-central-1.compute.internal","provider-id":"aws:///eu-central-1b/i-0c33c9484ccbf1b09"}
{"level":"DEBUG","time":"2024-05-14T10:20:04.758Z","logger":"controller","message":"deleted launch template","commit":"fb4d75f","id":"lt-015eca1570b6ad05f","name":"karpenter.k8s.aws/15606164045452257729"}
{"level":"DEBUG","time":"2024-05-14T10:20:04.856Z","logger":"controller","message":"deleted launch template","commit":"fb4d75f","id":"lt-0ad46f50d1805e1a9","name":"karpenter.k8s.aws/17404147073338961046"}
{"level":"INFO","time":"2024-05-14T10:20:10.060Z","logger":"controller.disruption","message":"disrupting via consolidation delete, terminating 1 nodes (1 pods) ip-10-2-131-28.eu-central-1.compute.internal/c6gn.medium/spot","commit":"fb4d75f","command-id":"1e2ecb8f-e18c-4a02-aea7-4deeef32260c"}
{"level":"INFO","time":"2024-05-14T10:20:10.252Z","logger":"controller.disruption.queue","message":"command succeeded","commit":"fb4d75f","command-id":"1e2ecb8f-e18c-4a02-aea7-4deeef32260c"}
{"level":"INFO","time":"2024-05-14T10:20:10.290Z","logger":"controller.node.termination","message":"tainted node","commit":"fb4d75f","node":"ip-10-2-131-28.eu-central-1.compute.internal"}
{"level":"INFO","time":"2024-05-14T10:20:10.307Z","logger":"controller.node.termination","message":"deleted node","commit":"fb4d75f","node":"ip-10-2-131-28.eu-central-1.compute.internal"}
{"level":"INFO","time":"2024-05-14T10:20:10.695Z","logger":"controller.nodeclaim.termination","message":"deleted nodeclaim","commit":"fb4d75f","nodeclaim":"arm64-jm7p7","node":"ip-10-2-131-28.eu-central-1.compute.internal","provider-id":"aws:///eu-central-1c/i-045a9ed0b25b46530"}
{"level":"DEBUG","time":"2024-05-14T10:20:24.981Z","logger":"controller.nodeclass","message":"discovered subnets","commit":"fb4d75f","ec2nodeclass":"default","subnets":["subnet-05d32764d86645112 (eu-central-1b)","subnet-04e892f4c7dc778f6 (eu-central-1a)","subnet-08a50cd900b173056 (eu-central-1c)"]}
{"level":"DEBUG","time":"2024-05-14T10:20:45.143Z","logger":"controller.disruption","message":"abandoning empty node consolidation attempt due to pod churn, command is no longer valid, delete, terminating 1 nodes (0 pods) ip-10-2-168-124.eu-central-1.compute.internal/c6gn.medium/spot","commit":"fb4d75f"}
{"level":"DEBUG","time":"2024-05-14T10:21:06.525Z","logger":"controller.provisioner","message":"63 out of 704 instance types were excluded because they would breach limits","commit":"fb4d75f","nodepool":"arm64"}
{"level":"INFO","time":"2024-05-14T10:21:06.828Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"cilium/cilium-operator-786bd475cf-sz45c","duration":"588.931561ms"}
{"level":"INFO","time":"2024-05-14T10:21:06.829Z","logger":"controller.provisioner","message":"computed new nodeclaim(s) to fit pod(s)","commit":"fb4d75f","nodeclaims":1,"pods":1}
{"level":"INFO","time":"2024-05-14T10:21:06.842Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"fb4d75f","nodepool":"arm64","nodeclaim":"arm64-72m72","requests":{"cpu":"200m","memory":"10Mi","pods":"3"},"instance-types":"c6g.2xlarge, c6g.4xlarge, c6g.large, c6g.medium, c6g.xlarge and 55 other(s)"}
{"level":"DEBUG","time":"2024-05-14T10:21:07.057Z","logger":"controller.nodeclaim.lifecycle","message":"created launch template","commit":"fb4d75f","nodeclaim":"arm64-72m72","launch-template-name":"karpenter.k8s.aws/17404147073338961046","id":"lt-0dfc2fa3d5b427f10"}
{"level":"DEBUG","time":"2024-05-14T10:21:07.206Z","logger":"controller.nodeclaim.lifecycle","message":"created launch template","commit":"fb4d75f","nodeclaim":"arm64-72m72","launch-template-name":"karpenter.k8s.aws/15606164045452257729","id":"lt-0623b640f6e62bc5e"}
{"level":"INFO","time":"2024-05-14T10:21:09.141Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"fb4d75f","nodeclaim":"arm64-72m72","provider-id":"aws:///eu-central-1c/i-02e221f516dcfa08c","instance-type":"c6gn.medium","zone":"eu-central-1c","capacity-type":"spot","allocatable":{"cpu":"940m","ephemeral-storage":"17Gi","memory":"1392Mi","pods":"8","vpc.amazonaws.com/pod-eni":"4"}}
{"level":"INFO","time":"2024-05-14T10:21:19.827Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"fb4d75f","pods":"cilium/cilium-operator-786bd475cf-sz45c","duration":"3.587052721s"}
{"level":"INFO","time":"2024-05-14T10:21:25.746Z","logger":"controller.nodeclaim.lifecycle","message":"registered nodeclaim","commit":"fb4d75f","nodeclaim":"arm64-72m72","provider-id":"aws:///eu-central-1c/i-02e221f516dcfa08c","node":"ip-10-2-160-121.eu-central-1.compute.internal"}
{"level":"DEBUG","time":"2024-05-14T10:21:26.029Z","logger":"controller.nodeclaim.disruption","message":"discovered subnets","commit":"fb4d75f","nodeclaim":"arm64-72m72","subnets":["subnet-05d32764d86645112 (eu-central-1b)","subnet-04e892f4c7dc778f6 (eu-central-1a)","subnet-08a50cd900b173056 (eu-central-1c)"]}
{"level":"INFO","time":"2024-05-14T10:21:49.178Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"fb4d75f","nodeclaim":"arm64-72m72","provider-id":"aws:///eu-central-1c/i-02e221f516dcfa08c","node":"ip-10-2-160-121.eu-central-1.compute.internal","allocatable":{"cpu":"940m","ephemeral-storage":"18191325562","hugepages-1Gi":"0","hugepages-2Mi":"0","hugepages-32Mi":"0","hugepages-64Ki":"0","memory":"1419924Ki","pods":"8"}}
{"level":"DEBUG","time":"2024-05-14T10:22:31.434Z","logger":"controller.disruption","message":"discovered subnets","commit":"fb4d75f","subnets":["subnet-05d32764d86645112 (eu-central-1b)","subnet-04e892f4c7dc778f6 (eu-central-1a)","subnet-08a50cd900b173056 (eu-central-1c)"]}
{"level":"DEBUG","time":"2024-05-14T10:23:04.721Z","logger":"controller","message":"deleted launch template","commit":"fb4d75f","id":"lt-0dfc2fa3d5b427f10","name":"karpenter.k8s.aws/17404147073338961046"}
{"level":"DEBUG","time":"2024-05-14T10:23:04.826Z","logger":"controller","message":"deleted launch template","commit":"fb4d75f","id":"lt-0623b640f6e62bc5e","name":"karpenter.k8s.aws/15606164045452257729"}
Text Logs
2024-05-14T10:19:01.751Z INFO disrupting via consolidation replace, terminating 1 nodes (1 pods) ip-10-2-98-10.eu-central-1.compute.internal/c6g.medium/on-demand and replacing with spot node from types c6gn.medium, c6g.medium, c7g.medium, c7gd.medium, c6gd.medium and 11 other(s)
2024-05-14T10:19:01.785Z INFO created nodeclaim
2024-05-14T10:19:01.785Z DEBUG waiting on cluster sync
2024-05-14T10:19:02.116Z DEBUG created launch template
2024-05-14T10:19:02.310Z DEBUG created launch template
2024-05-14T10:19:02.777Z DEBUG waiting on cluster sync
2024-05-14T10:19:02.786Z DEBUG waiting on cluster sync
2024-05-14T10:19:03.788Z DEBUG waiting on cluster sync
2024-05-14T10:19:04.668Z INFO launched nodeclaim
2024-05-14T10:19:12.825Z INFO found provisionable pod(s)
2024-05-14T10:19:22.324Z DEBUG discovered subnets
2024-05-14T10:19:22.746Z INFO registered nodeclaim
2024-05-14T10:19:22.926Z INFO found provisionable pod(s)
2024-05-14T10:19:32.825Z INFO found provisionable pod(s)
2024-05-14T10:19:42.829Z INFO found provisionable pod(s)
2024-05-14T10:19:44.546Z INFO initialized nodeclaim
2024-05-14T10:19:48.167Z INFO command succeeded
2024-05-14T10:19:48.204Z INFO tainted node
2024-05-14T10:19:48.230Z INFO deleted node
2024-05-14T10:19:48.716Z INFO deleted nodeclaim
2024-05-14T10:20:04.758Z DEBUG deleted launch template
2024-05-14T10:20:04.856Z DEBUG deleted launch template
2024-05-14T10:20:10.060Z INFO disrupting via consolidation delete, terminating 1 nodes (1 pods) ip-10-2-131-28.eu-central-1.compute.internal/c6gn.medium/spot
2024-05-14T10:20:10.252Z INFO command succeeded
2024-05-14T10:20:10.290Z INFO tainted node
2024-05-14T10:20:10.307Z INFO deleted node
2024-05-14T10:20:10.695Z INFO deleted nodeclaim
2024-05-14T10:20:24.981Z DEBUG discovered subnets
2024-05-14T10:20:45.143Z DEBUG abandoning empty node consolidation attempt due to pod churn, command is no longer valid, delete, terminating 1 nodes (0 pods) ip-10-2-168-124.eu-central-1.compute.internal/c6gn.medium/spot
2024-05-14T10:21:06.525Z DEBUG 63 out of 704 instance types were excluded because they would breach limits
2024-05-14T10:21:06.828Z INFO found provisionable pod(s)
2024-05-14T10:21:06.829Z INFO computed new nodeclaim(s) to fit pod(s)
2024-05-14T10:21:06.842Z INFO created nodeclaim
2024-05-14T10:21:07.057Z DEBUG created launch template
2024-05-14T10:21:07.206Z DEBUG created launch template
2024-05-14T10:21:09.141Z INFO launched nodeclaim
2024-05-14T10:21:19.827Z INFO found provisionable pod(s)
2024-05-14T10:21:25.746Z INFO registered nodeclaim
2024-05-14T10:21:26.029Z DEBUG discovered subnets
2024-05-14T10:21:49.178Z INFO initialized nodeclaim
2024-05-14T10:22:31.434Z DEBUG discovered subnets
2024-05-14T10:23:04.721Z DEBUG deleted launch template
2024-05-14T10:23:04.826Z DEBUG deleted launch template
Expected Behavior: Karpenter should not consolidate (and then have to recreate) nodes that are required by podAntiAffinity.
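As a possible interim mitigation (not a fix for the consolidation behavior itself), the anti-affinity pods could carry the karpenter.sh/do-not-disrupt annotation so Karpenter keeps their nodes out of voluntary disruption. This is only a sketch, assuming the v1beta1 pod-level annotation and showing just the relevant fields of the "first" deployment from the reproduction steps below:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: first
spec:
  template:
    metadata:
      annotations:
        # Workaround sketch: ask Karpenter not to voluntarily disrupt
        # (consolidate/drift/expire) the node this pod is running on.
        karpenter.sh/do-not-disrupt: "true"

This only hides the symptom for these pods; the expected behavior above still stands.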
Reproduction Steps (Please include YAML):
- Launch a deployment with 2 replicas, no nodeSelector, and with podAntiAffinity
- Wait for 2 spot nodes
- Launch a deployment with 1 replica, a nodeSelector for an on-demand node, and without podAntiAffinity
- Watch Karpenter logs and nodeclaims
First YAML
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: first
name: first
namespace: default
spec:
progressDeadlineSeconds: 600
replicas: 2
revisionHistoryLimit: 10
selector:
matchLabels:
app: first
strategy:
rollingUpdate:
maxSurge: 25%
maxUnavailable: 25%
type: RollingUpdate
template:
metadata:
creationTimestamp: null
labels:
app: first
spec:
containers:
- command:
- /bin/sleep
- infinity
image: ubuntu:latest
imagePullPolicy: Always
name: first
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app: first
topologyKey: kubernetes.io/hostname
Second YAML
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: second
name: second
namespace: default
spec:
progressDeadlineSeconds: 600
replicas: 1
revisionHistoryLimit: 10
selector:
matchLabels:
app: second
strategy:
rollingUpdate:
maxSurge: 25%
maxUnavailable: 25%
type: RollingUpdate
template:
metadata:
creationTimestamp: null
labels:
app: second
spec:
containers:
- command:
- /bin/sleep
- infinity
image: ubuntu:latest
imagePullPolicy: Always
name: second
nodeSelector:
karpenter.sh/capacity-type: on-demand
Versions:
- Chart Version: karpenter-0.36.1
- Kubernetes Version (kubectl version):
Client Version: v1.29.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.3-eks-adc7111
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment