karpenter-provider-aws
Karpenter does not clean up empty drifted nodes
Description
Observed Behavior:
We upgraded the AMI of an EC2NodeClass and saw the existing NodeClaims go into Drifted status, which is expected. We also deleted all pods running on one of the existing nodes and confirmed that only DaemonSet pods are running there. However, the NodeClaim/node is not removed by Karpenter.
Querying the owners of all pods running on the node returns only DaemonSet:
$ kubectl get pod --no-headers -A --field-selector=spec.nodeName=<NODE_NAME> -o jsonpath='{range .items[*]}{@.metadata.ownerReferences[0].kind}{"\n"}{end}' | sort | uniq
DaemonSet
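For context, the drift here was triggered by changing the AMI on the EC2NodeClass. A minimal hypothetical sketch of that kind of change (our actual EC2NodeClass is not shown in this issue, the AMI ID is a placeholder, and unrelated required fields are omitted):

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: example
spec:
  # Pointing the class at a newer AMI marks existing NodeClaims
  # that were launched from the old AMI as Drifted (reason: AMIDrift).
  amiSelectorTerms:
  - id: ami-0123456789abcdef0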
NodeClaim status and event:
Conditions:
  Last Transition Time:  2025-04-24T18:26:44Z
  Message:
  Observed Generation:   1
  Reason:                Launched
  Status:                True
  Type:                  Launched
  Last Transition Time:  2025-04-24T18:28:35Z
  Message:
  Observed Generation:   1
  Reason:                Registered
  Status:                True
  Type:                  Registered
  Last Transition Time:  2025-04-24T18:28:54Z
  Message:
  Observed Generation:   1
  Reason:                Initialized
  Status:                True
  Type:                  Initialized
  Last Transition Time:  2025-04-24T18:36:44Z
  Message:
  Observed Generation:   1
  Reason:                ConsistentStateFound
  Status:                True
  Type:                  ConsistentStateFound
  Last Transition Time:  2025-04-24T19:17:56Z
  Message:               AMIDrift
  Observed Generation:   1
  Reason:                AMIDrift
  Status:                True
  Type:                  Drifted
  Last Transition Time:  2025-04-24T19:44:39Z
  Message:
  Observed Generation:   1
  Reason:                Consolidatable
  Status:                True
  Type:                  Consolidatable
  Last Transition Time:  2025-04-24T18:28:54Z
  Message:
  Observed Generation:   1
  Reason:                Ready
  Status:                True
  Type:                  Ready
Events:
Type    Reason            Age                   From       Message
----    ------            ----                  ----       -------
Normal  Unconsolidatable  12m (x687 over 7d4h)  karpenter  NodePool "jobs" has non-empty consolidation disabled
NodePool configuration:
spec:
  disruption:
    budgets:
    - nodes: 10%
    consolidateAfter: 20m
    consolidationPolicy: WhenEmpty
Expected Behavior:
Karpenter should clean up the NodeClaim and node if it is empty.
Reproduction Steps (Please include YAML):
Versions:
- Chart Version: 1.2.1
- Kubernetes Version (kubectl version): 1.30
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
I think https://github.com/kubernetes-sigs/karpenter/blob/3087cce35ccd741461110028ed16f15434933c06/pkg/controllers/disruption/controller.go#L180 would have picked up "drift" as the reason, and https://github.com/kubernetes-sigs/karpenter/blob/3087cce35ccd741461110028ed16f15434933c06/pkg/controllers/disruption/drift.go#L77 would have placed the empty node into the Drifted budget, of which we have none.
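For reference, a budget scoped explicitly to the Drifted reason (a hypothetical sketch using the v1 NodePool disruption API, not the configuration from this issue; the 20% value is arbitrary) would look roughly like this:

spec:
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 20m
    budgets:
    # Applies only to drifted nodes.
    - reasons:
      - Drifted
      nodes: "20%"
    # Budgets without a reasons list apply to all disruption reasons.
    - nodes: "10%"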
I have observed the same problem. Karpenter does not remove nodes when only DaemonSets are left on them.
Karpenter 1.3.2, Kubernetes 1.31
I have also observed this problem. I am using NVIDIA MIG on a GPU instance with the NVIDIA gpu-operator, which effectively removes 4 of the 8 GPUs shortly after the instance starts, so this node is always "Drifted" once the MIG config has been applied.
Karpenter 1.3.2, Kubernetes 1.32
Spec:
  Disruption:
    Budgets:
      Duration:   13h
      Nodes:      0
      Reasons:
        Drifted
      Schedule:   30 21 * * *
    Consolidate After:     300s
    Consolidation Policy:  WhenEmpty
Log entry:
{"level":"ERROR","time":"2025-08-12T10:50:12.926Z","logger":"controller","caller":"consistency/controller.go:110","message":"consistency error","commit":"1c39126","controller":"nodeclaim.consistency","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"karpenter-some-name-tj9ld"},"namespace":"","name":"karpenter-some-name-tj9ld","reconcileID":"a98caec6-759d-421b-a478-213f5fa216d9","Node":{"name":"some-node-name"},"error":"expected 8 of resource nvidia.com/gpu, but found 4 (50.0% of expected)"}
Is this still an issue with Karpenter 1.6? We are seeing nodes not being cleaned up even though only DaemonSets are running on them.
@kirann-hegde Can you share the settings? Which value do you have for consolidateAfter?
I am also observing the same behavior (only DaemonSet pods are left) on Karpenter 1.6.2 with these NodePool properties:
disruption:
  consolidationPolicy: WhenEmpty
  consolidateAfter: 5m
EKS 1.33
Update: resolved on my side. The node was not being terminated because a stuck NodeClaim on another node was blocking disruption under the default (10%) disruption budget on the NodePool. After increasing the budget to 20%, the node was gone.
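A rough sketch of that change (hypothetical; only the budget field is shown):

spec:
  disruption:
    budgets:
    # Raised from the default 10% so a single stuck NodeClaim
    # no longer blocks disruption of other nodes in the NodePool.
    - nodes: "20%"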