karpenter-provider-aws
Karpenter "Underutilised" disruption causing excessive node churn
Description
Context: We run Karpenter in production for an O(15)-node EKS cluster, using four (mutually exclusive) NodePools for different classes of application.
Our primary NodePool is workload, which provisions capacity for the majority of our pods.
Observed Behavior:
Approximately daily, we experience a period of high volatility among the (Karpenter-managed) workload nodes, caused by consolidation disruptions (reason: Underutilised).
This usually means that a large proportion of workload nodes get disrupted and replaced in a short period of time.
We usually see the newly-created nodes run for about 5-10 minutes, before they too are disrupted as Underutilised.
This disruption period usually occurs for 2-3 generations of replacement nodes, before stopping abruptly. The resulting nodes then typically run without disruption for many hours.
Notably, these events typically occur outside of office hours where changes to the running pods are very unlikely (e.g. rolling upgrades) and traffic is usually very low.
The resulting node topology is usually comparable to the starting topology, if not more complex, which doesn't suggest there was any significant resource underutilisation. However, in the worst cases, the pods hosted on these nodes may have been restarted up to four times in rapid succession, which is not desirable.
For example:
On 27th September at 22:45 (local time) we had 14 running workload nodeclaims:
- 7x m6a.large (or equivalent)
- 4x m7i-flex.xlarge (or equivalent)
- 3x m7i-flex.2xlarge
Between 22:45 and 23:15 (local time), 7 of these nodeclaims were disrupted and replaced with successive generations of m7i-flex.large nodeclaims (or equivalent) - a total of 15 "Underutilised" disruptions.
At the end of this process we were running 17 workload nodeclaims:
- 11x m7i-flex.large (or equivalent)
- 3x m6a.xlarge (or equivalent)
- 3x m7i-flex.2xlarge
So the net effect was replacing one xlarge node with 4 large nodes, and shuffling the instance generations slightly.
Pictorially:
(each green bar represents a nodeclaim, with time along the x axis)
Expected Behavior:
- Consolidation disruption due to underutilisation occurs as a single operation, such that pods hosted on these nodes only experience one restart.
- Nodeclaims created due to "Underutilised" consolidation should not be provisioned in an Underutilised state, necessitating further disruption.
Reproduction Steps (Please include YAML):
Our workload NodePool config is:
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  annotations:
    meta.helm.sh/release-namespace: karpenter
  labels:
    app.kubernetes.io/managed-by: Helm
  name: workload
spec:
  disruption:
    budgets:
      - nodes: 50%
    consolidateAfter: 5m
    consolidationPolicy: WhenEmptyOrUnderutilized
  limits:
    cpu: '256'
    memory: 1Ti
  template:
    metadata:
      labels:
        app: karpenter
        environment: prod
        name: karpenter
    spec:
      expireAfter: 336h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values:
            - r
            - m
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values:
            - '3'
        - key: karpenter.k8s.aws/instance-cpu
          operator: Lt
          values:
            - '17'
        - key: karpenter.k8s.aws/instance-cpu
          operator: Gt
          values:
            - '0'
        - key: karpenter.k8s.aws/instance-memory
          operator: Lt
          values:
            - '131073'
        - key: karpenter.k8s.aws/instance-memory
          operator: Gt
          values:
            - '2047'
        - key: topology.kubernetes.io/zone
          operator: In
          values:
            - eu-west-2a
            - eu-west-2b
            - eu-west-2c
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - spot
        - key: kubernetes.io/os
          operator: In
          values:
            - linux
        - key: kubernetes.io/arch
          operator: In
          values:
            - amd64
      taints:
        - effect: NoSchedule
          key: karpenter.sh
```
The corresponding EC2NodeClass is:
```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  annotations:
    meta.helm.sh/release-namespace: karpenter
  labels:
    app: karpenter
    app.kubernetes.io/managed-by: Helm
    name: karpenter
  name: default
spec:
  amiFamily: AL2
  amiSelectorTerms:
    - name: amazon-eks-node-1.29-*
      owner: amazon
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        deleteOnTermination: true
        encrypted: true
        kmsKeyID: <masked>
        volumeSize: 100Gi
        volumeType: gp3
  instanceProfile: KarpenterNodeInstanceProfile-prod-eu-west-2-eks
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 1
    httpTokens: required
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-eu-west-2-eks
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-eu-west-2-eks
```
We also have the spot-to-spot consolidation feature flag enabled.
Versions:
- Chart Version: 1.0.2
- Kubernetes Version (`kubectl version`): 1.29

However, we have observed this behaviour as far back as chart v0.36.2 using the v1beta1 CRDs. We've also seen this on v1.28 and earlier versions of Kubernetes.
Additional Questions:
- Can you explain what threshold the Karpenter controller uses to determine Underutilisation?
- We see 50% of nodes being affected during the consolidation disruption window, (matching our disruption budget), but why do we not see similar disruption before and after this ~30 minute window?
For me, it's flapping nodes like crazy. How does Karpenter determine that a node is underutilized? Can we lower the threshold?
```
{"level":"INFO","time":"2024-10-14T19:21:24.227Z","logger":"controller","caller":"disruption/controller.go:176","message":"disrupting nodeclaim(s) via delete, terminating 1 nodes (21 pods) ip-10-60-213-231.ec2.internal/t3.2xlarge/on-demand","commit":"5bdf9c3","controller":"disruption","namespace":"","name":"","reconcileID":"c6a54af5-967b-4783-b3fc-f5844afe5342","command-id":"c41ae5f3-5b0b-4d9f-9fd4-ad7abcb05cff","reason":"underutilized"}
{"level":"INFO","time":"2024-10-14T19:23:11.702Z","logger":"controller","caller":"disruption/controller.go:176","message":"disrupting nodeclaim(s) via delete, terminating 1 nodes (19 pods) ip-10-60-149-172.ec2.internal/t3.2xlarge/on-demand","commit":"5bdf9c3","controller":"disruption","namespace":"","name":"","reconcileID":"166e47a5-24fe-468c-b979-f8a8a8a31e90","command-id":"7df82325-f24a-4a6e-8759-83fc6267c234","reason":"underutilized"}
{"level":"INFO","time":"2024-10-14T19:25:50.448Z","logger":"controller","caller":"disruption/controller.go:176","message":"disrupting nodeclaim(s) via delete, terminating 1 nodes (24 pods) ip-10-60-150-105.ec2.internal/t3.2xlarge/on-demand","commit":"5bdf9c3","controller":"disruption","namespace":"","name":"","reconcileID":"5cab5197-7eaa-47b9-ba5b-c31918b99733","command-id":"d03d50e8-22fe-406b-b299-d6363573051e","reason":"underutilized"}
{"level":"INFO","time":"2024-10-14T19:27:27.762Z","logger":"controller","caller":"disruption/controller.go:176","message":"disrupting nodeclaim(s) via delete, terminating 1 nodes (20 pods) ip-10-60-29-142.ec2.internal/t3.2xlarge/on-demand","commit":"5bdf9c3","controller":"disruption","namespace":"","name":"","reconcileID":"2d125665-5702-4351-9fdc-199c30e67439","command-id":"806ed776-e8be-4eeb-95d2-63e12b68df28","reason":"underutilized"}
{"level":"INFO","time":"2024-10-14T19:29:59.176Z","logger":"controller","caller":"disruption/controller.go:176","message":"disrupting nodeclaim(s) via delete, terminating 1 nodes (28 pods) ip-10-60-26-48.ec2.internal/t3.2xlarge/on-demand","commit":"5bdf9c3","controller":"disruption","namespace":"","name":"","reconcileID":"32612bb0-5066-4dc3-adae-5779c6557fe8","command-id":"eaeac7f6-ab26-41f4-99ba-daa5d0b8eb43","reason":"underutilized"}
{"level":"INFO","time":"2024-10-14T19:32:24.048Z","logger":"controller","caller":"disruption/controller.go:176","message":"disrupting nodeclaim(s) via delete, terminating 1 nodes (28 pods) ip-10-60-145-213.ec2.internal/t3.2xlarge/on-demand","commit":"5bdf9c3","controller":"disruption","namespace":"","name":"","reconcileID":"9658aa9f-c84b-4ceb-973e-f4ff304d2bb3","command-id":"fc06784c-b80e-4668-ae79-7696f9cad551","reason":"underutilized"}
{"level":"INFO","time":"2024-10-14T19:37:33.396Z","logger":"controller","caller":"disruption/controller.go:176","message":"disrupting nodeclaim(s) via delete, terminating 1 nodes (36 pods) ip-10-60-146-14.ec2.internal/t3.2xlarge/on-demand","commit":"5bdf9c3","controller":"disruption","namespace":"","name":"","reconcileID":"74a7d084-82ed-48b6-996c-9667dcd4ee39","command-id":"d336a720-a745-4b36-971a-fd30c5e51ecd","reason":"underutilized"}
{"level":"INFO","time":"2024-10-14T19:40:48.090Z","logger":"controller","caller":"disruption/controller.go:176","message":"disrupting nodeclaim(s) via delete, terminating 1 nodes (24 pods) ip-10-60-86-19.ec2.internal/t3.2xlarge/on-demand","commit":"5bdf9c3","controller":"disruption","namespace":"","name":"","reconcileID":"339a1581-48c9-46da-a825-84c87103c507","command-id":"f56bdaac-dca0-4388-b7f2-9fdec4a66942","reason":"underutilized"}
```
The most annoying thing is that it terminates nodes with CPU requests at ~95%, for example the two above from the log.
In general it terminates nodes and creates new ones (the only gap is the restart, because I have consolidateAfter: 30m).
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: eks-dev-linux-amd64
spec:
  disruption:
    budgets:
      - nodes: "3"
        reasons:
          - Empty
      - nodes: "1"
        reasons:
          - Underutilized
          - Drifted
    consolidateAfter: 30m
    consolidationPolicy: WhenEmptyOrUnderutilized
  limits:
    cpu: 1k
    memory: 3600Gi
  template:
    metadata: {}
    spec:
      expireAfter: 360h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: eks-dev-private-linux-amd64-483662afb67fa01941078b5fb6059056e47
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - t3.2xlarge
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - on-demand
        - key: kubernetes.io/arch
          operator: In
          values:
            - amd64
        - key: kubernetes.io/os
          operator: In
          values:
            - linux
  weight: 50
```
We are seeing the same issue, lots of nodes flapping, seemingly because Karpenter believes everything is underutilized, eg.
karpenter-<redacted>-wrrqr controller {"level":"INFO","time":"2024-10-15T03:51:16.165Z","logger":"controller","message":"disrupting nodeclaim(s) via delete, terminating 1 nodes (6 pods) ip-<redacted>on-demand","commit":"6174c75","controller":"disruption","namespace":"","name":"","reconcileID":"fa42cbd2-273c-4970-8246-d18594d2f78f","command-id":"8e30b622-a60d-4eab-b1d9-b998186c54b0","reason":"underutilized"}
It would be good to know how to determine what Karpenter is using to base this utilisation check on, and whether we can fix it ourselves.
Has anyone tried PDBs ? source: https://catalog.workshops.aws/karpenter/en-US/scheduling-constraints/pod-disruption#4.-deploy-pdb-and-application
PDBs might reduce interruption frequency for a specific group of pods but that wouldn't change any of Karpenter's desire to reclaim a node in a way that doesn't actually improve things or reduce costs.
Having the ability to influence this by requiring a minimum cost savings when replacing a node or tuning the threshold of what's deemed underutilized would be useful.
I also see this when running 1.0.6 on a large production cluster; the same setup worked fine in 0.37.5 (we have some topologySpreadConstraints in place).
It seems that disabling spot-to-spot consolidation makes Karpenter behave again like it should. After disabling it, the amount of node flapping/swapping is reduced, though not completely back to how it was before.
@ChrisV78 Are you including a mix of Spot and On-Demand node types in your NodePool config?
Is it possible that you're not seeing the excessive node churn because disabling the spot-to-spot consolidation feature flag has effectively disabled consolidation altogether?
Only a NodePool with spot, and I still see plenty of consolidation on spot with "underutilised" (and "empty"), but not in the excessive amount seen when spot-to-spot consolidation was still set to true.
I only have On-Demand instances and still see constant node cycling that leads to the same/similar number of nodes and types.
Same here: I have a single NodePool with only on-demand instances, and with consolidationPolicy: WhenUnderutilized nodes are continuously disrupted. I am not sure if limiting CPU also worsens the situation; I have a limit of cpu: 1000. Could you please give us more advice on how to address this situation?
Currently seeing very similar in our fleet. Relatively static workloads in a nodepool of just on-demand instances sometimes leads to a situation with constant node churn.
edit: worth noting that in our case, the nodepools where we see this happen most are ones we assign a single instance type to
@ChrisV78 disabling the spot-to-spot consolidation will help for on-demand instances?? I have all the nodepools for on-demand instances and consolidationPolicy set to WhenUnderutilized. I am facing similar issue here- https://github.com/aws/karpenter-provider-aws/issues/7344 It makes nodes continuously disrupting even though SpotToSpotConsolidation is already set to false.
Same as everyone here. Has anybody got a workaround without disabling SpotToSpotConsolidation?
Disabling SpotToSpotConsolidation didn't solve the problem for me. Has anybody found a version that works, or a way to configure the underutilization percentage?
It seems it takes the minimum across resource types, which can cause a node to be flagged as underutilized.
> @ChrisV78 disabling the spot-to-spot consolidation will help for on-demand instances?? I have all the nodepools for on-demand instances and consolidationPolicy set to WhenUnderutilized. I am facing similar issue here- #7344 It makes nodes continuously disrupting even though SpotToSpotConsolidation is already set to false.
That's not what I posted, see also https://github.com/aws/karpenter-provider-aws/issues/7146#issuecomment-2449864999
I'm seeing similar behavior on v1.0.6 with only on-demand instances.
- Karpenter will terminate nodes it thinks are underutilized, but then replace them with the exact same instance type (m5a.large)

Here's an example of the make-up of the nodes in a cluster where I see the issue. Are the utilizations for some nodes so low that it triggers this behavior?
You can easily have the following scenario, and I have provoked these scenarios many times;
- 95%ish cpu and memory utilization across the nodepool, (after some time, randomness and luck you can see Karpenter reach a very nice and stable state like this)
- all Karpenter-chosen nodes being `m` family.

Now you drain a node. Because of how Karpenter happens to batch these pending pods, you see the immediate replacement is a `c` node, followed by another `c` node. A few minutes later Karpenter decides to restructure other nodes, even though these are also very well utilized, because it decides it can rebalance things better… and the newer nodes now have very recent activity, so `consolidateAfter: NN` prevents the actually BETTER candidate nodes from being replaced.
If you wait half an hour you may end up in the exact same ending scenario as before you decided to “provoke” the cluster by draining a node; except that Karpenter passed workloads through several different r nodes, more c nodes and some even of varying sizes.
My workaround for now is to try as much as possible to get Karpenter to be Cluster Autoscaler by looking at the average resource distribution of the workloads and configure node pools to have one single instance family so Karpenter doesn’t take “rash” decisions (ie choose the wrong cpu:ram ratio instance just because that particular isolated pending pod batch happened to match a certain instance type).
I have now done the following, all to avoid some of the random havoc from how Karpenter is designed and default deployed:
- single instance family
- Very limited instance size choice
- Bigger pod batch windows to give more data for better decisions (and higher chance of favouring the slightly bigger node candidate)
- expireAfter: Never, because otherwise the uptime of a node influences Karpenter's node-eviction decisions when consolidating (i.e. it will have a higher chance of choosing a high-uptime, higher-binpacked node because it is closer to expiration and thus pod eviction costs are lower)
Some of the basis for these decisions can only be found in the source code.
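For example, a NodePool fragment along these lines (the family and sizes are illustrative choices, not recommendations) pins Karpenter to a single cpu:ram ratio and removes node age from the equation:

```yaml
spec:
  template:
    spec:
      expireAfter: Never   # node age no longer influences consolidation choices
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m6a"]                 # single instance family
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["xlarge", "2xlarge"]   # very limited size choice
```

The batch windows are controller-level settings (batchMaxDuration / batchIdleDuration), not NodePool fields.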
@frimik so do you have multiple "generic" NodePools for each instance family or are different workloads targeting m vs r? If the latter, I would've expected Karpenter to still end up consolidating nodes between the two pools.
I couldn't find a workaround that actually stopped this behavior so instead opted for a setup that aims to reduce it. For a ~500 node cluster the following was a good enough balance where >60% nodes tend to live until expiration:
- Longer consolidateAfter duration (9-12hr) to limit the constant node thrash
- Run StatefulSets with do-not-disrupt to further limit the scope of nodes for consolidation consideration. We happen to have a good balance of StatefulSets vs other workloads where it makes sense but YMMV.
- Require a higher baseline of instance cpu/memory in the NodePools both to limit instance-type choice and to limit the number of overall nodes which makes the thrash a bit less frequent.
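For reference, the do-not-disrupt part of that setup is just a pod-template annotation (the StatefulSet name here is made up):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example-db   # hypothetical workload
spec:
  template:
    metadata:
      annotations:
        # Karpenter won't voluntarily evict these pods, so the nodes
        # hosting them drop out of consolidation consideration.
        karpenter.sh/do-not-disrupt: "true"
```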
Hi everyone,
I have enabled the underutilised condition with a schedule:

```yaml
consolidationPolicy: WhenEmptyOrUnderutilized
budgets:
  - nodes: "20%"
    schedule: "30 18 * * 6,0"
    duration: 7h
    reasons:
      - "Underutilized"
```

With this condition enabled, consolidation should only happen in the scheduled night-time window, right? But disruption happens in the daytime also. Why? I have tried many conditions to make underutilisation consolidation happen only at night, but all of them failed.
Please help me understand why this is happening.
30 18 * * 6,0 means “At 18:30 on Saturday and Sunday.”
Off topic... but eh
> I have enabled the underutilised condition with scheduling enabled. consolidationPolicy: WhenEmptyOrUnderutilized budgets: - nodes: "20%" schedule: "30 18 * * 6,0" duration: 7h reasons: - "Underutilized"
What are you trying to achieve exactly? I'm guessing you want to turn it around to blocking disruption during office hours, following example:
```yaml
budgets:
  - nodes: 20%                  # first budget: cycle at most 20%
  - nodes: 0                    # budget of 0 nodes, meaning no changes
    schedule: "0 4 * * mon-fri" # 4 AM Monday to Friday: start of the block window
    duration: 18h               # schedule + duration determines the end of the window (here 22:00)
```
Note this only covers voluntary disruption, but does include other reasons (your example only includes Underutilized, not Empty, Drifted).
> 30 18 * * 6,0 means “At 18:30 on Saturday and Sunday.”
My intention is not that. Even after I enabled this condition, termination due to underutilization happens in business hours, Monday to Friday.
> What are you trying to achieve exactly? I'm guessing you want to turn it around to blocking disruption during office hours, per the example above.
I set up this condition ("30 18 * * 6,0") so that node termination due to underutilized status happens only on Saturdays and Sundays, but the termination is also happening during business hours (Monday to Friday). The "30 18 * * 6,0" condition is not working here; why? That is my issue.
For the other reasons, "WhenEmpty" and "Drifted", I have set other conditions and those work as expected.
Without your complete budget overview, it is hard to troubleshoot. Most likely the budget you're pinpointing is not accurately defining a ceiling for the other budgets. In simple terms, saying that 20% of your nodes can cycle on Saturday and Sunday, does not impact what happens during all the other days. That is why typically you want to flip these conditions around to explicitly block changes on the defined interval (i.e. workdays).
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: karpenter
spec:
  maxUnavailable: 1
```

The above is my PDB.
He is talking about node disruption budgets in Karpenter NodePools, not PDBs for the Karpenter pods themselves. :)
Did you already read https://karpenter.sh/docs/concepts/disruption/ ? Lots of info there.
We have an on-demand node pool with expireAfter: Never (comment). Karpenter: 1.3.2.
Testing Scenario:
- Scale up an application Deployment to trigger a new Karpenter NodeClaim and wait for the node to become ready.
- The new node has only the newly scaled pods and basic DaemonSet pods, such as aws-node, kube-proxy.
- Scale down the application to remove all pods from the new node.
- The new node is now empty.
- Karpenter moves all pods (20+) from an already utilized node to the newly created one.
- Karpenter deletes the old node.
Is it possible to disable this behavior in Karpenter? In this scenario, Cluster Autoscaler would simply remove the new node instead.
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: test-node
spec:
  template:
    metadata:
      labels:
        test-node: "true"
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: "node.kubernetes.io/instance-type"
          operator: In
          values: ["c6a.xlarge"]
        - key: "kubernetes.io/arch"
          operator: In
          values: ["amd64"]
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["on-demand"]
      expireAfter: Never
  limits:
    cpu: 56
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m
```
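If the goal is strictly the Cluster Autoscaler behavior described above (remove nodes only once they are empty, never move pods to consolidate), one option, at the cost of losing all pod-moving consolidation, is restricting the policy to empty nodes:

```yaml
disruption:
  consolidationPolicy: WhenEmpty  # only reclaim nodes that are already empty
  consolidateAfter: 5m
```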
Tuning these parameters significantly reduced node churn in our setup (docs). However, Karpenter still removes nodes with higher CPU and memory requests, even when there are nodes with much lower requests.
```yaml
settings:
  batchMaxDuration: 30s
  batchIdleDuration: 15s
```
I think this is related to this section of code, where `add()` is called before relaxing the scheduling preferences; as a result the preferences are treated as strictly required:
```go
func (s *Scheduler) trySchedule(ctx context.Context, p *corev1.Pod) error {
	for {
		if ctx.Err() != nil {
			return ctx.Err()
		}
		err := s.add(ctx, p)
		if err == nil {
			return nil
		}
		// We should only relax the pod's requirements when the error is not a reserved offering error because the pod may be
		// able to schedule later without relaxing constraints. This could occur in this scheduling run, if other NodeClaims
		// release the required reservations when constrained, or in subsequent runs. For an example, reference the following
		// test: "shouldn't relax preferences when a pod fails to schedule due to a reserved offering error".
		if IsReservedOfferingError(err) {
			return err
		}
		// Eventually we won't be able to relax anymore and this while loop will exit
		if relaxed := s.preferences.Relax(ctx, p); !relaxed {
			return err
		}
		if e := s.topology.Update(ctx, p); e != nil && !errors.Is(e, context.DeadlineExceeded) {
			log.FromContext(ctx).Error(e, "failed updating topology")
		}
		// Update the cached podData since the pod was relaxed, and it could have changed its requirement set
		s.updateCachedPodData(p)
	}
}
```
This is most likely related to https://github.com/kubernetes-sigs/karpenter/issues/666 as well and should be helped by https://github.com/kubernetes-sigs/karpenter/pull/2122.
Adding on to what @frimik said:
> You can easily have the following scenario, and I have provoked these scenarios many times;
> - 95%ish cpu and memory utilization across the nodepool (after some time, randomness and luck you can see Karpenter reach a very nice and stable state like this)
> - all Karpenter-chosen nodes being `m` family.
>
> Now you drain a node: because of how Karpenter happens to batch these pending pods, you see the immediate replacement is a `c` node, followed by another `c` node. A few minutes later Karpenter decides to restructure other nodes, even though these are also very well utilized, because it decides it can rebalance things better… and the newer nodes now have very recent activity, so `consolidateAfter: NN` prevents the actually BETTER candidate nodes from being replaced.
It's easy to imagine another situation where Karpenter decides to consolidate old nodes very frequently, caused by pod churn on a small subset of new nodes. ReplicaSets will choose to delete their newest pods first, so if a ReplicaSet scale-up caused Karpenter to create nodes, a corresponding scale-down may cause only those new nodes to be underutilized (or even empty). So what happens in a stabilized cluster where all nodes are "consolidatable", when a ReplicaSet like that scales down for a bit?
This is a simplified diagram of 1 app on 3 nodes. A black node is consolidatable, and blue is unconsolidatable. Total app requests on the node are represented by the red area.
Also, the default k8s scheduling behavior is to schedule pods to the least-utilized nodes, so with a relatively high-churn workload, it means the least-utilized nodes are the most likely to be unconsolidatable in Karpenter due to the consolidateAfter timer constantly resetting on them. In turn, this means any highly-utilized, "old" node which has passed its consolidateAfter time is immediately considered as a candidate for consolidating to the other less-utilized unconsolidatable nodes, creating a consistent cycle of node churn.
In cluster-autoscaler, this would normally be prevented by setting a node consolidation threshold, where only nodes <50% utilized (or whatever you set) are considered for consolidation. That approximates the effect of workloads on less-utilized nodes consolidating onto the more-utilized nodes. A node also has to be consolidatable for a long-enough time in cluster-autoscaler before it finally gets consolidated, preventing overly-frequent scaling.
Adding a threshold to Karpenter above which a node is not considered consolidatable would help resolve this. And to keep cost-related consolidation working on all nodes, I think an exception would have to be added for nodes above that allocation threshold where pods on that node are considered only for consolidation to a single cheaper node, not to any other node on the cluster.