The `batchIdleDuration` value is ignored
Description
Observed Behavior:
Karpenter always waits for the `batchMaxDuration` value and ignores the `batchIdleDuration` value when several pods are in pending status.
I changed the above values to `batchIdleDuration=10s` and `batchMaxDuration=90s` to make the behaviour clearer.
As you can see in the following image, the pods had been in a pending state for more than 60 seconds and Karpenter had still not created a new NodeClaim.
Here are the controller logs:
{"level":"DEBUG","time":"2024-04-08T17:36:48.408Z","logger":"controller.nodeclass","message":"discovered security groups","commit":"17dd42b","ec2nodeclass":"default","security-groups":["sg-*******"]}
{"level":"DEBUG","time":"2024-04-08T17:36:48.417Z","logger":"controller.nodeclass","message":"discovered kubernetes version","commit":"17dd42b","ec2nodeclass":"default","version":"1.29"}
{"level":"DEBUG","time":"2024-04-08T17:36:48.421Z","logger":"controller.pricing","message":"updated spot pricing with instance types and offerings","commit":"17dd42b","instance-type-count":762,"offering-count":1927}
{"level":"DEBUG","time":"2024-04-08T17:36:48.605Z","logger":"controller.nodeclass","message":"discovered amis","commit":"17dd42b","ec2nodeclass":"default","ids":"ami-00a591fc812caa612, ami-0fba9088684638922, ami-0fba9088684638922, ami-0ec163aec4a900dc0","count":4}
{"level":"DEBUG","time":"2024-04-08T17:36:49.553Z","logger":"controller.pricing","message":"updated on-demand pricing","commit":"17dd42b","instance-type-count":706}
{"level":"DEBUG","time":"2024-04-08T17:36:50.692Z","logger":"controller.disruption","message":"discovered instance types","commit":"17dd42b","count":704}
{"level":"DEBUG","time":"2024-04-08T17:36:50.850Z","logger":"controller.disruption","message":"discovered offerings for instance types","commit":"17dd42b","instance-type-count":706}
{"level":"DEBUG","time":"2024-04-08T17:36:50.850Z","logger":"controller.disruption","message":"discovered zones","commit":"17dd42b","zones":["eu-central-1c","eu-central-1b","eu-central-1a"]}
{"level":"DEBUG","time":"2024-04-08T17:38:51.395Z","logger":"controller.disruption","message":"discovered subnets","commit":"17dd42b","subnets":["subnet-******** (eu-central-1b)","subnet-******** (eu-central-1c)","subnet-******** (eu-central-1a)"]}
{"level":"DEBUG","time":"2024-04-08T17:45:31.914Z","logger":"controller.provisioner","message":"73 out of 704 instance types were excluded because they would breach limits","commit":"17dd42b","nodepool":"nodepool-eu-central-1a"}
{"level":"INFO","time":"2024-04-08T17:45:31.936Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"17dd42b","pods":"default/pause1, default/pause2, default/pause3, default/pause4, default/pause5 and 5 other(s)","duration":"84.72464ms"}
{"level":"INFO","time":"2024-04-08T17:45:31.936Z","logger":"controller.provisioner","message":"computed new nodeclaim(s) to fit pod(s)","commit":"17dd42b","nodeclaims":1,"pods":10}
{"level":"INFO","time":"2024-04-08T17:45:31.952Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"17dd42b","nodepool":"nodepool-eu-central-1a","nodeclaim":"nodepool-eu-central-1a-8rzln","requests":{"cpu":"210m","memory":"41200Mi","pods":"14"},"instance-types":"c6a.12xlarge, c6a.16xlarge, c6a.24xlarge, c6a.32xlarge, c6a.8xlarge and 55 other(s)"}
{"level":"DEBUG","time":"2024-04-08T17:45:32.461Z","logger":"controller.nodeclaim.lifecycle","message":"created launch template","commit":"17dd42b","nodeclaim":"nodepool-eu-central-1a-8rzln","launch-template-name":"karpenter.k8s.aws/12721094165069015774","id":"lt-0b41adc259308f38f"}
{"level":"DEBUG","time":"2024-04-08T17:45:34.326Z","logger":"controller.disruption","message":"waiting on cluster sync","commit":"17dd42b"}
{"level":"INFO","time":"2024-04-08T17:45:34.478Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"17dd42b","nodeclaim":"nodepool-eu-central-1a-8rzln","provider-id":"aws:///eu-central-1a/i-0a8468578de536001","instance-type":"z1d.3xlarge","zone":"eu-central-1a","capacity-type":"spot","allocatable":{"cpu":"11900m","ephemeral-storage":"224Gi","memory":"88002Mi","pods":"110","vpc.amazonaws.com/pod-eni":"54"}}
{"level":"DEBUG","time":"2024-04-08T17:45:55.545Z","logger":"controller.disruption","message":"discovered subnets","commit":"17dd42b","subnets":["subnet-******** (eu-central-1a)","subnet-******** (eu-central-1b)","subnet-******** (eu-central-1c)"]}
{"level":"INFO","time":"2024-04-08T17:46:11.252Z","logger":"controller.nodeclaim.lifecycle","message":"registered nodeclaim","commit":"17dd42b","nodeclaim":"nodepool-eu-central-1a-8rzln","provider-id":"aws:///eu-central-1a/i-0a8468578de536001","node":"ip-10-1-6-176.eu-central-1.compute.internal"}
{"level":"INFO","time":"2024-04-08T17:46:24.385Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"17dd42b","nodeclaim":"nodepool-eu-central-1a-8rzln","provider-id":"aws:///eu-central-1a/i-0a8468578de536001","node":"ip-10-1-6-176.eu-central-1.compute.internal","allocatable":{"cpu":"11900m","ephemeral-storage":"240506825124","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"94711256Ki","pods":"110"}}
{"level":"DEBUG","time":"2024-04-08T17:46:57.299Z","logger":"controller.disruption","message":"discovered subnets","commit":"17dd42b","subnets":["subnet-******** (eu-central-1b)","subnet-******** (eu-central-1c)","subnet-******** (eu-central-1a)"]}
Expected Behavior:
The NodeClaim should be created once the `batchIdleDuration` time has passed and no new pending pods have been scheduled on the cluster.
Versions:
- Chart Version: 0.35.4
- Kubernetes Version (`kubectl version`): v1.29.1-eks-508b6b3
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
@MauroSoli Just to be sure, would you mind verifying that `BATCH_MAX_DURATION` and `BATCH_IDLE_DURATION` are being set properly? You can do this by running `kubectl describe pod <karpenter pod>`. Also, there is not much that we can see from the logs that you have shared. Can you share the logs with the message `found provisionable pod(s)` so we can see when the actual scheduling of the pods happened?
What is the size of the cluster that you are working with? Is it a test cluster that does not have too many pods/deployments? You have mentioned that the pods are in a pending state for more than 60 seconds. Given that a node can take some time to come up, do you expect to see the pods scheduled right after 10 seconds?
> Just to be sure, would you mind verifying that `BATCH_MAX_DURATION` and `BATCH_IDLE_DURATION` are being set properly? You can do this by running `kubectl describe pod`
$ kubectl describe pod karpenter-678d69d4d5-6rpgw -n karpenter | grep BATCH_
BATCH_MAX_DURATION: 90s
BATCH_IDLE_DURATION: 10s
> Also, there is not much that we can see from the logs that you have shared. Can you share the logs with the message `found provisionable pod(s)` so we can see when the actual scheduling of the pods happened?
The logs I've shared already contain the line you are looking for:
{"level":"INFO","time":"2024-04-08T17:45:31.936Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"17dd42b","pods":"default/pause1, default/pause2, default/pause3, default/pause4, default/pause5 and 5 other(s)","duration":"84.72464ms"}
> What is the size of the cluster that you are working with? Is it a test cluster that does not have too many pods/deployments?
We use Karpenter only to manage some specific workloads, such as builds and cron jobs. The cluster-autoscaler manages all other workloads.
> You have mentioned that the pods are in a pending state for more than 60 seconds. Given that a node can take some time to come up, do you expect to see the pods scheduled right after 10 seconds?
That's not what I meant. After 60 seconds, while the pods were still pending, Karpenter had not yet created a new NodeClaim; that only happened after 90 seconds.
Let's say you schedule a single pod and no new pod arrives within 10 seconds; Karpenter will then begin scheduling a new NodeClaim after those 10 seconds. However, if another pod shows up before the 10 seconds elapse, the batching window will be extended, up to the `maxDuration`. From the logs you have shared it seems like there were 10 pods, and in that case Karpenter would wait until the `batchMaxDuration` of 90s before it begins scheduling a NodeClaim.
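For illustration only (this is a minimal sketch of the semantics described above and in the docs, not Karpenter's actual source), a batching window of this kind can be modeled with two timers: an idle timer that is reset whenever a new pending pod arrives, and a max timer that caps the whole window:

```go
package main

import (
	"fmt"
	"time"
)

// batchWindow opens a batching window on the first pending-pod event and
// returns when the batch should be flushed: either no new pod arrived for
// `idle`, or the window has been open for `max` in total.
func batchWindow(trigger <-chan struct{}, idle, max time.Duration) {
	<-trigger // first pending pod opens the window

	maxTimer := time.NewTimer(max)
	idleTimer := time.NewTimer(idle)
	defer maxTimer.Stop()
	defer idleTimer.Stop()

	for {
		select {
		case <-trigger:
			// Another pod arrived before the idle timer fired: extend the
			// window by resetting the idle timer; the max timer keeps running.
			if !idleTimer.Stop() {
				<-idleTimer.C
			}
			idleTimer.Reset(idle)
		case <-idleTimer.C:
			fmt.Println("flush: no new pods for", idle)
			return
		case <-maxTimer.C:
			fmt.Println("flush: window has been open for", max)
			return
		}
	}
}

func main() {
	trigger := make(chan struct{}, 1)
	go func() {
		for { // simulate a new pending pod every 2 seconds
			select {
			case trigger <- struct{}{}:
			default:
			}
			time.Sleep(2 * time.Second)
		}
	}()
	// Pods keep arriving faster than the 10s idle duration, so the window is
	// repeatedly extended and only flushes at the 90s max.
	batchWindow(trigger, 10*time.Second, 90*time.Second)
}
```

With a single pod and no follow-up within 10 seconds, the same loop flushes as soon as the idle timer fires.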
I've been able to reproduce this using the latest Karpenter v0.35.4 and the latest EKS v1.29.1-eks-508b6b3, and it seems to me there could be some mismatch between what is stated above, what is stated in the documentation, and what Karpenter actually does. If I set `IDLE_DURATION` and `MAX_DURATION` high enough and far enough apart (say 10 seconds and 300 seconds) and run a script like `for I in $(seq 10); do kubectl run <single_pod>; sleep 2; done`, the following happens:
- new pods are created
- the "nothing happens" interval between one pod and the next is always less than 10 seconds
- the whole pod-creation sequence takes more than IDLE_DURATION
- after the last pod gets created, Karpenter waits until 300 seconds have passed since the creation of the first pod
- Karpenter creates new nodeclaim(s) to fit all the pending pods
In other words: the batch window gets immediately extended to `BATCH_MAX_DURATION` as soon as the creation of the second pod interrupts the first `BATCH_IDLE_DURATION` interval.
Personally, I understood the current documentation differently. It says: "`BATCH_IDLE_DURATION`: The maximum amount of time with no new pending pods that if exceeded ends the current batching window. If pods arrive faster than this time, the batching window will be extended up to the maxDuration. If they arrive slower, the pods will be batched separately." I understood it as follows:
1. one or more new pods are created
2. Karpenter notices and starts waiting for the IDLE duration
3. if nothing new happens, the batch window is immediately closed and node claims are computed
4. otherwise, if new pods are created before `BATCH_IDLE_DURATION` elapses, jump to 2 (starting a new wait for the IDLE duration)
5. if pods keep getting created before an `IDLE_DURATION` ends, Karpenter keeps extending the batch window in blocks of `IDLE_DURATION`, unless the whole window (defined as `now - time of first pod created`) exceeds `BATCH_MAX_DURATION`
6. if the window extends beyond `BATCH_MAX_DURATION`, it is immediately closed, node claims are computed, and the whole process starts again from 1
In other words: `BATCH_IDLE_DURATION` will always be the maximum "inactivity time" after which Karpenter starts computing node claims. To me, this makes much more sense, because it allows shorter latencies between workload demand (pod creation) and its fulfilment (node claims computed, nodes created, pods running), while `MAX_DURATION` still guarantees an upper bound on that latency by closing the batch window even if new pods keep arriving.
To me, the documentation seems correct and describes the behavior I would expect, but Karpenter is misbehaving: it applies `IDLE_DURATION` only the first time, then skips directly to `MAX_DURATION`, causing higher latencies between pod creation and node startup.
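To make the discrepancy concrete, here is a small back-of-the-envelope calculation (my own sketch, not Karpenter code) applying the documented semantics to the reproduction above (10 pods created 2 seconds apart, `IDLE_DURATION=10s`, `MAX_DURATION=300s`): the window should close about 28 seconds after the first pod, whereas the observed behavior only closes it at the 300-second max.

```go
package main

import (
	"fmt"
	"time"
)

// closeAtDocumented returns when the batching window should close under the
// documented semantics: the idle duration is measured from the last pod
// arrival, capped at maxDuration from the first arrival.
func closeAtDocumented(arrivals []time.Duration, idle, max time.Duration) time.Duration {
	end := arrivals[0] + idle
	for _, a := range arrivals[1:] {
		if a < end { // pod arrived inside the current idle window: extend it
			end = a + idle
		}
		// (a pod arriving after `end` would start a new batch; ignored here)
	}
	if capped := arrivals[0] + max; end > capped {
		end = capped
	}
	return end
}

func main() {
	// Reproduction: 10 pods created 2 seconds apart.
	var arrivals []time.Duration
	for i := 0; i < 10; i++ {
		arrivals = append(arrivals, time.Duration(i)*2*time.Second)
	}
	idle, max := 10*time.Second, 300*time.Second

	fmt.Println("documented semantics close at:", closeAtDocumented(arrivals, idle, max)) // 28s
	fmt.Println("observed behavior closes at:  ", arrivals[0]+max)                        // 5m0s
}
```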
I had probably misunderstood what was being implied earlier. I was able to reproduce this. Looking into creating a fix for this. Thanks.
/assign @jigisha620
/label v1
@billrayburn: The label(s) `/label v1` cannot be applied. These labels are supported: `api-review`, `tide/merge-method-merge`, `tide/merge-method-rebase`, `tide/merge-method-squash`, `team/katacoda`, `refactor`. Is this label configured under `labels -> additional_labels` or `labels -> restricted_labels` in `plugin.yaml`?
In response to this:
/label v1
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale