Custom k8s scheduler support for Karpenter e.g., Apache YuniKorn, Volcano
Tell us about your request
Add Karpenter support for working with custom schedulers (e.g., Apache YuniKorn, Volcano).
As per my understanding, Karpenter works only with the default scheduler to schedule pods. However, it's prevalent in the Data on Kubernetes community to use custom schedulers like Apache YuniKorn or Volcano for running Spark jobs on Amazon EKS.
With the requested feature, Karpenter would effectively be used as the autoscaler for spinning up new nodes, while YuniKorn or Volcano handles the scheduling decisions.
Please correct me and provide some context if this feature is already supported.
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
Using Apache YuniKorn or Volcano is becoming a basic requirement for running batch workloads (e.g., Spark) on Kubernetes. These schedulers are more application-aware than the default scheduler and provide a number of other useful features (e.g., resource queues, job sorting) for running multi-tenant data workloads on Kubernetes (Amazon EKS).
At the moment we can only use Cluster Autoscaler with these custom schedulers, but it would be beneficial to add Karpenter support so we can leverage the performance optimisations Karpenter offers over Cluster Autoscaler.
Are you currently working around this issue?
No. We are using Cluster Autoscaler as an alternative that works with custom schedulers.
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Hey @vara-bonthu , Thanks for the feature request. This feature is not supported currently and is not yet on our roadmap. It sounds like this would be a pretty significant amount of effort given Karpenter would have to adhere to the scheduling decisions of multiple custom schedulers.
While I have not personally tested a custom scheduler with Karpenter, it should be able to at least launch nodes even if a custom scheduler is in use (Karpenter simply watches for pending pods and then spawns nodes accordingly). Though as you mentioned, Karpenter is built to adhere to the scheduling decisions of kube-scheduler. So it's certainly possible you would run across some cases where Karpenter makes incorrect decisions when a custom scheduler is in the mix.
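For what it's worth, pods opt into a custom scheduler via `spec.schedulerName`, so from Karpenter's point of view they still surface as ordinary pending pods. A minimal sketch (the scheduler name and everything else here is illustrative, not taken from a tested setup):

```yaml
# A pod handed to a custom scheduler instead of kube-scheduler.
# Karpenter only sees a pending pod with resource requests.
apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduler-demo
spec:
  schedulerName: yunikorn   # e.g. "volcano"; defaults to "default-scheduler"
  containers:
    - name: app
      image: public.ecr.aws/docker/library/busybox:latest
      command: ["sleep", "3600"]
      resources:
        requests:
          cpu: "1"
          memory: 1Gi
```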
If you have a configuration you could share, it would be fun to do some testing with custom schedulers to see how Karpenter responds.
I'd love to learn how custom schedulers make different decisions than the kube-scheduler. Technically, we're agnostic of the kube-scheduler, but we support the pod spec fields that impact scheduling. Do custom schedulers respect all of those fields?
Could you provide a concrete example workflow of what decisions you'd like to see karpenter make when working alongside a custom scheduler?
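To make "the pod spec fields that impact scheduling" concrete, these are the standard fields Karpenter's provisioning simulation takes into account; the values below are illustrative only:

```yaml
# Illustrative pod showing the standard scheduling-related fields
# (nodeSelector, affinity, tolerations, topology spread) that
# Karpenter honors when deciding what capacity to launch.
apiVersion: v1
kind: Pod
metadata:
  name: scheduling-fields-demo
  labels:
    app: demo
spec:
  nodeSelector:
    karpenter.sh/capacity-type: spot
  tolerations:
    - key: example.com/dedicated        # hypothetical taint
      operator: Exists
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: demo
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values: ["m5.4xlarge", "r5d.4xlarge"]
  containers:
    - name: app
      image: public.ecr.aws/docker/library/busybox:latest
      command: ["sleep", "3600"]
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
```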
I've tested Karpenter with Volcano successfully. There was one issue (PR to fix in https://github.com/volcano-sh/volcano/pull/2602) that was causing Volcano to use an unconventional Reason that prevented Karpenter from triggering scale-up, but once this PR lands things should be working again.
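For anyone who wants to reproduce this, a minimal gang-scheduled Volcano job might look like the sketch below (this assumes Volcano's `batch.volcano.sh/v1alpha1` Job API; the name, queue, and sizes are illustrative):

```yaml
# Small gang-scheduled Volcano job: all 4 replicas must be placeable
# before any of them run, so the autoscaler has to bring up enough
# capacity for the whole gang.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gang-demo
spec:
  schedulerName: volcano   # hand the pods to Volcano, not kube-scheduler
  minAvailable: 4          # gang constraint: 4 replicas or nothing
  queue: default
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: worker
              image: public.ecr.aws/docker/library/busybox:latest
              command: ["sleep", "3600"]
              resources:
                requests:
                  cpu: "1"
                  memory: 1Gi
```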
@ellistarn
I think the initial issue that I encountered could be due to the older version of Apache YuniKorn (0.12.1), where I installed YuniKorn as a secondary k8s scheduler. Also, I didn't enable the admission controller. This error could be caused by multiple schedulers running at the same time. Here is the log from the old tests. However, the good news is that it works with the latest version; please see below for more details.
YuniKorn Scheduler Error Summary
ERROR external/scheduler_cache.go:203 pod updated on a different node than previously added to
ERROR external/scheduler_cache.go:204 scheduler cache is corrupted and can badly affect scheduling decisions
YuniKorn Scheduler full log
2022-01-31T17:12:08.558Z INFO cache/context.go:552 app added {"appID": "spark-79df83d15b6843b7bb1cac31e7135e9c"}
2022-01-31T17:12:08.559Z INFO cache/context.go:612 task added {"appID": "spark-79df83d15b6843b7bb1cac31e7135e9c", "taskID": "9d00be6e-8c41-4c35-a089-5b6932e339ac", "taskState": "New"}
2022-01-31T17:12:09.253Z INFO cache/application.go:436 handle app submission {"app": "applicationID: spark-79df83d15b6843b7bb1cac31e7135e9c, queue: root.spark, partition: default, totalNumOfTasks: 1, currentState: Submitted", "clusterID": "mycluster"}
2022-01-31T17:12:09.254Z INFO placement/tag_rule.go:114 Tag rule application placed {"application": "spark-79df83d15b6843b7bb1cac31e7135e9c", "queue": "root.spark-k8s-data-team-a"}
2022-01-31T17:12:09.254Z INFO objects/queue.go:150 dynamic queue added to scheduler {"queueName": "root.spark-k8s-data-team-a"}
2022-01-31T17:12:09.254Z INFO scheduler/context.go:495 Added application to partition {"applicationID": "spark-79df83d15b6843b7bb1cac31e7135e9c", "partitionName": "[mycluster]default", "requested queue": "root.spark", "placed queue": "root.spark-k8s-data-team-a"}
2022-01-31T17:12:09.254Z INFO callback/scheduler_callback.go:108 Accepting app {"appID": "spark-79df83d15b6843b7bb1cac31e7135e9c"}
2022-01-31T17:12:10.254Z INFO cache/application.go:531 Skip the reservation stage {"appID": "spark-79df83d15b6843b7bb1cac31e7135e9c"}
2022-01-31T17:12:11.256Z INFO objects/application_state.go:128 Application state transition {"appID": "spark-79df83d15b6843b7bb1cac31e7135e9c", "source": "New", "destination": "Accepted", "event": "runApplication"}
2022-01-31T17:12:11.256Z INFO objects/application.go:531 Ask added successfully to application {"appID": "spark-79df83d15b6843b7bb1cac31e7135e9c", "ask": "9d00be6e-8c41-4c35-a089-5b6932e339ac", "placeholder": false, "pendingDelta": "map[memory:12885 vcore:4000]"}
2022-01-31T17:12:15.360Z INFO cache/nodes.go:112 adding node to context {"nodeName": "ip-10-1-10-119.eu-west-1.compute.internal", "nodeLabels": "{\"karpenter.sh/capacity-type\":\"spot\",\"karpenter.sh/provisioner-name\":\"default\",\"node.kubernetes.io/instance-type\":\"m5.4xlarge\",\"topology.kubernetes.io/zone\":\"eu-west-1a\"}", "schedulable": true}
2022-01-31T17:12:15.361Z INFO cache/node.go:148 node recovering {"nodeID": "ip-10-1-10-119.eu-west-1.compute.internal", "schedulable": true}
2022-01-31T17:12:15.361Z INFO scheduler/partition.go:548 adding node to partition {"partition": "[mycluster]default", "nodeID": "ip-10-1-10-119.eu-west-1.compute.internal"}
2022-01-31T17:12:15.362Z INFO scheduler/partition.go:613 Updated available resources from added node {"partitionName": "[mycluster]default", "nodeID": "ip-10-1-10-119.eu-west-1.compute.internal", "partitionResource": "map[attachable-volumes-aws-ebs:50 ephemeral-storage:142784976248 hugepages-1Gi:0 hugepages-2Mi:0 memory:86159 pods:321 vcore:21850]"}
2022-01-31T17:12:15.363Z INFO scheduler/context.go:592 successfully added node {"nodeID": "ip-10-1-10-119.eu-west-1.compute.internal", "partition": "[mycluster]default"}
2022-01-31T17:12:15.375Z ERROR external/scheduler_cache.go:203 pod updated on a different node than previously added to {"pod": "9d00be6e-8c41-4c35-a089-5b6932e339ac"}
github.com/apache/incubator-yunikorn-k8shim/pkg/cache/external.(*SchedulerCache).UpdatePod
/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/pkg/cache/external/scheduler_cache.go:203
github.com/apache/incubator-yunikorn-k8shim/pkg/cache.(*Context).updatePodInCache
/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/pkg/cache/context.go:253
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate
/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/client-go/tools/cache/controller.go:238
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnUpdate
/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/client-go/tools/cache/controller.go:273
k8s.io/client-go/tools/cache.(*processorListener).run.func1
/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/client-go/tools/cache/shared_informer.go:775
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run
/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/client-go/tools/cache/shared_informer.go:771
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1
/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:73
2022-01-31T17:12:15.376Z ERROR external/scheduler_cache.go:204 scheduler cache is corrupted and can badly affect scheduling decisions
github.com/apache/incubator-yunikorn-k8shim/pkg/cache/external.(*SchedulerCache).UpdatePod
/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/pkg/cache/external/scheduler_cache.go:204
github.com/apache/incubator-yunikorn-k8shim/pkg/cache.(*Context).updatePodInCache
/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/pkg/cache/context.go:253
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate
/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/client-go/tools/cache/controller.go:238
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnUpdate
/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/client-go/tools/cache/controller.go:273
k8s.io/client-go/tools/cache.(*processorListener).run.func1
/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/client-go/tools/cache/shared_informer.go:775
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run
/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/client-go/tools/cache/shared_informer.go:771
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1
/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:73
2022-01-31T17:12:38.125Z INFO configs/configwatcher.go:143 config watcher timed out
However, in my recent tests with the following setup, I can confirm that Karpenter works well with the Apache YuniKorn scheduler.
Apache YuniKorn version: 1.1.0, deployed as the default scheduler (this overrides the k8s default scheduler)
Karpenter version: v0.20.0
EMR on EKS Spark jobs are working as expected with Apache YuniKorn gang scheduling along with Karpenter autoscaling.
If you are interested in how to set up Karpenter with Apache YuniKorn (with gang scheduling), you can refer to the Data on EKS docs (https://github.com/awslabs/data-on-eks/tree/main/analytics/terraform/emr-eks-karpenter).
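As a rough sketch of what the gang-scheduling side involves, YuniKorn drives gang scheduling through annotations on the driver pod; something along these lines (task-group names, sizes, and the timeout are illustrative values based on the YuniKorn docs, not copied from the setup above):

```yaml
# Illustrative Spark driver pod annotated for YuniKorn gang scheduling.
# YuniKorn creates placeholder pods for each task group, which is what
# drives Karpenter to provision capacity for the whole gang up front.
apiVersion: v1
kind: Pod
metadata:
  name: spark-driver-demo
  labels:
    applicationId: spark-demo-app        # groups pods into one YuniKorn app
    queue: root.spark-k8s-data-team-a
  annotations:
    yunikorn.apache.org/task-group-name: spark-driver
    yunikorn.apache.org/schedulingPolicyParameters: "placeholderTimeoutInSeconds=60"
    yunikorn.apache.org/task-groups: |
      [{
        "name": "spark-driver",
        "minMember": 1,
        "minResource": {"cpu": "1", "memory": "4Gi"}
      }, {
        "name": "spark-executor",
        "minMember": 20,
        "minResource": {"cpu": "1", "memory": "4Gi"}
      }]
spec:
  schedulerName: yunikorn
  containers:
    - name: driver
      image: public.ecr.aws/docker/library/busybox:latest
      command: ["sleep", "3600"]
```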
Logs from an EMR on EKS Spark job with 1 driver and 20 executors scheduled by Apache YuniKorn
Karpenter logs
2022-12-22T17:50:54.846Z DEBUG controller.aws deleted launch template {"commit": "f60dacd"}
2022-12-22T17:51:25.330Z INFO controller.provisioner found provisionable pod(s) {"commit": "f60dacd", "pods": 21}
2022-12-22T17:51:25.331Z INFO controller.provisioner computed new node(s) to fit pod(s) {"commit": "f60dacd", "nodes": 2, "pods": 21}
2022-12-22T17:51:25.331Z INFO controller.provisioner launching node with 6 pods requesting {"cpu":"7355m","memory":"92280Mi","pods":"9"} from types r5d.4xlarge, r5d.8xlarge {"commit": "f60dacd", "provisioner": "spark-memory-optimized"}
2022-12-22T17:51:25.351Z INFO controller.provisioner launching node with 15 pods requesting {"cpu":"18155m","memory":"230520Mi","pods":"18"} from types r5d.8xlarge {"commit": "f60dacd", "provisioner": "spark-memory-optimized"}
2022-12-22T17:51:25.773Z DEBUG controller.provisioner.cloudprovider created launch template {"commit": "f60dacd", "provisioner": "spark-memory-optimized", "launch-template-name": "Karpenter-emr-eks-karpenter-2497088801825500229", "launch-template-id": "lt-030a5b323b302d61a"}
2022-12-22T17:51:27.766Z INFO controller.provisioner.cloudprovider launched new instance {"commit": "f60dacd", "provisioner": "spark-memory-optimized", "launched-instance": "i-0ef04f248f444ec79", "hostname": "ip-10-1-126-194.us-west-2.compute.internal", "type": "r5d.4xlarge", "zone": "us-west-2b", "capacity-type": "spot"}
2022-12-22T17:51:29.941Z INFO controller.provisioner.cloudprovider launched new instance {"commit": "f60dacd", "provisioner": "spark-memory-optimized", "launched-instance": "i-0d948a89719abfacf", "hostname": "ip-10-1-67-206.us-west-2.compute.internal", "type": "r5d.8xlarge", "zone": "us-west-2b", "capacity-type": "spot"}
2022-12-22T17:59:04.511Z INFO controller.node added TTL to empty node {"commit": "f60dacd", "node": "ip-10-1-67-206.us-west-2.compute.internal"}
The pods placed by Apache YuniKorn triggered Karpenter to provision the nodes required to run the Spark job.

Happy to close this issue.
> I've tested Karpenter with Volcano successfully. There was one issue (PR to fix in https://github.com/volcano-sh/volcano/pull/2602) that was causing Volcano to use an unconventional `Reason` that prevented Karpenter from triggering scale-up, but once this PR lands things should be working again.
@tgaddair Do you mind sharing your experience using Volcano with cluster autoscalers? We are building a platform supporting multi-tenant jobs and planning to use Volcano for gang scheduling; however, there is a very limited amount of info on best practices for scaling clusters while supporting gang scheduling, even in the Volcano docs. What made you choose Karpenter over cluster-autoscaler? If you have been successfully running a Volcano + Karpenter setup, what is your experience? This topic could be a separate post somewhere; it would be pretty valuable.
@tgaddair please share your Volcano + Karpenter setup when you get a chance. It would be very valuable for others exploring the two and would save many people a lot of effort.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Does Karpenter have plans to support something like this (generally for AI jobs)? https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/provisioning-request.md
cc @ellistarn
> Does Karpenter have plans to support something like this (generally for AI jobs)? https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/provisioning-request.md
Plus one on this question. To get correct gang-scheduling behavior with on-demand compute, the underlying cloud provider has to support "all-or-nothing" provisioning of sets of VMs. (You should be able to ask for N VMs in a single request and either get N or 0, but nothing in between.) GCP makes this possible with Dynamic Workload Scheduler. I'm not sure what the status is for AWS and other cloud providers.
I also tested Karpenter with Volcano; however, I noticed Karpenter can stall once some of the nodes are provisioned or already available, and it will not try to scale up to bring in more nodes.
Steps to reproduce (Karpenter 1.0+):
- Create a NodePool for g5 instances with limit `nvidia.com/gpu: 1`
- Create a PodGroup for 2 pods asking for g5.xlarge on-demand instances (and resources.limits `nvidia.com/gpu: 1`)
- Karpenter will provision one node
- Increase the limit on the NodePool: set `nvidia.com/gpu: 2`
- Karpenter does not try to bring in another node
Contrary to this, if we set no limits and re-create the whole PodGroup fresh, Karpenter is able to consider both pending pods simultaneously and provisions 2 nodes. But I am not sure if this behavior is deterministic.
It would help if Karpenter had an all-or-nothing mode, as AWS Batch can do this on EKS: https://aws.amazon.com/blogs/hpc/gang-scheduling-pods-on-amazon-eks-using-aws-batch-multi-node-processing-jobs/
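For reference, a minimal sketch of the manifests behind the steps above (API versions match Karpenter 1.0+ and Volcano; the names and the EC2NodeClass reference are illustrative):

```yaml
# Step 1: NodePool capped at one GPU; raising this limit to "2"
# (step 4) is what failed to trigger the second node.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  limits:
    nvidia.com/gpu: "1"
  template:
    spec:
      nodeClassRef:               # assumes an EC2NodeClass named "default"
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g5.xlarge"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
---
# Step 2: Volcano PodGroup requiring both pods to be placed together;
# member pods reference it via the scheduling.k8s.io/group-name annotation.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: gpu-gang
spec:
  minMember: 2
```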
> I also tested Karpenter with Volcano; however, I noticed Karpenter can stall once some of the nodes are provisioned or already available, and it will not try to scale up to bring in more nodes.
Which Volcano version are you using?
> Which Volcano version are you using?
1.10.0
> Which Volcano version are you using?
> 1.10.0
Sorry, I don't quite understand step 4. It seems you don't create any new pods, so why would you expect Karpenter to scale up?
/priority awaiting-more-evidence
/triage accepted
Hi @jonathan-innis, thanks for your reply, hope you're all doing well :) With the rapid development of LLMs, the demand for distributed training and inference is growing exponentially. Traditional pod scheduling and scaling methods face challenges in large-model scenarios, particularly due to the lack of gang-scheduling capabilities. While custom schedulers such as Volcano have effectively addressed gang scheduling, scaling issues persist.
For instance, consider a distributed inference task requiring 4 pod replicas. In a cluster with only two nodes, each capable of hosting one replica, users expect all four replicas to be scheduled simultaneously (gang scheduling). However, current schedulers may mark two pods as Schedulable and the remaining two as Unschedulable. Karpenter, focusing solely on individual pod conditions, might only process the two Unschedulable pods. Seeing two available nodes, Karpenter might assume these pods can be scheduled without triggering scaling, resulting in a deadlock.
To address this, Karpenter could enhance its scaling decisions by considering the collective state of a pod group. This improvement would significantly benefit LLM scenarios and the broader ecosystem. Additionally, since users might adopt different schedulers, Karpenter's simulated scheduling plugins may not align with the actual scheduler's view, leading to inaccurate scaling. Therefore, opening Karpenter's interfaces to support simulated scheduling across different schedulers would provide a more comprehensive solution.
I've also posted several issues that users are concerned about; hope that helps!
Problems users encountered when using Karpenter with a custom scheduler: https://github.com/volcano-sh/volcano/issues/4030 https://github.com/volcano-sh/volcano/issues/3910 https://github.com/volcano-sh/volcano/issues/4041
> Sorry, I don't quite understand step 4. It seems you don't create any new pods, so why would you expect Karpenter to scale up?
Because the pod group already has 2 pods pending (step 2), but initially the Karpenter NodePool only allowed scaling up to 1 g5.xlarge (steps 1 and 3). Once that limit is raised (step 4), the other pending pod should get Karpenter to scale up another g5.xlarge.
This is just an empirical observation; I have not read through Karpenter's algorithm to say for certain that this is a known limitation.
Hi, we've hit two very similar issues when testing Karpenter with Volcano PodGroups:
- Pending pods that are unschedulable due to queue capacity still trigger scale-ups.
- Pending pods that are unschedulable due to queue capacity prevent their node from being disrupted:
`Cannot disrupt Node: state node is nominated for a pending pod`
I'm not sure if support for this should be done on the Volcano side or the Karpenter side. From the Karpenter side, it could look at the pod conditions:
```yaml
status:
  conditions:
    - lastProbeTime: null
      lastTransitionTime: "2025-04-01T12:39:59Z"
      message: 'pod group is not ready, 1 Pending, 1 minAvailable, 2 Running; Pending:
        1 Unschedulable'
      reason: Unschedulable
      status: "False"
      type: PodScheduled
```
but that doesn't seem really standard...
@jonathan-innis Hi, it seems there are already many cases showing that users want to use Volcano + Karpenter but run into missing functionality that requires Karpenter to adapt. I am happy to contribute :) Can this issue be moved forward now?