tasks in gang unschedulable
What happened: I just downloaded the master branch of Volcano and installed it from the Helm chart. All of the Volcano-related pods are running, but when I run the example under example/deployment, it does not work for me. It reports "1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Undetermined", but it works well if I create a pod with the default scheduler.
The status of the podgroup:
The description of it:
The status of the pod:
The description of the pod:
What you expected to happen: it should be running, right?
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
- Volcano Version:
- Kubernetes version (use `kubectl version`):
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release):
- Kernel (e.g. `uname -a`):
- Install tools:
- Others:
/assign @waiterQ please take a look :)
Copy that.
The reason the pod is pending is NotEnoughResources, which means the cluster does not have enough CPU or memory resources. From the node info (`kubectl describe node`) you can get the node capacity, the workloads' requests, and how many resources are still allocatable. Removing workloads you don't need, or lowering the example deployment's `containers.resources.requests`, can help you start the example.
Hope you enjoy Volcano :)
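As a sketch of both suggestions (the deployment name `deploy-with-volcano` matches the example pod seen in the scheduler logs below; the container index 0 is an assumption):

```sh
# Check each node's capacity and what is still allocatable.
kubectl describe node <node-name>

# Lower the example deployment's CPU request in place.
kubectl patch deployment deploy-with-volcano --type=json \
  -p '[{"op":"replace","path":"/spec/template/spec/containers/0/resources/requests/cpu","value":"100m"}]'
```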
I am sure the cluster has enough CPU/memory resources; it works well if I remove the schedulerName field. I also changed the requested CPU to 100m, but it still reports the same error.
My cluster info:
The deployment yaml:
The event of the pod:
OK, I misunderstood what you meant, sorry. Please update deployment.apps/volcano-scheduler `.spec.template.spec.containers.args` from `-v=3` to `-v=5`, then paste the predicate logs of pod/volcano-scheduler, like this:
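For the flag change itself, a minimal sketch (the volcano-system namespace assumes the default Helm install):

```sh
# Edit the scheduler deployment and change the container arg -v=3 -> -v=5,
# then follow the logs while the pending pod is being scheduled.
kubectl -n volcano-system edit deployment volcano-scheduler
kubectl -n volcano-system logs -f deployment/volcano-scheduler
```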
I modified the args from `-v=3` to `-v=5`, but I didn't see logs similar to the ones you sent. I'm posting the logs of volcano-scheduler here; hope it's helpful. Thanks~
```
I0811 07:17:34.057947 1 scheduler.go:93] Start scheduling ...
I0811 07:17:34.058001 1 node_info.go:277] set the node cn-shanghai.10.0.0.27 status to Ready.
I0811 07:17:34.058081 1 node_info.go:277] set the node cn-shanghai.10.0.0.28 status to Ready.
I0811 07:17:34.058149 1 node_info.go:277] set the node cn-shanghai.10.0.0.38 status to Ready.
I0811 07:17:34.058299 1 node_info.go:277] set the node cn-shanghai.10.0.0.30 status to Ready.
I0811 07:17:34.058386 1 cache.go:971] The priority of job <default/podgroup-a7e08473-32de-4e8f-8ab7-981c42d2e8c6> is </0>
I0811 07:17:34.058424 1 cache.go:1009] There are <1> Jobs, <2> Queues and <4> Nodes in total for scheduling.
I0811 07:17:34.058436 1 session.go:170] Open Session f69a982e-ad70-4826-821f-8705de034a3b with <1> Job and <2> Queues
I0811 07:17:34.058466 1 overcommit.go:72] Enter overcommit plugin ...
I0811 07:17:34.058476 1 overcommit.go:127] Leaving overcommit plugin.
I0811 07:17:34.058495 1 drf.go:204] Total Allocatable cpu 35100.00, memory 179394097152.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00
I0811 07:17:34.060627 1 proportion.go:80] The total resource is <cpu 35100.00, memory 179394097152.00, hugepages-2Mi 0.00, hugepages-1Gi 0.00>
I0811 07:17:34.060651 1 proportion.go:88] The total guarantee resource is <cpu 0.00, memory 0.00>
I0811 07:17:34.060656 1 proportion.go:91] Considering Job <default/podgroup-a7e08473-32de-4e8f-8ab7-981c42d2e8c6>.
I0811 07:17:34.060664 1 proportion.go:124] Added Queue attributes.
I0811 07:17:34.060683 1 proportion.go:182] Considering Queue : weight <1>, total weight <1>.
I0811 07:17:34.060694 1 proportion.go:196] Format queue deserved resource to <cpu 100.00, memory 0.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00>
I0811 07:17:34.060706 1 proportion.go:200] queue is meet
I0811 07:17:34.060714 1 proportion.go:208] The attributes of queue in proportion: deserved <cpu 100.00, memory 0.00>, realCapability <cpu 2000.00, memory 179394097152.00>, allocate <cpu 0.00, memory 0.00>, request <cpu 100.00, memory 0.00>, share <0.00>
I0811 07:17:34.060728 1 proportion.go:220] Remaining resource is <cpu 35000.00, memory 179394097152.00, hugepages-2Mi 0.00, hugepages-1Gi 0.00>
I0811 07:17:34.060744 1 proportion.go:171] Exiting when total weight is 0
I0811 07:17:34.064142 1 binpack.go:158] Enter binpack plugin ...
I0811 07:17:34.064158 1 binpack.go:177] resources [] record in weight but not found on any node
I0811 07:17:34.064166 1 binpack.go:161] Leaving binpack plugin. binpack.weight[1], binpack.cpu[1], binpack.memory[1], no extend resources. ...
I0811 07:17:34.064174 1 enqueue.go:44] Enter Enqueue ...
I0811 07:17:34.064180 1 enqueue.go:62] Added Queue for Job <default/podgroup-a7e08473-32de-4e8f-8ab7-981c42d2e8c6>
I0811 07:17:34.064186 1 enqueue.go:73] Added Job <default/podgroup-a7e08473-32de-4e8f-8ab7-981c42d2e8c6> into Queue
I0811 07:17:34.064193 1 enqueue.go:78] Try to enqueue PodGroup to 1 Queues
I0811 07:17:34.064205 1 overcommit.go:114] Resource in cluster is overused, reject job <default/podgroup-a7e08473-32de-4e8f-8ab7-981c42d2e8c6> to be inqueue
I0811 07:17:34.064212 1 enqueue.go:103] Leaving Enqueue ...
I0811 07:17:34.064219 1 allocate.go:43] Enter Allocate ...
I0811 07:17:34.064228 1 allocate.go:62] Job <default/podgroup-a7e08473-32de-4e8f-8ab7-981c42d2e8c6> Queue skip allocate, reason: job status is pending.
I0811 07:17:34.064233 1 allocate.go:96] Try to allocate resource to 0 Namespaces
I0811 07:17:34.064238 1 allocate.go:111] unlockedNode ID: a589e2f7-0147-486e-ba3d-3491904f241e, Name: cn-shanghai.10.0.0.27
I0811 07:17:34.064245 1 allocate.go:111] unlockedNode ID: c19bb0e6-a34a-44fa-b983-0ea2f382f34a, Name: cn-shanghai.10.0.0.28
I0811 07:17:34.064250 1 allocate.go:111] unlockedNode ID: 25fb647e-5734-43aa-9a61-6367049e200c, Name: cn-shanghai.10.0.0.38
I0811 07:17:34.064255 1 allocate.go:111] unlockedNode ID: b29ed0d2-85c1-4a51-869d-bb9e348564bb, Name: cn-shanghai.10.0.0.30
I0811 07:17:34.064262 1 allocate.go:283] Leaving Allocate ...
I0811 07:17:34.064268 1 backfill.go:40] Enter Backfill ...
I0811 07:17:34.064272 1 backfill.go:90] Leaving Backfill ...
I0811 07:17:34.064376 1 cache.go:773] task unscheduleable default/deploy-with-volcano-d964bb946-t8s8m, message: 1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Undetermined, skip by no condition update
I0811 07:17:34.064387 1 session.go:192] Close Session f69a982e-ad70-4826-821f-8705de034a3b
I0811 07:17:34.064392 1 scheduler.go:112] End scheduling ...
```
WIP, I'm analysing the provided info; sorry for accidentally touching unassign-me.
What's the k8s version of your cluster?
If possible, the cluster's node info is also needed.
My cluster's k8s version is 1.20.11 on Aliyun Cloud. The node info is below (please let me know if it meets your needs, thanks~):
```yaml
nodeInfo:
  architecture: amd64
  bootID: 495b453e-5b95-4691-86e8-9bda2164f803
  containerRuntimeVersion: docker://19.3.15
  kernelVersion: 4.19.91-25.6.al7.x86_64
  kubeProxyVersion: v1.20.11-aliyun.1
  kubeletVersion: v1.20.11-aliyun.1
  machineID: "xxxxx"
  operatingSystem: linux
  osImage: Alibaba Cloud Linux (Aliyun Linux) 2.1903 LTS (Hunting Beagle)
  systemUUID: 45be8658-0d19-466d-bd5b-1303f6805ace
```
Can you help to post the Volcano configuration? It seems like the job was rejected by the overcommit plugin:

```
I0811 07:17:34.064205 1 overcommit.go:114] Resource in cluster is overused, reject job <default/podgroup-a7e08473-32de-4e8f-8ab7-981c42d2e8c6> to be inqueue
```
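For example (a sketch; the ConfigMap name and namespace assume the default Helm install):

```sh
kubectl -n volcano-system get configmap volcano-scheduler-configmap -o yaml
```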
Do you mean the volcano_scheduler.conf?
Any update? I met the same issue here.
Log printing needs to be added in pkg/scheduler/plugins/overcommit/overcommit.go:

```go
klog.V(4).Infof("node(%v) Allocatable:%s, Used:%s", node.Name, node.Allocatable, node.Used)
klog.V(4).Infof("idleResource:%s, total:%s, overCommitFactor:%v, used:%s", op.idleResource, total, op.overCommitFactor, used)
klog.V(4).Infof("jobMinReq:%s, idle:%s", jobMinReq, idle)
```

Then recompile the scheduler image, replace it, and show the logs again.
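A rough sketch of that rebuild-and-replace step (the make target, image name, tag, and container name are assumptions; adjust to your registry):

```sh
# Rebuild the scheduler with the extra klog lines, push the image somewhere
# the cluster can pull from, then swap it into the running deployment.
make images                            # assumed Makefile target in the volcano repo
docker push <your-registry>/vc-scheduler:<tag>
kubectl -n volcano-system set image deployment/volcano-scheduler \
  volcano-scheduler=<your-registry>/vc-scheduler:<tag>
```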
Is it solved?
Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).
I met the same issue here.
I took the following steps and it worked with Volcano v1.7.0.

I ran `kubectl get queue default -o yaml` and found that the queue had no information about `allocated`. The scheduler reported the same errors as you: `Resource in cluster is overused, reject job` and `queue <xxx> is meet`.

My install had been upgraded from v1.6.0; maybe I installed the previous version and didn't completely uninstall it. So I reinstalled Volcano v1.7.0, ran `kubectl get queue default -o yaml` again, and this time found the `allocated` information. Finally it works well.

I hope it will help you.
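For comparison, a healthy queue shows an `allocated` block in its status, roughly like this (a sketch; the values are illustrative):

```sh
kubectl get queue default -o yaml
# expected to contain, among other fields:
#   status:
#     allocated:
#       cpu: "0"
#       memory: "0"
#     state: Open
```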
Can you tell us which workloads you use? The `Resource in cluster is overused` problem shows up mostly when there is a problem with the queue's PodGroups, so checking the PodGroups' resources can be a better way to figure out where things are broken.

Another simple plan is to abandon the overcommit plugin (if you don't need rigorous queue resource limits); see the sketch below. For now, Volcano still cannot adjust very well to workloads whose resource usage is always changing. Hope this can help you : )
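If you do drop it, a minimal sketch of a volcano-scheduler.conf without overcommit (the surrounding plugin list assumes the default Helm configuration; keep whatever else you have):

```yaml
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: conformance
- plugins:
  - name: drf
  - name: predicates
  - name: proportion
  - name: nodeorder
  - name: binpack
```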
I use this
https://github.com/volcano-sh/volcano/blob/79e6b749f5a7d4b77deb838632abf238b8754c66/example/task-start-dependency/mpi.yaml
You can check the PodGroup information for all Inqueue states in the current system to see if there is a PodGroup leak that occupies system resources. @leonharetd
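For example (a sketch):

```sh
# List every PodGroup and its phase; stale Inqueue/Running groups with no pods
# behind them would be the kind of leak described above.
kubectl get podgroups -A -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase
```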
Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).
Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗