volcano tasks in gang unschedulable

What happened: I just downloaded the master branch of Volcano and installed it from the Helm chart. All of the Volcano-related pods are running, but the example under example/deployment does not work for me. It reports "1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Undetermined", but it works well if I create a pod with the default scheduler.

the status of podgroup: the description of it:

the status of pod: description of pod:

What you expected to happen: It should be running, right?

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

Volcano Version:
Kubernetes version (use kubectl version):
Cloud provider or hardware configuration:
OS (e.g. from /etc/os-release):
Kernel (e.g. uname -a):
Install tools:
Others:

Aug 08 '22 15:08 bingsenmu

/assign @waiterQ please help to take a look :)

Aug 10 '22 01:08 william-wang

Copy that. The reason shown for the pending pod is NotEnoughResources, which means the cluster does not have enough CPU or memory resources. From kubectl describe node you can get each node's capacity, the resources requested by its workloads, and how much is allocatable. Removing workloads you do not need, or lowering containers.resources.requests in the example deployment, can help you start the example.

Hope you enjoy Volcano :)

Aug 10 '22 03:08 waiterQ

I am sure the cluster has enough CPU and memory resources: it works well if I remove the schedulerName field. I also changed the CPU request to 100m, and it still reports the same error. My cluster info: the deployment yaml: the event of pod:

Aug 10 '22 11:08 bingsenmu

OK, I misunderstood what you meant, sorry. Please change -v=3 to -v=5 in deployment.apps/volcano-scheduler.spec.template.spec.containers.args, then paste the predicate logs of pod/volcano-scheduler, like this:

Aug 11 '22 02:08 waiterQ

I changed the args from -v=3 to -v=5, but I didn't see logs like the ones you sent. I am posting the volcano-scheduler logs here; I hope they are helpful. Thanks~

I0811 07:17:34.057947 1 scheduler.go:93] Start scheduling ...
I0811 07:17:34.058001 1 node_info.go:277] set the node cn-shanghai.10.0.0.27 status to Ready.
I0811 07:17:34.058081 1 node_info.go:277] set the node cn-shanghai.10.0.0.28 status to Ready.
I0811 07:17:34.058149 1 node_info.go:277] set the node cn-shanghai.10.0.0.38 status to Ready.
I0811 07:17:34.058299 1 node_info.go:277] set the node cn-shanghai.10.0.0.30 status to Ready.
I0811 07:17:34.058386 1 cache.go:971] The priority of job <default/podgroup-a7e08473-32de-4e8f-8ab7-981c42d2e8c6> is </0>
I0811 07:17:34.058424 1 cache.go:1009] There are <1> Jobs, <2> Queues and <4> Nodes in total for scheduling.
I0811 07:17:34.058436 1 session.go:170] Open Session f69a982e-ad70-4826-821f-8705de034a3b with <1> Job and <2> Queues
I0811 07:17:34.058466 1 overcommit.go:72] Enter overcommit plugin ...
I0811 07:17:34.058476 1 overcommit.go:127] Leaving overcommit plugin.
I0811 07:17:34.058495 1 drf.go:204] Total Allocatable cpu 35100.00, memory 179394097152.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00
I0811 07:17:34.060627 1 proportion.go:80] The total resource is <cpu 35100.00, memory 179394097152.00, hugepages-2Mi 0.00, hugepages-1Gi 0.00>
I0811 07:17:34.060651 1 proportion.go:88] The total guarantee resource is <cpu 0.00, memory 0.00>
I0811 07:17:34.060656 1 proportion.go:91] Considering Job <default/podgroup-a7e08473-32de-4e8f-8ab7-981c42d2e8c6>.
I0811 07:17:34.060664 1 proportion.go:124] Added Queue attributes.
I0811 07:17:34.060683 1 proportion.go:182] Considering Queue : weight <1>, total weight <1>.
I0811 07:17:34.060694 1 proportion.go:196] Format queue deserved resource to <cpu 100.00, memory 0.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00>
I0811 07:17:34.060706 1 proportion.go:200] queue is meet
I0811 07:17:34.060714 1 proportion.go:208] The attributes of queue in proportion: deserved <cpu 100.00, memory 0.00>, realCapability <cpu 2000.00, memory 179394097152.00>, allocate <cpu 0.00, memory 0.00>, request <cpu 100.00, memory 0.00>, share <0.00>
I0811 07:17:34.060728 1 proportion.go:220] Remaining resource is <cpu 35000.00, memory 179394097152.00, hugepages-2Mi 0.00, hugepages-1Gi 0.00>
I0811 07:17:34.060744 1 proportion.go:171] Exiting when total weight is 0
I0811 07:17:34.064142 1 binpack.go:158] Enter binpack plugin ...
I0811 07:17:34.064158 1 binpack.go:177] resources [] record in weight but not found on any node
I0811 07:17:34.064166 1 binpack.go:161] Leaving binpack plugin. binpack.weight[1], binpack.cpu[1], binpack.memory[1], no extend resources. ...
I0811 07:17:34.064174 1 enqueue.go:44] Enter Enqueue ...
I0811 07:17:34.064180 1 enqueue.go:62] Added Queue for Job <default/podgroup-a7e08473-32de-4e8f-8ab7-981c42d2e8c6>
I0811 07:17:34.064186 1 enqueue.go:73] Added Job <default/podgroup-a7e08473-32de-4e8f-8ab7-981c42d2e8c6> into Queue
I0811 07:17:34.064193 1 enqueue.go:78] Try to enqueue PodGroup to 1 Queues
I0811 07:17:34.064205 1 overcommit.go:114] Resource in cluster is overused, reject job <default/podgroup-a7e08473-32de-4e8f-8ab7-981c42d2e8c6> to be inqueue
I0811 07:17:34.064212 1 enqueue.go:103] Leaving Enqueue ...
I0811 07:17:34.064219 1 allocate.go:43] Enter Allocate ...
I0811 07:17:34.064228 1 allocate.go:62] Job <default/podgroup-a7e08473-32de-4e8f-8ab7-981c42d2e8c6> Queue skip allocate, reason: job status is pending.
I0811 07:17:34.064233 1 allocate.go:96] Try to allocate resource to 0 Namespaces
I0811 07:17:34.064238 1 allocate.go:111] unlockedNode ID: a589e2f7-0147-486e-ba3d-3491904f241e, Name: cn-shanghai.10.0.0.27
I0811 07:17:34.064245 1 allocate.go:111] unlockedNode ID: c19bb0e6-a34a-44fa-b983-0ea2f382f34a, Name: cn-shanghai.10.0.0.28
I0811 07:17:34.064250 1 allocate.go:111] unlockedNode ID: 25fb647e-5734-43aa-9a61-6367049e200c, Name: cn-shanghai.10.0.0.38
I0811 07:17:34.064255 1 allocate.go:111] unlockedNode ID: b29ed0d2-85c1-4a51-869d-bb9e348564bb, Name: cn-shanghai.10.0.0.30
I0811 07:17:34.064262 1 allocate.go:283] Leaving Allocate ...
I0811 07:17:34.064268 1 backfill.go:40] Enter Backfill ...
I0811 07:17:34.064272 1 backfill.go:90] Leaving Backfill ...
I0811 07:17:34.064376 1 cache.go:773] task unscheduleable default/deploy-with-volcano-d964bb946-t8s8m, message: 1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Undetermined, skip by no condition update
I0811 07:17:34.064387 1 session.go:192] Close Session f69a982e-ad70-4826-821f-8705de034a3b
I0811 07:17:34.064392 1 scheduler.go:112] End scheduling ...

Aug 11 '22 07:08 bingsenmu

WIP: I'm analysing the information you provided. Sorry for touching unassign-me.

Aug 12 '22 02:08 waiterQ

What is the Kubernetes version of your cluster?

Aug 12 '22 03:08 waiterQ

If possible, the cluster's node info is needed as well.

Aug 12 '22 04:08 waiterQ

My cluster's Kubernetes version is 1.20.11 on Aliyun cloud. The node info is below (please let me know if it meets your needs, thanks~):

nodeInfo:
  architecture: amd64
  bootID: 495b453e-5b95-4691-86e8-9bda2164f803
  containerRuntimeVersion: docker://19.3.15
  kernelVersion: 4.19.91-25.6.al7.x86_64
  kubeProxyVersion: v1.20.11-aliyun.1
  kubeletVersion: v1.20.11-aliyun.1
  machineID: "xxxxx"
  operatingSystem: linux
  osImage: Alibaba Cloud Linux (Aliyun Linux) 2.1903 LTS (Hunting Beagle)
  systemUUID: 45be8658-0d19-466d-bd5b-1303f6805ace

Aug 15 '22 13:08 bingsenmu

Could you post the Volcano configuration? It looks like the job was rejected by the overcommit plugin.

I0811 07:17:34.064205 1 overcommit.go:114] Resource in cluster is overused, reject job <default/podgroup-a7e08473-32de-4e8f-8ab7-981c42d2e8c6> to be inqueue
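Spotting a line like this in a -v=5 log is easy to miss by eye. The following is a minimal sketch of how such rejection lines could be pulled out of a saved scheduler log. The Rejection type and the findRejections helper are hypothetical names made up for this illustration, and the pattern only assumes the klog line layout visible in this thread ("file.go:line] ... reject job <namespace/name>").

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// rejectLine matches klog-style lines such as
//   I0811 07:17:34.064205 1 overcommit.go:114] ... reject job <ns/name> to be inqueue
// Group 1 is the source file name without ".go", group 2 is the job key.
var rejectLine = regexp.MustCompile(`(\w+)\.go:\d+\] .*reject job <([^>]+)>`)

// Rejection is a hypothetical record used only in this sketch.
type Rejection struct {
	Source string // source file that logged the rejection, e.g. "overcommit"
	Job    string // job key in namespace/name form
}

// findRejections scans a log line by line and collects every rejection.
func findRejections(log string) []Rejection {
	var out []Rejection
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		if m := rejectLine.FindStringSubmatch(sc.Text()); m != nil {
			out = append(out, Rejection{Source: m[1], Job: m[2]})
		}
	}
	return out
}

func main() {
	// Three lines copied from the scheduler log posted above.
	sample := `I0811 07:17:34.064193 1 enqueue.go:78] Try to enqueue PodGroup to 1 Queues
I0811 07:17:34.064205 1 overcommit.go:114] Resource in cluster is overused, reject job <default/podgroup-a7e08473-32de-4e8f-8ab7-981c42d2e8c6> to be inqueue
I0811 07:17:34.064212 1 enqueue.go:103] Leaving Enqueue ...`
	for _, r := range findRejections(sample) {
		fmt.Printf("%s rejected job %s\n", r.Source, r.Job)
	}
}
```

The source file name is only a hint about which plugin made the decision; it is not an official identifier.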

Aug 18 '22 08:08 kerthcet

Do you mean volcano_scheduler.conf?

Aug 29 '22 13:08 bingsenmu

any update? I met the same issue here.

Aug 31 '22 04:08 cauwulixuan

Log printing needs to be added in pkg/scheduler/plugins/overcommit/overcommit.go:

klog.V(4).Infof("node(%v) Allocatable:%s, Used:%s", node.Name, node.Allocatable, node.Used)
klog.V(4).Infof("idleResource:%s, total:%s, overCommitFactor:%v, used:%s", op.idleResource, total, op.overCommitFactor, used)
klog.V(4).Infof("jobMinReq:%s, idle:%s", jobMinReq, idle)

Then recompile the scheduler image, replace the running one with it, and show the logs again.
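To make clear what those values are expected to reveal, here is a simplified, self-contained sketch of the kind of admission check the variable names suggest: idle = total * overCommitFactor - used, and a job is let into the queue only if its minimum request (plus whatever is already inqueue) fits into idle. This is an approximation written for illustration, not the plugin's actual code. The Resource type, idleResource and canEnqueue are hypothetical, the factor 1.2 is an assumed value, and the "used" figures are invented because the thread never shows them. Only the totals and the 100m request come from the log above.

```go
package main

import "fmt"

// Resource is a hypothetical, simplified stand-in for the scheduler's
// resource type: CPU in millicores and memory in bytes.
type Resource struct {
	MilliCPU float64
	Memory   float64
}

// Add returns r + o.
func (r Resource) Add(o Resource) Resource {
	return Resource{r.MilliCPU + o.MilliCPU, r.Memory + o.Memory}
}

// Sub returns r - o.
func (r Resource) Sub(o Resource) Resource {
	return Resource{r.MilliCPU - o.MilliCPU, r.Memory - o.Memory}
}

// Scale returns r with every dimension multiplied by factor.
func (r Resource) Scale(factor float64) Resource {
	return Resource{r.MilliCPU * factor, r.Memory * factor}
}

// LessEqual reports whether every dimension of r is <= the same dimension of o.
func (r Resource) LessEqual(o Resource) bool {
	return r.MilliCPU <= o.MilliCPU && r.Memory <= o.Memory
}

// idleResource computes total*overCommitFactor - used.
func idleResource(total, used Resource, overCommitFactor float64) Resource {
	return total.Scale(overCommitFactor).Sub(used)
}

// canEnqueue permits a job when inqueue + jobMinReq still fits into idle.
func canEnqueue(jobMinReq, inqueue, idle Resource) bool {
	return inqueue.Add(jobMinReq).LessEqual(idle)
}

func main() {
	// Totals and the job request are taken from the scheduler log in this thread.
	total := Resource{MilliCPU: 35100, Memory: 179394097152}
	jobMinReq := Resource{MilliCPU: 100}

	// Hypothetical "used" values: the real ones are what the extra logging would show.
	for _, used := range []Resource{
		{MilliCPU: 1000, Memory: 2 << 30},
		{MilliCPU: 42100, Memory: 2 << 30},
	} {
		idle := idleResource(total, used, 1.2) // 1.2 is an assumed factor
		fmt.Printf("used=%+v idle=%+v permit=%v\n", used, idle, canEnqueue(jobMinReq, Resource{}, idle))
	}
}
```

If the real "used" or inqueue figure turns out to be far larger than the pods actually running, that would point at stale accounting rather than a genuinely full cluster.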

Sep 02 '22 09:09 waiterQ

Is it solved?

Oct 19 '22 07:10 jiamin13579

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

Jan 21 '23 10:01 stale[bot]

I met the same issue here.

Feb 27 '23 08:02 leonharetd

I took the following steps and it worked with Volcano v1.7.0. Running kubectl get queue default -o yaml, I found that the queue had no allocated information. The dispatcher reported the same errors as yours:

Resource in cluster is overused, reject job  or queue <xxx> is meet

My installation was upgraded from v1.6.0; maybe I had installed the previous version and did not completely uninstall it. I then reinstalled Volcano v1.7.0, and kubectl get queue default -o yaml showed the allocated information. Now it works well. I hope this helps you.

Feb 28 '23 03:02 leonharetd

Can you tell us which workloads you use? The "Resource in cluster is overused" problem mostly shows up together with a problem in the queue's podGroups, so checking the podGroups' resources is a better way to find out what is broken. Another simple option is to drop the overcommit plugin (if you do not need a rigorous limit on queue resources).

For now, Volcano still cannot adapt very well to workloads whose resources change all the time. Hope this helps : )

Feb 28 '23 07:02 waiterQ

I use this

https://github.com/volcano-sh/volcano/blob/79e6b749f5a7d4b77deb838632abf238b8754c66/example/task-start-dependency/mpi.yaml

Feb 28 '23 07:02 leonharetd

You can check all podgroups currently in the Inqueue state to see whether a leaked podgroup is occupying system resources. @leonharetd
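As a rough illustration of that check, the sketch below sums the minimum CPU reserved by podgroups in the Inqueue phase, so that a leftover podgroup stands out. The PodGroup struct here is a hypothetical, flattened stand-in (name, phase, and a single CPU figure taken from spec.minResources), and the sample data is invented; in practice the values would be read from the podgroup objects in the cluster.

```go
package main

import "fmt"

// PodGroup is a hypothetical, flattened view of a podgroup used only in this sketch.
type PodGroup struct {
	Name        string
	Phase       string // e.g. "Pending", "Inqueue", "Running"
	MinCPUMilli int64  // CPU millicores from spec.minResources (simplified)
}

// inqueueCPU returns the total minimum CPU held by Inqueue podgroups
// together with their names.
func inqueueCPU(pgs []PodGroup) (total int64, names []string) {
	for _, pg := range pgs {
		if pg.Phase == "Inqueue" {
			total += pg.MinCPUMilli
			names = append(names, pg.Name)
		}
	}
	return total, names
}

func main() {
	// Invented sample data for illustration only.
	pgs := []PodGroup{
		{Name: "old-mpi-job", Phase: "Inqueue", MinCPUMilli: 30000},
		{Name: "deploy-with-volcano", Phase: "Pending", MinCPUMilli: 100},
		{Name: "training-a", Phase: "Running", MinCPUMilli: 4000},
	}
	total, names := inqueueCPU(pgs)
	fmt.Printf("Inqueue podgroups %v reserve %dm CPU\n", names, total)
}
```

An Inqueue podgroup that reserves a large amount but has no corresponding running workload is the kind of leak worth investigating.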

Mar 01 '23 03:03 wangyang0616

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

Jun 10 '23 01:06 stale[bot]

Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗

Aug 10 '23 01:08 stale[bot]
