tasks in gang unschedulable

Open bingsenmu opened this issue 2 years ago • 13 comments

What happened: I downloaded the master branch of Volcano and installed it from the Helm chart. All of the Volcano-related pods are running, but when I run the example under example/deployment, it does not work. It reports "1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Undetermined", but it works well if I create the pod with the default scheduler.

Status of the podgroup: [image]
Description of the podgroup: [image]

Status of the pod: [image]
Description of the pod: [image]

What you expected to happen: It should be running, right?

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Volcano Version:
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

bingsenmu avatar Aug 08 '22 15:08 bingsenmu

/assign @waiterQ please help to take a look :)

william-wang avatar Aug 10 '22 01:08 william-wang

Copy that. The reason shown for the pending pod is NotEnoughResources, which means the cluster does not have enough CPU or memory. From the node info (kubectl describe node) you can see the node's capacity, the workloads' requests, and how many resources are allocatable. Removing workloads you do not need, or lowering containers.resources.requests in the example deployment, can help you start the example.

hope you enjoy volcano :)

waiterQ avatar Aug 10 '22 03:08 waiterQ

I am sure the cluster has enough CPU/memory resources; it works well if I remove the schedulerName field. I also changed the CPU request to 100m, and it still reports the same error. My cluster info: [image] The deployment YAML: [image] The events of the pod: [image]

bingsenmu avatar Aug 10 '22 11:08 bingsenmu

OK, I misunderstood what you meant, sorry. Please change deployment.apps/volcano-scheduler.spec.template.spec.containers.args from -v=3 to -v=5 and paste the predicate logs of pod/volcano-scheduler, like this:

image

waiterQ avatar Aug 11 '22 02:08 waiterQ

I changed the args from -v=3 to -v=5, but I did not see logs similar to the ones you sent. I am posting the volcano-scheduler logs here; I hope they are helpful. Thanks~

I0811 07:17:34.057947 1 scheduler.go:93] Start scheduling ...
I0811 07:17:34.058001 1 node_info.go:277] set the node cn-shanghai.10.0.0.27 status to Ready.
I0811 07:17:34.058081 1 node_info.go:277] set the node cn-shanghai.10.0.0.28 status to Ready.
I0811 07:17:34.058149 1 node_info.go:277] set the node cn-shanghai.10.0.0.38 status to Ready.
I0811 07:17:34.058299 1 node_info.go:277] set the node cn-shanghai.10.0.0.30 status to Ready.
I0811 07:17:34.058386 1 cache.go:971] The priority of job <default/podgroup-a7e08473-32de-4e8f-8ab7-981c42d2e8c6> is </0>
I0811 07:17:34.058424 1 cache.go:1009] There are <1> Jobs, <2> Queues and <4> Nodes in total for scheduling.
I0811 07:17:34.058436 1 session.go:170] Open Session f69a982e-ad70-4826-821f-8705de034a3b with <1> Job and <2> Queues
I0811 07:17:34.058466 1 overcommit.go:72] Enter overcommit plugin ...
I0811 07:17:34.058476 1 overcommit.go:127] Leaving overcommit plugin.
I0811 07:17:34.058495 1 drf.go:204] Total Allocatable cpu 35100.00, memory 179394097152.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00
I0811 07:17:34.060627 1 proportion.go:80] The total resource is <cpu 35100.00, memory 179394097152.00, hugepages-2Mi 0.00, hugepages-1Gi 0.00>
I0811 07:17:34.060651 1 proportion.go:88] The total guarantee resource is <cpu 0.00, memory 0.00>
I0811 07:17:34.060656 1 proportion.go:91] Considering Job <default/podgroup-a7e08473-32de-4e8f-8ab7-981c42d2e8c6>.
I0811 07:17:34.060664 1 proportion.go:124] Added Queue attributes.
I0811 07:17:34.060683 1 proportion.go:182] Considering Queue : weight <1>, total weight <1>.
I0811 07:17:34.060694 1 proportion.go:196] Format queue deserved resource to <cpu 100.00, memory 0.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00>
I0811 07:17:34.060706 1 proportion.go:200] queue is meet
I0811 07:17:34.060714 1 proportion.go:208] The attributes of queue in proportion: deserved <cpu 100.00, memory 0.00>, realCapability <cpu 2000.00, memory 179394097152.00>, allocate <cpu 0.00, memory 0.00>, request <cpu 100.00, memory 0.00>, share <0.00>
I0811 07:17:34.060728 1 proportion.go:220] Remaining resource is <cpu 35000.00, memory 179394097152.00, hugepages-2Mi 0.00, hugepages-1Gi 0.00>
I0811 07:17:34.060744 1 proportion.go:171] Exiting when total weight is 0
I0811 07:17:34.064142 1 binpack.go:158] Enter binpack plugin ...
I0811 07:17:34.064158 1 binpack.go:177] resources [] record in weight but not found on any node
I0811 07:17:34.064166 1 binpack.go:161] Leaving binpack plugin. binpack.weight[1], binpack.cpu[1], binpack.memory[1], no extend resources. ...
I0811 07:17:34.064174 1 enqueue.go:44] Enter Enqueue ...
I0811 07:17:34.064180 1 enqueue.go:62] Added Queue for Job <default/podgroup-a7e08473-32de-4e8f-8ab7-981c42d2e8c6>
I0811 07:17:34.064186 1 enqueue.go:73] Added Job <default/podgroup-a7e08473-32de-4e8f-8ab7-981c42d2e8c6> into Queue
I0811 07:17:34.064193 1 enqueue.go:78] Try to enqueue PodGroup to 1 Queues
I0811 07:17:34.064205 1 overcommit.go:114] Resource in cluster is overused, reject job <default/podgroup-a7e08473-32de-4e8f-8ab7-981c42d2e8c6> to be inqueue
I0811 07:17:34.064212 1 enqueue.go:103] Leaving Enqueue ...
I0811 07:17:34.064219 1 allocate.go:43] Enter Allocate ...
I0811 07:17:34.064228 1 allocate.go:62] Job <default/podgroup-a7e08473-32de-4e8f-8ab7-981c42d2e8c6> Queue skip allocate, reason: job status is pending.
I0811 07:17:34.064233 1 allocate.go:96] Try to allocate resource to 0 Namespaces
I0811 07:17:34.064238 1 allocate.go:111] unlockedNode ID: a589e2f7-0147-486e-ba3d-3491904f241e, Name: cn-shanghai.10.0.0.27
I0811 07:17:34.064245 1 allocate.go:111] unlockedNode ID: c19bb0e6-a34a-44fa-b983-0ea2f382f34a, Name: cn-shanghai.10.0.0.28
I0811 07:17:34.064250 1 allocate.go:111] unlockedNode ID: 25fb647e-5734-43aa-9a61-6367049e200c, Name: cn-shanghai.10.0.0.38
I0811 07:17:34.064255 1 allocate.go:111] unlockedNode ID: b29ed0d2-85c1-4a51-869d-bb9e348564bb, Name: cn-shanghai.10.0.0.30
I0811 07:17:34.064262 1 allocate.go:283] Leaving Allocate ...
I0811 07:17:34.064268 1 backfill.go:40] Enter Backfill ...
I0811 07:17:34.064272 1 backfill.go:90] Leaving Backfill ...
I0811 07:17:34.064376 1 cache.go:773] task unscheduleable default/deploy-with-volcano-d964bb946-t8s8m, message: 1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Undetermined, skip by no condition update
I0811 07:17:34.064387 1 session.go:192] Close Session f69a982e-ad70-4826-821f-8705de034a3b
I0811 07:17:34.064392 1 scheduler.go:112] End scheduling ...

bingsenmu avatar Aug 11 '22 07:08 bingsenmu

WIP, I'm analysing the information provided. Sorry for touching unassign-me.

waiterQ avatar Aug 12 '22 02:08 waiterQ

What is the Kubernetes version of your cluster?

waiterQ avatar Aug 12 '22 03:08 waiterQ

If possible, the cluster's node info is needed as well.

waiterQ avatar Aug 12 '22 04:08 waiterQ

My cluster's Kubernetes version is 1.20.11 on Aliyun (Alibaba Cloud). The node info is below (please let me know if it meets your needs, thanks~):

nodeInfo:
  architecture: amd64
  bootID: 495b453e-5b95-4691-86e8-9bda2164f803
  containerRuntimeVersion: docker://19.3.15
  kernelVersion: 4.19.91-25.6.al7.x86_64
  kubeProxyVersion: v1.20.11-aliyun.1
  kubeletVersion: v1.20.11-aliyun.1
  machineID: "xxxxx"
  operatingSystem: linux
  osImage: Alibaba Cloud Linux (Aliyun Linux) 2.1903 LTS (Hunting Beagle)
  systemUUID: 45be8658-0d19-466d-bd5b-1303f6805ace

bingsenmu avatar Aug 15 '22 13:08 bingsenmu

Can you post the Volcano configuration? It looks like the job was rejected by the overcommit plugin:

I0811 07:17:34.064205 1 overcommit.go:114] Resource in cluster is overused, reject job <default/podgroup-a7e08473-32de-4e8f-8ab7-981c42d2e8c6> to be inqueue

kerthcet avatar Aug 18 '22 08:08 kerthcet

Is this the volcano_scheduler.conf you mean? [image]

bingsenmu avatar Aug 29 '22 13:08 bingsenmu

any update? I met the same issue here.

cauwulixuan avatar Aug 31 '22 04:08 cauwulixuan

Log statements need to be added in pkg/scheduler/plugins/overcommit/overcommit.go:

klog.V(4).Infof("node(%v) Allocatable:%s, Used:%s", node.Name, node.Allocatable, node.Used)
klog.V(4).Infof("idleResource:%s, total:%s, overCommitFactor:%v, used:%s", op.idleResource, total, op.overCommitFactor, used)
[image]
klog.V(4).Infof("jobMinReq:%s, idle:%s", jobMinReq, idle)
[image]

Then recompile the scheduler image, replace the running one with it, and post the logs again.
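
The values these log statements print are the inputs to the enqueue decision. As a rough model of that decision, and not the plugin's actual code, the Go sketch below assumes that idle capacity is the total allocatable resources multiplied by the overcommit factor, minus the used resources, and that a job is admitted when its minimum request plus whatever is already Inqueue fits within that idle capacity. The Resource type, the function names, the factor of 1.2, and the "used" figures are all hypothetical; only the total and the 100m CPU request are taken from the log posted earlier in this thread.

```go
package main

import "fmt"

// Resource is a hypothetical, simplified stand-in for the scheduler's
// resource type: CPU in millicores and memory in bytes.
type Resource struct {
	MilliCPU float64
	Memory   float64
}

// Add returns the element-wise sum of r and o.
func (r Resource) Add(o Resource) Resource {
	return Resource{MilliCPU: r.MilliCPU + o.MilliCPU, Memory: r.Memory + o.Memory}
}

// Sub returns the element-wise difference r - o.
func (r Resource) Sub(o Resource) Resource {
	return Resource{MilliCPU: r.MilliCPU - o.MilliCPU, Memory: r.Memory - o.Memory}
}

// Scale multiplies every dimension of r by f.
func (r Resource) Scale(f float64) Resource {
	return Resource{MilliCPU: r.MilliCPU * f, Memory: r.Memory * f}
}

// LessEqual reports whether every dimension of r is at most that of o.
func (r Resource) LessEqual(o Resource) bool {
	return r.MilliCPU <= o.MilliCPU && r.Memory <= o.Memory
}

// Idle models the assumed formula: total * factor - used.
func Idle(total, used Resource, factor float64) Resource {
	return total.Scale(factor).Sub(used)
}

// Enqueueable models the assumed admission rule: inqueue + jobMinReq <= idle.
func Enqueueable(idle, inqueue, jobMinReq Resource) bool {
	return inqueue.Add(jobMinReq).LessEqual(idle)
}

func main() {
	// Total allocatable and the job's CPU request come from the posted log.
	total := Resource{MilliCPU: 35100, Memory: 179394097152}
	jobMinReq := Resource{MilliCPU: 100}
	const factor = 1.2 // assumed overcommit factor

	// Two hypothetical "used" values: an empty cluster, and one where the
	// scheduler believes almost everything is already consumed.
	for _, used := range []Resource{{MilliCPU: 0}, {MilliCPU: 42100}} {
		idle := Idle(total, used, factor)
		fmt.Printf("used cpu %.0f -> admitted: %v\n",
			used.MilliCPU, Enqueueable(idle, Resource{}, jobMinReq))
	}
}
```

Under this model, a cluster with plenty of free capacity rejects a small job only if the scheduler's view of used or Inqueue resources is much larger than expected, which is what the extra log statements are meant to expose.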

waiterQ avatar Sep 02 '22 09:09 waiterQ

Is it solved?

jiamin13579 avatar Oct 19 '22 07:10 jiamin13579

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] avatar Jan 21 '23 10:01 stale[bot]

I met the same issue here.

leonharetd avatar Feb 27 '23 08:02 leonharetd

I took the following steps and it worked with Volcano v1.7.0. I ran kubectl get queue default -o yaml and found that the queue had no information about allocated: [image] The dispatcher reported the same error as yours:

Resource in cluster is overused, reject job  or queue <xxx> is meet

My installation was upgraded from v1.6.0; maybe I had installed the previous version and did not completely uninstall it. I then reinstalled Volcano v1.7.0, ran kubectl get queue default -o yaml again [image], and found the allocated information. After that it worked well. I hope this helps you.

leonharetd avatar Feb 28 '23 03:02 leonharetd

Can you tell us which workloads you use? The "Resource in cluster is overused" problem mostly shows up together with a problem in the queue's podGroups, so checking the podGroups' resources is a better way to figure out what is broken. Another simple option is to drop the overcommit plugin (if you do not need strict limits on queue resources).

For now, Volcano still cannot adapt very well to workloads whose resources change all the time. Hope this helps : )

waiterQ avatar Feb 28 '23 07:02 waiterQ

I use this

https://github.com/volcano-sh/volcano/blob/79e6b749f5a7d4b77deb838632abf238b8754c66/example/task-start-dependency/mpi.yaml

leonharetd avatar Feb 28 '23 07:02 leonharetd

You can check all podgroups currently in the Inqueue state to see whether a leaked podgroup is occupying system resources. @leonharetd

wangyang0616 avatar Mar 01 '23 03:03 wangyang0616

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] avatar Jun 10 '23 01:06 stale[bot]

Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗

stale[bot] avatar Aug 10 '23 01:08 stale[bot]