When submitting a job, panic: resource is not sufficient to do operation in scheduler/plugins/proportion.go

Open nulls-cell opened this issue 3 years ago • 5 comments

What happened: When I submit a job, volcano-scheduler panics with "resource is not sufficient" and the program exits. Details: ` panic: resource is not sufficient to do operation: <cpu 100000.00, memory 7000.00, nvidia.com/gpu 0.00> sub <cpu 120000.00, memory 3000.00>

goroutine 27 [running]:
volcano.sh/volcano/pkg/scheduler/util/assert.Assert(0xc3, {0xc000150090, 0xc0007798a8}) github/volcano/pkg/scheduler/util/assert/assert.go:33 +0xfc
volcano.sh/volcano/pkg/scheduler/util/assert.Assertf(0x0, {0x56293c3, 0x0}, {0xc0007798a8, 0xc0001a4008, 0x55fa6e7}) github/volcano/pkg/scheduler/util/assert/assert.go:43 +0x56
volcano.sh/volcano/pkg/scheduler/api.(*Resource).Sub(0xc00068e680, 0xc00068e6c0) github/volcano/pkg/scheduler/api/resource_info.go:232 +0x85
volcano.sh/volcano/pkg/scheduler/plugins/proportion.(*proportionPlugin).OnSessionOpen(0xc00068e100, 0xc0007241c0) github/volcano/pkg/scheduler/plugins/proportion/proportion.go:244 +0xe65
volcano.sh/volcano/pkg/scheduler/framework.OpenSession({0x575c2d8, 0xc000754480}, {0xc000302030, 0x1, 0x1}, {0x0, 0x0, 0x0}) github/volcano/pkg/scheduler/framework/framework.go:43 +0x27c
volcano.sh/volcano/pkg/scheduler/plugins/proportion.TestProportionPanic.func3() github/volcano/pkg/scheduler/plugins/proportion/proportion_test.go:306 +0x15e
created by volcano.sh/volcano/pkg/scheduler/plugins/proportion.TestProportionPanic github/volcano/pkg/scheduler/plugins/proportion/proportion_test.go:302 +0x2015

Process finished with the exit code 1 `
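To make the failing condition concrete, here is a minimal sketch of a checked subtraction that rejects the operands from the panic message above. It is not Volcano's code: the `Resource` map type and the `LessEqual` and `CheckedSub` helpers are hypothetical stand-ins, and reading the left operand as the remaining resources and the right operand as a queue's guarantee is an assumption based on the discussion in this thread.

```go
package main

import "fmt"

// Resource is a hypothetical, simplified stand-in for a scheduler resource:
// a map from resource name to quantity.
type Resource map[string]float64

// LessEqual reports whether every dimension of r fits within other.
// Dimensions missing from other are treated as zero.
func (r Resource) LessEqual(other Resource) bool {
	for name, qty := range r {
		if qty > other[name] {
			return false
		}
	}
	return true
}

// CheckedSub returns l - r, or an error if any dimension of r exceeds l.
// Returning an error instead of asserting keeps the caller alive.
func CheckedSub(l, r Resource) (Resource, error) {
	if !r.LessEqual(l) {
		return nil, fmt.Errorf("resource is not sufficient to do operation: %v sub %v", l, r)
	}
	out := Resource{}
	for name, qty := range l {
		out[name] = qty - r[name]
	}
	return out, nil
}

func main() {
	// Values copied from the panic message in this issue.
	remaining := Resource{"cpu": 100000, "memory": 7000, "nvidia.com/gpu": 0}
	guarantee := Resource{"cpu": 120000, "memory": 3000}
	if _, err := CheckedSub(remaining, guarantee); err != nil {
		fmt.Println("subtraction rejected:", err)
	}
}
```

Memory alone would fit (3000 <= 7000), but cpu does not (120000 > 100000), so a single oversized dimension is enough to trigger the failure.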

What you expected to happen: The scheduler keeps running without panicking.

How to reproduce it (as minimally and precisely as possible): I reproduced this error with a unit test. I modified the file "pkg/scheduler/plugins/proportion/proportion_test.go" and ran the unit tests, and a similar error occurred. The complete test code is here: https://github.com/liruisee/volcano/blob/master/pkg/scheduler/plugins/proportion/proportion_test.go

You can reproduce this problem with the following steps:
1. Run: "git clone https://github.com/volcano-sh/volcano.git"
2. Run: "cd volcano/pkg/scheduler/plugins/proportion", then replace proportion_test.go with this file: https://github.com/liruisee/volcano/blob/master/pkg/scheduler/plugins/proportion/proportion_test.go
3. Run: "go test". The panic is thrown.

I think this problem can be explained by the table below: (image) The deserved value is calculated as max(min(realCapability, weight, request), guarantee). Therefore, an error is bound to be reported in this scenario.
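The formula can be worked through for a single resource dimension. The sketch below is illustrative only: the `queue` struct, the `deserved` and `sumDeserved` helpers, and every number are hypothetical (the original table image is not available), and `weightShare` is assumed to mean the portion of the cluster that a queue's weight entitles it to.

```go
package main

import (
	"fmt"
	"math"
)

// queue holds one resource dimension (for example CPU) for a single queue.
// All fields and values are hypothetical and chosen only for illustration.
type queue struct {
	name           string
	realCapability float64
	weightShare    float64
	request        float64
	guarantee      float64
}

// deserved applies max(min(realCapability, weight, request), guarantee).
func deserved(q queue) float64 {
	capped := math.Min(q.realCapability, math.Min(q.weightShare, q.request))
	return math.Max(capped, q.guarantee)
}

// sumDeserved adds up the deserved value of every queue.
func sumDeserved(queues []queue) float64 {
	sum := 0.0
	for _, q := range queues {
		sum += deserved(q)
	}
	return sum
}

func main() {
	total := 90.0 // hypothetical cluster capacity; three equal weights give a share of 30 each
	queues := []queue{
		{"queue1", 90, 30, 10, 80}, // guarantee far above its weighted share
		{"queue2", 90, 30, 50, 0},
		{"queue3", 90, 30, 50, 0},
	}
	for _, q := range queues {
		fmt.Printf("%s deserved=%.0f\n", q.name, deserved(q))
	}
	fmt.Printf("sum=%.0f total=%.0f\n", sumDeserved(queues), total)
}
```

With these numbers the deserved values are 80, 30 and 30, which sum to 140 against a total of 90. Once the other queues have taken their share, less than 80 remains, so subtracting the first queue's guarantee from the remainder cannot succeed.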

Anything else we need to know?:

Environment:

  • Volcano Version:
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

nulls-cell avatar Jul 19 '22 08:07 nulls-cell

/assign @hwdef

Thor-wl avatar Jul 20 '22 09:07 Thor-wl

/cc @qiankunli

Thor-wl avatar Jul 21 '22 01:07 Thor-wl

It may be that queue2 and queue3 are calculated first, leaving fewer than 80 in remaining resources, while queue1's guarantee is 80, which causes remaining.sub(queue1.guarantee) to fail.

According to the logic of the existing code (which mainly allocates resources according to each queue's weight and handles the guarantee along the way), there should be no error as long as a queue's guarantee resource does not exceed its weight resource (total resource * queue's weight). Is it reasonable that we ask users to ensure that guarantee resource must be less than weight resources?
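The constraint proposed here could be checked ahead of time. The following is only a sketch of that idea, not an existing Volcano validation: the `queueSpec` type and the `overGuaranteed` function are hypothetical, the weighted share is assumed to be total * weight / sum of weights, and a single resource dimension is used for brevity.

```go
package main

import "fmt"

// queueSpec is a hypothetical, single-dimension view of a queue's settings.
type queueSpec struct {
	Name      string
	Weight    int
	Guarantee float64
}

// overGuaranteed returns the names of queues whose guarantee exceeds their
// weighted share of total, assuming share = total * weight / sum(weights).
func overGuaranteed(total float64, queues []queueSpec) []string {
	sumWeights := 0
	for _, q := range queues {
		sumWeights += q.Weight
	}
	var bad []string
	if sumWeights == 0 {
		return bad
	}
	for _, q := range queues {
		share := total * float64(q.Weight) / float64(sumWeights)
		if q.Guarantee > share {
			bad = append(bad, q.Name)
		}
	}
	return bad
}

func main() {
	// Hypothetical numbers: every queue keeps the default weight of 1.
	queues := []queueSpec{
		{"queue1", 1, 80},
		{"queue2", 1, 0},
		{"queue3", 1, 0},
	}
	fmt.Println(overGuaranteed(90, queues))
}
```

With default weights of 1, each queue's share here is 30, so a guarantee of 80 is flagged. This also shows why such a rule is restrictive when weights are left at their defaults.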

qiankunli avatar Jul 22 '22 02:07 qiankunli

I tried to list the limiting cases for this error; at most, they add up to twice the total resources of the cluster. (image)

nulls-cell avatar Jul 22 '22 02:07 nulls-cell

"is it reasonable that we ask users to ensure that guarantee resource must be less than weight resources?"

We are not actually using weight, so all queue weights have the default value of 1. There are many cases where the guarantee is greater than total*weight and is later adjusted to a much smaller amount by an admin.

Even if weight has to be used, it has to be a dynamic value.

teou avatar Jul 22 '22 03:07 teou

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] avatar Oct 29 '22 10:10 stale[bot]

+1

nature1995 avatar Nov 04 '22 04:11 nature1995

1 scheduler.go:90] scheduler completes Initialization and start to run
I1104 11:24:37.075546 1 cache.go:1106] There are <60> Jobs, <13> Queues and <8> Nodes in total for scheduling.
I1104 11:24:37.075651 1 session.go:180] Open Session bfaace37-87cd-4971-963b-68512dc29ac4 with <60> Job and <13> Queues
E1104 11:24:37.079331 1 runtime.go:76] Observed a panic: resource is not sufficient to do operation: <cpu 92000.00, memory 362022715392.00, ephemeral-storage 5208405964731000.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00> sub <cpu 190000.00, memory 386547056640.00>
goroutine 297 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x18c3120?, 0xc001498ed0}) /go/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x86
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0006210b0?}) /go/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x75
panic({0x18c3120, 0xc001498ed0}) /usr/local/go/src/runtime/panic.go:838 +0x207
volcano.sh/volcano/pkg/scheduler/util/assert.Assert(0xa8?, {0xc0004381a0?, 0xc000621298?}) /go/src/volcano.sh/volcano/pkg/scheduler/util/assert/assert.go:33 +0x109
volcano.sh/volcano/pkg/scheduler/util/assert.Assertf(0x0, {0x1c913a8?, 0x7f7391260590?}, {0xc000621298?, 0x7f73b82a51d8?, 0xd0?}) /go/src/volcano.sh/volcano/pkg/scheduler/util/assert/assert.go:43 +0x56
volcano.sh/volcano/pkg/scheduler/api.(*Resource).Sub(0xc0014b0100, 0xc00143be60) /go/src/volcano.sh/volcano/pkg/scheduler/api/resource_info.go:234 +0x9c
volcano.sh/volcano/pkg/scheduler/plugins/proportion.(*proportionPlugin).OnSessionOpen(0xc00143be80, 0xc000c3a000) /go/src/volcano.sh/volcano/pkg/scheduler/plugins/proportion/proportion.go:120 +0x84a
volcano.sh/volcano/pkg/scheduler/framework.OpenSession({0x1eeadb0?, 0xc0002cf400?}, {0xc0006f26f0, 0x2, 0x2}, {0x0, 0x0, 0x0}) /go/src/volcano.sh/volcano/pkg/scheduler/framework/framework.go:43 +0x27f
volcano.sh/volcano/pkg/scheduler.(*Scheduler).runOnce(0xc00033da40) /go/src/volcano.sh/volcano/pkg/scheduler/scheduler.go:111 +0x205
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x0?) /go/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0008c02a0?, {0x1ec5c80, 0xc00081d050}, 0x1, 0xc0000de5a0) /go/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0008c02a0?, 0x3b9aca00, 0x0, 0x0?, 0xc00062efd0?) /go/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x76ecc0?, 0xc00029fcd0?, 0xc00062efb8?) /go/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x25
created by volcano.sh/volcano/pkg/scheduler.(*Scheduler).Run /go/src/volcano.sh/volcano/pkg/scheduler/scheduler.go:91 +0x1aa
panic: resource is not sufficient to do operation: <cpu 92000.00, memory 362022715392.00, ephemeral-storage 5208405964731000.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00> sub <cpu 190000.00, memory 386547056640.00> [recovered]
panic: resource is not sufficient to do operation: <cpu 92000.00, memory 362022715392.00, ephemeral-storage 5208405964731000.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00> sub <cpu 190000.00, memory 386547056640.00>
goroutine 297 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0006210b0?}) /go/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0xd8
panic({0x18c3120, 0xc001498ed0}) /usr/local/go/src/runtime/panic.go:838 +0x207
volcano.sh/volcano/pkg/scheduler/util/assert.Assert(0xa8?, {0xc0004381a0?, 0xc000621298?}) /go/src/volcano.sh/volcano/pkg/scheduler/util/assert/assert.go:33 +0x109
volcano.sh/volcano/pkg/scheduler/util/assert.Assertf(0x0, {0x1c913a8?, 0x7f7391260590?}, {0xc000621298?, 0x7f73b82a51d8?, 0xd0?}) /go/src/volcano.sh/volcano/pkg/scheduler/util/assert/assert.go:43 +0x56
volcano.sh/volcano/pkg/scheduler/api.(*Resource).Sub(0xc0014b0100, 0xc00143be60) /go/src/volcano.sh/volcano/pkg/scheduler/api/resource_info.go:234 +0x9c
volcano.sh/volcano/pkg/scheduler/plugins/proportion.(*proportionPlugin).OnSessionOpen(0xc00143be80, 0xc000c3a000) /go/src/volcano.sh/volcano/pkg/scheduler/plugins/proportion/proportion.go:120 +0x84a
volcano.sh/volcano/pkg/scheduler/framework.OpenSession({0x1eeadb0?, 0xc0002cf400?}, {0xc0006f26f0, 0x2, 0x2}, {0x0, 0x0, 0x0}) /go/src/volcano.sh/volcano/pkg/scheduler/framework/framework.go:43 +0x27f
volcano.sh/volcano/pkg/scheduler.(*Scheduler).runOnce(0xc00033da40) /go/src/volcano.sh/volcano/pkg/scheduler/scheduler.go:111 +0x205
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x0?) /go/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0008c02a0?, {0x1ec5c80, 0xc00081d050}, 0x1, 0xc0000de5a0) /go/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0008c02a0?, 0x3b9aca00, 0x0, 0x0?, 0xc00062efd0?) /go/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x76ecc0?, 0xc00029fcd0?, 0xc00062efb8?) /go/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x25
created by volcano.sh/volcano/pkg/scheduler.(*Scheduler).Run /go/src/volcano.sh/volcano/pkg/scheduler/scheduler.go:91 +0x1aa

nature1995 avatar Nov 04 '22 04:11 nature1995

Hello! Communication here is not very real-time. Would it be convenient to add me on WeChat? My WeChat ID: 15210945863

nulls-cell avatar Nov 04 '22 07:11 nulls-cell

#2570

zbbkeepgoing avatar Nov 17 '22 11:11 zbbkeepgoing

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] avatar Mar 23 '23 04:03 stale[bot]

Retain the current issue. /remove-lifecycle stale

wangyang0616 avatar Mar 24 '23 09:03 wangyang0616

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] avatar Aug 07 '23 05:08 stale[bot]