Scheduler panics with errors like "resource is not sufficient to do operation:"
What happened: The scheduler panicked and restarted several times over about half an hour, then recovered, all without any manual intervention; during that window some normal tasks still ran automatically and completed. The logs are as follows:
igdata2-k8s-bgp25g-music75.gy.ntes fit failed: plugin NodeAffinity predicates failed node(s) didn't match Pod's node affinity/selector
I0115 10:12:11.618831 1 predicate_helper.go:75] Predicates failed for task <spark/wizard-308703-exec-303> on node <bigdata-k8s-mix-music94.gy.ntes>: task spark/wizard-308703-exec-303 on node bigdata-k8s-mix-music94.gy.ntes fit failed: plugin NodeUnschedulable predicates failed node(s) were unschedulable
I0115 10:12:11.618829 1 predicate_helper.go:75] Predicates failed for task <spark/wizard-308703-exec-303> on node <bigdata-k8s-mix-music84.gy.ntes>: task spark/wizard-308703-exec-303 on node bigdata-k8s-mix-music84.gy.ntes fit failed: plugin NodeUnschedulable predicates failed node(s) were unschedulable
I0115 10:12:11.618838 1 predicate_helper.go:75] Predicates failed for task <spark/wizard-308703-exec-303> on node <bigdata-k8s-mix-music90.gy.ntes>: task spark/wizard-308703-exec-303 on node bigdata-k8s-mix-music90.gy.ntes fit failed: plugin NodeUnschedulable predicates failed node(s) were unschedulable
I0115 10:12:11.618846 1 predicate_helper.go:75] Predicates failed for task <spark/wizard-308703-exec-303> on node <bigdata2-k8s-bgp25g-music30.gy.ntes>: task spark/wizard-308703-exec-303 on node bigdata2-k8s-bgp25g-music30.gy.ntes fit failed: plugin NodeAffinity predicates failed node(s) didn't match Pod's node affinity/selector
I0115 10:12:11.618853 1 predicate_helper.go:75] Predicates failed for task <spark/wizard-308703-exec-303> on node <bigdata2-k8s-bgp25g-music57.gy.ntes>: task spark/wizard-308703-exec-303 on node bigdata2-k8s-bgp25g-music57.gy.ntes fit failed: Insufficient cpu
I0115 10:12:11.618875 1 predicate_helper.go:75] Predicates failed for task <spark/wizard-308703-exec-303> on node <bigdata-k8s-mix-music58.gy.ntes>: task spark/wizard-308703-exec-303 on node bigdata-k8s-mix-music58.gy.ntes fit failed: plugin NodeUnschedulable predicates failed node(s) were unschedulable
I0115 10:12:11.618903 1 predicate_helper.go:75] Predicates failed for task <spark/wizard-308703-exec-303> on node <bigdata-k8s-mix-music29.gy.ntes>: task spark/wizard-308703-exec-303 on node bigdata-k8s-mix-music29.gy.ntes fit failed: plugin NodeUnschedulable predicates failed node(s) were unschedulable
I0115 10:12:11.618902 1 predicate_helper.go:75] Predicates failed for task <spark/wizard-308703-exec-303> on node <bigdata2-k8s-bgp25g-music66.gy.ntes>: task spark/wizard-308703-exec-303 on node bigdata2-k8s-bgp25g-music66.gy.ntes fit failed: plugin NodeAffinity predicates failed node(s) didn't match Pod's node affinity/selector
E0115 10:12:11.618926 1 runtime.go:77] Observed a panic: resource is not sufficient to do operation: <cpu -2600.00, memory 96698806272.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00, ephemeral-storage 1846138130432000.00> sub <cpu 0.00, memory 0.00>
goroutine 21922 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1dbd8c0?, 0xc0885a0570})
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x200000003?})
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:49 +0x75
panic({0x1dbd8c0, 0xc0885a0570})
/usr/local/go/src/runtime/panic.go:884 +0x213
volcano.sh/volcano/pkg/scheduler/util/assert.Assert(0xaf?, {0xc0c15be000?, 0xc0a308dca8?})
/go/src/volcano.sh/volcano/pkg/scheduler/util/assert/assert.go:33 +0x174
volcano.sh/volcano/pkg/scheduler/util/assert.Assertf(0x0, {0x222bdaf?, 0x40dc08?}, {0xc0a308dca8?, 0x7ff56a9ed4c0?, 0xd0?})
/go/src/volcano.sh/volcano/pkg/scheduler/util/assert/assert.go:43 +0x56
volcano.sh/volcano/pkg/scheduler/api.(*Resource).Sub(0xc0c15a7980, 0xc0c00af520)
/go/src/volcano.sh/volcano/pkg/scheduler/api/resource_info.go:246 +0x9c
volcano.sh/volcano/pkg/scheduler/api.(*NodeInfo).FutureIdle(0xc0c008c480)
/go/src/volcano.sh/volcano/pkg/scheduler/api/node_info.go:96 +0x246
volcano.sh/volcano/pkg/scheduler/actions/allocate.(*Action).Execute.func1(0xc01e879600, 0xc0c008c480)
/go/src/volcano.sh/volcano/pkg/scheduler/actions/allocate/allocate.go:101 +0x36
volcano.sh/volcano/pkg/scheduler/util.(*predicateHelper).PredicateNodes.func1(0x6?)
/go/src/volcano.sh/volcano/pkg/scheduler/util/predicate_helper.go:74 +0x3ab
k8s.io/client-go/util/workqueue.ParallelizeUntil.func1()
/go/pkg/mod/k8s.io/[email protected]/util/workqueue/parallelizer.go:90 +0x106
created by k8s.io/client-go/util/workqueue.ParallelizeUntil
/go/pkg/mod/k8s.io/[email protected]/util/workqueue/parallelizer.go:76 +0x1ce
E0115 10:12:11.618952 1 runtime.go:77] Observed a panic: resource is not sufficient to do operation: <cpu -8510.00, memory 81269567488.00, ephemeral-storage 1924896325632000.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00> sub <cpu 0.00, memory 0.00>
goroutine 21928 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1dbd8c0?, 0xc0880da720})
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x450fd6?})
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:49 +0x75
panic({0x1dbd8c0, 0xc0880da720})
/usr/local/go/src/runtime/panic.go:884 +0x213
volcano.sh/volcano/pkg/scheduler/util/assert.Assert(0xaf?, {0xc0c11b4000?, 0xc0c1b35ca8?})
/go/src/volcano.sh/volcano/pkg/scheduler/util/assert/assert.go:33 +0x174
volcano.sh/volcano/pkg/scheduler/util/assert.Assertf(0x0, {0x222bdaf?, 0x40dc08?}, {0xc0c1b35ca8?, 0x7ff56a9e7250?, 0xd0?})
/go/src/volcano.sh/volcano/pkg/scheduler/util/assert/assert.go:43 +0x56
volcano.sh/volcano/pkg/scheduler/api.(*Resource).Sub(0xc0c118df40, 0xc0c00bc8a0)
/go/src/volcano.sh/volcano/pkg/scheduler/api/resource_info.go:246 +0x9c
volcano.sh/volcano/pkg/scheduler/api.(*NodeInfo).FutureIdle(0xc0c008c540)
/go/src/volcano.sh/volcano/pkg/scheduler/api/node_info.go:96 +0x246
volcano.sh/volcano/pkg/scheduler/actions/allocate.(*Action).Execute.func1(0xc01e879600, 0xc0c008c540)
/go/src/volcano.sh/volcano/pkg/scheduler/actions/allocate/allocate.go:101 +0x36
volcano.sh/volcano/pkg/scheduler/util.(*predicateHelper).PredicateNodes.func1(0x9?)
/go/src/volcano.sh/volcano/pkg/scheduler/util/predicate_helper.go:74 +0x3ab
k8s.io/client-go/util/workqueue.ParallelizeUntil.func1()
/go/pkg/mod/k8s.io/[email protected]/util/workqueue/parallelizer.go:90 +0x106
created by k8s.io/client-go/util/workqueue.ParallelizeUntil
/go/pkg/mod/k8s.io/[email protected]/util/workqueue/parallelizer.go:76 +0x1ce
Earlier logs that may help:
I0115 10:12:07.316612 1 allocate.go:245] Binding Task <spark/wizard-308744-exec-7> to node <bigdata2-k8s-bgp25g-music58.gy.ntes>
I0115 10:12:07.316629 1 statement.go:266] After allocated Task <spark/wizard-308744-exec-7> to Node <bigdata2-k8s-bgp25g-music58.gy.ntes>: idle <cpu 36400.00, memory 227695308800.00, ephemeral-storage 1846138130432000.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00>, used <cpu 90600.00, memory 304705064960.00, ephemeral-storage 5368709120000.00>, releasing <cpu 0.00, memory 0.00>
I0115 10:12:07.331873 1 allocate.go:245] Binding Task <spark/wizard-308765-exec-5> to node <bigdata2-k8s-bgp25g-music58.gy.ntes>
I0115 10:12:07.331889 1 statement.go:266] After allocated Task <spark/wizard-308765-exec-5> to Node <bigdata2-k8s-bgp25g-music58.gy.ntes>: idle <cpu 33400.00, memory 218031632384.00, ephemeral-storage 1846138130432000.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00>, used <cpu 93600.00, memory 314368741376.00, ephemeral-storage 5368709120000.00>, releasing <cpu 0.00, memory 0.00>
I0115 10:12:07.335107 1 allocate.go:245] Binding Task <spark/wizard-308773-exec-2> to node <bigdata2-k8s-bgp25g-music58.gy.ntes>
I0115 10:12:07.335127 1 statement.go:266] After allocated Task <spark/wizard-308773-exec-2> to Node <bigdata2-k8s-bgp25g-music58.gy.ntes>: idle <cpu 30400.00, memory 208367955968.00, ephemeral-storage 1846138130432000.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00>, used <cpu 96600.00, memory 324032417792.00, ephemeral-storage 5368709120000.00>, releasing <cpu 0.00, memory 0.00>
I0115 10:12:07.335293 1 cache.go:307] Revert assumed volumes for task spark/wizard-308773-exec-2 on node bigdata2-k8s-bgp25g-music58.gy.ntes
E0115 10:12:07.335299 1 statement.go:393] Failed to allocate task: for node bigdata2-k8s-bgp25g-music58.gy.ntes resources [cpu] are not enough to put task <spark/wizard-308773-exec-2>, idle: cpu 2400.00, memory 113610240000.00, hugepages-2Mi 0.00, ephemeral-storage 1846138130432000.00, hugepages-1Gi 0.00, req: cpu 3000.00, memory 9663676416.00
E0115 10:12:07.352554 1 node_info.go:393] "Idle resources turn into negative after allocated" nodeName="bigdata2-k8s-bgp25g-music58.gy.ntes" task="spark/wizard-308703-exec-287" resources=[cpu] idle="cpu -1600.00, memory 100322684928.00, hugepages-2Mi 0.00, ephemeral-storage 1846138130432000.00, hugepages-1Gi 0.00" req="cpu 3000.00, memory 9663676416.00"
E0115 10:12:07.357392 1 node_info.go:393] "Idle resources turn into negative after allocated" nodeName="bigdata2-k8s-bgp25g-music58.gy.ntes" task="spark/wizard-308703-exec-299" resources=[cpu] idle="cpu -4600.00, memory 90659008512.00, hugepages-2Mi 0.00, ephemeral-storage 1846138130432000.00, hugepages-1Gi 0.00" req="cpu 3000.00, memory 9663676416.00"
I0115 10:12:07.361560 1 allocate.go:245] Binding Task <spark/wizard-308824-exec-10> to node <bigdata2-k8s-bgp25g-music58.gy.ntes>
I0115 10:12:07.361578 1 statement.go:266] After allocated Task <spark/wizard-308824-exec-10> to Node <bigdata2-k8s-bgp25g-music58.gy.ntes>: idle <cpu 27400.00, memory 198704279552.00, ephemeral-storage 1846138130432000.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00>, used <cpu 99600.00, memory 333696094208.00, ephemeral-storage 5368709120000.00>, releasing <cpu 0.00, memory 0.00>
I0115 10:12:07.361698 1 cache.go:307] Revert assumed volumes for task spark/wizard-308824-exec-10 on node bigdata2-k8s-bgp25g-music58.gy.ntes
E0115 10:12:07.361703 1 statement.go:393] Failed to allocate task: for node bigdata2-k8s-bgp25g-music58.gy.ntes resources [cpu] are not enough to put task <spark/wizard-308824-exec-10>, idle: cpu -4600.00, memory 90659008512.00, hugepages-2Mi 0.00, ephemeral-storage 1846138130432000.00, hugepages-1Gi 0.00, req: cpu 3000.00, memory 9663676416.00
E0115 10:12:07.363133 1 node_info.go:393] "Idle resources turn into negative after allocated" nodeName="bigdata2-k8s-bgp25g-music58.gy.ntes" task="spark/kyuubi-pandora-by-m-bi-soc-eventpage-view-stat-1d-by-foll-c31991fd-82a8-39e2-b0c1-91e516c0f535-driver" resources=[cpu] idle="cpu -5600.00, memory 87035129856.00, hugepages-2Mi 0.00, ephemeral-storage 1846138130432000.00, hugepages-1Gi 0.00" req="cpu 1000.00, memory 3623878656.00"
What you expected to happen: The scheduler should remain stable and not panic.
How to reproduce it (as minimally and precisely as possible): I have not found the reproduction conditions yet.
Anything else we need to know?: I found that this PR made a modification to allow "used" to become negative: https://github.com/volcano-sh/volcano/pull/3006/files. I'm not sure whether it's related; maybe we can discuss it here to make this issue clearer.
Environment:
- Volcano Version: 1.8.1
- Kubernetes version (use kubectl version): 1.19.3
@Monokaix Hello, I've run into this problem, but after reviewing the code I haven't been able to identify the root cause. Could you offer any assistance? thx~
Yeah, the PR https://github.com/volcano-sh/volcano/pull/3006 you mentioned made a modification that allows node idle resources to become negative. Can you search for "Node out of sync" in the Volcano scheduler log? This case might happen when a task has specified a nodeName or a node's allocatable CPU has decreased, so you can check for that.
You should also find out which node's idle CPU resource became negative, and check what changes have happened to that node and which tasks have been assigned to and removed from it.
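To make that failure mode concrete, here is a minimal, self-contained sketch (using a hypothetical, simplified resource type, not Volcano's actual api.Resource) of how an unconditional subtraction can drive idle negative, after which even subtracting zero trips an assert-guarded subtraction, matching the `sub <cpu 0.00, memory 0.00>` panic in the stack trace:

```go
package main

import "fmt"

// res is a hypothetical, simplified stand-in for Volcano's api.Resource,
// tracking only milli-CPU.
type res struct{ milliCPU float64 }

// sub mirrors an assert-guarded subtraction: it panics when the receiver
// cannot cover the subtrahend, which is the failure seen in the stack trace.
func (r *res) sub(rr res) {
	if r.milliCPU < rr.milliCPU {
		panic(fmt.Sprintf("resource is not sufficient to do operation: <cpu %.2f> sub <cpu %.2f>",
			r.milliCPU, rr.milliCPU))
	}
	r.milliCPU -= rr.milliCPU
}

// forceAllocate mirrors an allocateIdleResource-style path that subtracts
// unconditionally, allowing idle to go negative.
func forceAllocate(idle *res, req res) {
	idle.milliCPU -= req.milliCPU
}

func main() {
	idle := &res{milliCPU: 2400} // idle cpu as in the 10:12:07 log
	forceAllocate(idle, res{milliCPU: 3000})
	fmt.Printf("idle after forced allocation: cpu %.2f\n", idle.milliCPU) // cpu -600.00

	// A later FutureIdle-style call that uses the guarded sub now panics,
	// even when subtracting zero, because 0 <= -600 does not hold.
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r)
		}
	}()
	idle.sub(res{milliCPU: 0})
}
```

This reproduces the shape of the reported panic: the node cache first goes negative on the allocate path, and a later guarded subtraction on the same cached value panics even for a zero subtrahend.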
func (ni *NodeInfo) allocateIdleResource(ti *TaskInfo) {
	ok, resources := ti.Resreq.LessEqualWithResourcesName(ni.Idle, Zero)
	if ok {
		ni.Idle.sub(ti.Resreq)
		return
	}

	ni.Idle.sub(ti.Resreq)
	klog.ErrorS(nil, "Idle resources turn into negative after allocated",
		"nodeName", ni.Name, "task", klog.KObj(ti.Pod), "resources", resources, "idle", ni.Idle.String(), "req", ti.Resreq.String())
}
How about adding a recover from the panic and setting the node's idle to zero resources?
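For concreteness, a minimal sketch of what that recover-and-clamp idea could look like (hypothetical names, not Volcano's actual API); as the discussion notes, the cost is that the cached idle no longer matches the pods actually on the node:

```go
package main

import "fmt"

// idleRes is a hypothetical, simplified resource type used only to
// illustrate the recover-and-clamp proposal.
type idleRes struct{ milliCPU, memory float64 }

// sub panics when the receiver cannot cover the request, like the
// assert-guarded subtraction in the stack trace.
func (r *idleRes) sub(req idleRes) {
	if r.milliCPU < req.milliCPU || r.memory < req.memory {
		panic("resource is not sufficient to do operation")
	}
	r.milliCPU -= req.milliCPU
	r.memory -= req.memory
}

// subClamped recovers from the panic and clamps any dimension that would
// go negative to zero, subtracting normally from the others.
func (r *idleRes) subClamped(req idleRes) {
	defer func() {
		if recover() == nil {
			return
		}
		if r.milliCPU < req.milliCPU {
			r.milliCPU = 0
		} else {
			r.milliCPU -= req.milliCPU
		}
		if r.memory < req.memory {
			r.memory = 0
		} else {
			r.memory -= req.memory
		}
	}()
	r.sub(req)
}

func main() {
	// idle and req as in the 10:12:07 "not enough to put task" log line.
	idle := idleRes{milliCPU: 2400, memory: 113610240000}
	idle.subClamped(idleRes{milliCPU: 3000, memory: 9663676416})
	fmt.Printf("cpu %.2f, memory %.2f\n", idle.milliCPU, idle.memory)
}
```

This silences the panic, but the clamped idle value no longer reflects the sum of requests on the node, which is the cache-inconsistency objection raised in the reply.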
Yes, I can find such logs, but they do not indicate that the node (bigdata2-k8s-bgp25g-music58.gy.ntes) caused the panic. Other than these two possibilities, is it possible that the issue was caused by using Volcano and kube-scheduler simultaneously?
Yeah, you can check the pods on that node, find after which pod was scheduled the scheduler panicked, and analyze the whole scheduling process of all pods on that node.
How about adding a recover from the panic and setting the node's idle to zero resources?
This would leave the node cache inconsistent, so it is not a viable approach.
But we should not panic in the scheduler. The original code let the node become outofsync and would not schedule tasks to it.
The outofsync state was removed by https://github.com/volcano-sh/volcano/pull/2998 to handle some scenarios, so we should not count on it; the node or the device plugin should recover the resource report by themselves.
In our setting, outofsync nodes (i.e. allocated > idle) are quite common, especially when the cluster is crowded, because GPU/CPU breakdowns are frequent events in a large cluster. Given that outofsync is a common and normal event, the scheduler should never panic on such a normal path.
As the most obvious solution, the scheduler should skip the node instead of panicking.
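The skip-instead-of-panic idea can be sketched as follows; this is a hypothetical shape (the `futureIdle` helper, `resource` type, and error value are illustrative, not Volcano's actual signatures), where the FutureIdle-style computation returns an error for an out-of-sync node and the allocate loop skips it:

```go
package main

import (
	"errors"
	"fmt"
)

// resource is a hypothetical, simplified resource type for illustration.
type resource struct{ milliCPU, memory float64 }

var errOutOfSync = errors.New("node idle resources are negative; node is out of sync")

// futureIdle returns idle plus releasing resources, or an error when the node
// cache is out of sync (idle already negative), instead of panicking.
func futureIdle(idle, releasing resource) (resource, error) {
	if idle.milliCPU < 0 || idle.memory < 0 {
		return resource{}, errOutOfSync
	}
	return resource{idle.milliCPU + releasing.milliCPU, idle.memory + releasing.memory}, nil
}

func main() {
	nodes := []struct {
		name string
		idle resource
	}{
		// Out-of-sync idle values as in the panic message.
		{"bigdata2-k8s-bgp25g-music58.gy.ntes", resource{-2600, 96698806272}},
		{"healthy-node", resource{36400, 227695308800}},
	}
	for _, n := range nodes {
		// Allocate-loop sketch: skip out-of-sync nodes rather than crashing.
		if _, err := futureIdle(n.idle, resource{}); err != nil {
			fmt.Printf("skip node %s: %v\n", n.name, err)
			continue
		}
		fmt.Printf("node %s is a candidate\n", n.name)
	}
}
```

The out-of-sync node is filtered out like any other failed predicate, so one bad node cache cannot take down the whole scheduling cycle.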
