volcano
volcano copied to clipboard
fix Data Race leading to panic in Scheduler (and use more efficient strings comparison)
3e03d161efb603e92191ff46c092cf231c0c8803 - fix #3324 (scheduler panic in unexpected places due to Data Race):
reading operations during the modification of the var IgnoredDevicesList []string
could lead to a panic:
WARNING: DATA RACE
Read at 0x00000565d810 by goroutine 276:
volcano.sh/volcano/pkg/scheduler/api.NewResource()
/go/src/volcano.sh/volcano/pkg/scheduler/api/resource_info.go:89 +0x40a
volcano.sh/volcano/pkg/scheduler/api.(*JobInfo).GetMinResources()
/go/src/volcano.sh/volcano/pkg/scheduler/api/job_info.go:490 +0x1705
volcano.sh/volcano/pkg/scheduler/plugins/proportion.(*proportionPlugin).OnSessionOpen.func5()
/go/src/volcano.sh/volcano/pkg/scheduler/plugins/proportion/proportion.go:352 +0x16a1
volcano.sh/volcano/pkg/scheduler/framework.(*Session).JobEnqueueable()
/go/src/volcano.sh/volcano/pkg/scheduler/framework/session_plugins.go:401 +0x24a
volcano.sh/volcano/pkg/scheduler/actions/enqueue.(*Action).Execute()
/go/src/volcano.sh/volcano/pkg/scheduler/actions/enqueue/enqueue.go:95 +0xddd
volcano.sh/volcano/pkg/scheduler.(*Scheduler).runOnce()
/go/src/volcano.sh/volcano/pkg/scheduler/scheduler.go:126 +0x478
volcano.sh/volcano/pkg/scheduler.(*Scheduler).runOnce-fm()
<autogenerated>:1 +0x39
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:226 +0x48
k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:227 +0xce
k8s.io/apimachinery/pkg/util/wait.JitterUntil()
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:204 +0x10d
k8s.io/apimachinery/pkg/util/wait.Until()
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:161 +0x48
volcano.sh/volcano/pkg/scheduler.(*Scheduler).Run.func2()
/go/src/volcano.sh/volcano/pkg/scheduler/scheduler.go:95 +0x58
457d37bdb6b07ed525446658d53863605f49646a - a little more efficient string comparison:
taking into account the source code of strings.Compare
:
// Compare returns an integer comparing two strings lexicographically.
// The result will be 0 if a == b, -1 if a < b, and +1 if a > b.
//
// Compare is included only for symmetry with package bytes.
// It is usually clearer and always faster to use the built-in
// string comparison operators ==, <, >, and so on.
func Compare(a, b string) int {
// NOTE(rsc): This function does NOT call the runtime cmpstring function,
// because we do not want to provide any performance justification for
// using strings.Compare. Basically no one should use strings.Compare.
// As the comment above says, it is here only for symmetry with package bytes.
// If performance is important, the compiler should be changed to recognize
// the pattern so that all code doing three-way comparisons, not just code
// using strings.Compare, can benefit.
if a == b {
return 0
}
if a < b {
return -1
}
return +1
}
Welcome @belo4ya!
It looks like this is your first PR to volcano-sh/volcano 馃帀.
Thank you, and welcome to Volcano. :smiley:
/assign @k82cn
How about set IgnoredDevicesList
in nodeInfo
?
How about set
IgnoredDevicesList
innodeInfo
?
Hi, any progress here?
any update? i can help u finish this pr, @belo4ya
any update? i can help u finish this pr, @belo4ya
I glanced at the idea of linking IgnoredDevicesList
with NodeInfo
- and it turned out that there is already such a relationship:
ni.Others[GPUSharingDevice].(Devices).GetIgnoredDevices(),
ni.Others[vgpu.DeviceName].(Devices).GetIgnoredDevices(),
Then, for each filtering operation of ignored resources, an IgnoredDevicesList could be calculated locally:
var niList []NodeInfo
ignoredDevicesList := map[string]struct{}{}
for _, ni := range niList {
for _, device := range ni.Others[GPUSharingDevice].(Devices).GetIgnoredDevices() {
ignoredDevicesList[device] = struct{}{}
}
for _, device := range ni.Others[vgpu.DeviceName].(Devices).GetIgnoredDevices() {
ignoredDevicesList[device] = struct{}{}
}
}
But I have a sharp contradiction when I go further. I don't fully understand the current work with IgnoredDevicesList
- it doesn't seem right to me.
And at the same time, it seems to me that NodeInfo
and IgnoredDevicesList
are different levels of abstraction. IgnoredDevicesList
is calculated at the time of NodeInfo
initialization, and the further (and previous) initialization of NodeInfo
depends on IgnoredDevicesList
. And IgnoredDevicesList
is an aggregate that is useless before initializing all NodeInfo
.
That is, I mean that IgnoredDevicesList
is not an object that should exist inside NodeInfo
as an attribute. And this is an object that should be at the level with nodes []*v1.Node
It seems that such a change will require major edits and I would like to know your opinion, how do you see how this should be implemented? @Monokaix, @Vacant2333
any update? i can help u finish this pr, @belo4ya
I glanced at the idea of linking
IgnoredDevicesList
withNodeInfo
- and it turned out that there is already such a relationship:ni.Others[GPUSharingDevice].(Devices).GetIgnoredDevices(), ni.Others[vgpu.DeviceName].(Devices).GetIgnoredDevices(),
Then, for each filtering operation of ignored resources, an IgnoredDevicesList could be calculated locally:
var niList []NodeInfo ignoredDevicesList := map[string]struct{}{} for _, ni := range niList { for _, device := range ni.Others[GPUSharingDevice].(Devices).GetIgnoredDevices() { ignoredDevicesList[device] = struct{}{} } for _, device := range ni.Others[vgpu.DeviceName].(Devices).GetIgnoredDevices() { ignoredDevicesList[device] = struct{}{} } }
But I have a sharp contradiction when I go further. I don't fully understand the current work with
IgnoredDevicesList
- it doesn't seem right to me.And at the same time, it seems to me that
NodeInfo
andIgnoredDevicesList
are different levels of abstraction.IgnoredDevicesList
is calculated at the time ofNodeInfo
initialization, and the further (and previous) initialization ofNodeInfo
depends onIgnoredDevicesList
. AndIgnoredDevicesList
is an aggregate that is useless before initializing allNodeInfo
.That is, I mean that
IgnoredDevicesList
is not an object that should exist insideNodeInfo
as an attribute. And this is an object that should be at the level withnodes []*v1.Node
It seems that such a change will require major edits and I would like to know your opinion, how do you see how this should be implemented? @Monokaix, @Vacant2333
Yeah, we can keep this: )
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: william-wang
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~pkg/scheduler/OWNERS~~ [william-wang]
Approvers can indicate their approval by writing /approve
in a comment
Approvers can cancel approval by writing /approve cancel
in a comment