Unexpected Behaviour for AddJobEnqueueableFn
I tried to add an AddJobEnqueueableFn to the predicates plugin and move that plugin to the first tier of volcano-scheduler.conf, in order to solve a problem where jobs were not being scheduled due to a mix of resource requests and node predicates.
To give an overview of the problem: I have a mix of 3090 and 1080 nodes in the cluster and use a nodeSelector to steer jobs to the right node type. However, if there is a long backlog of 1080 jobs in one queue, they fill up the "Inqueue" state and prevent jobs in another queue from being scheduled onto 3090s that should be available; those jobs get stuck in "Pending" instead. In the example below, all jobs in the training queue are 1080 jobs and all jobs in the default queue are 3090 jobs, and there should be 3090 resources available.
vcctl queue list
Name      Weight  State  Inqueue  Pending  Running  Unknown
default   1       Open   0        1        1        0
training  1       Open   38       22       6        0
To resolve this, I wanted to stop the 1080 jobs from transitioning to the Inqueue state. After looking at the source code, I concluded that I should be able to block the 1080 jobs from moving from Pending to Inqueue by adding an AddJobEnqueueableFn to the predicates plugin, since it's the GPU names and counts that I'm trying to filter on.
ssn.AddJobEnqueueableFn(pp.Name(), func(obj interface{}) int {
	job := obj.(*api.JobInfo)

	// Count the idle GPUs per GPU type across all nodes.
	gpuTypeCount := make(map[string]int)
	for _, nodeInfo := range ssn.Nodes {
		gpuName := nodeInfo.Node.Labels["nvidia.com/gpu.product"]
		gpuTypeCount[gpuName] += len(nodeInfo.GetDevicesIdleGPUs())
	}

	// Subtract what this job's tasks request, keyed by their nodeSelector.
	for _, task := range job.Tasks {
		if podGPUType, ok := task.Pod.Spec.NodeSelector["nvidia.com/gpu.product"]; ok {
			gpuTypeCount[podGPUType] -= api.GetGPUNumberOfPod(task.Pod)
		}
	}

	// If any GPU type goes negative we would be overcommitted, so reject.
	for gpuType, gpuRemain := range gpuTypeCount {
		klog.V(3).Infof("<%s> has available qty <%d>", gpuType, gpuRemain)
		if gpuRemain < 0 {
			return util.Reject
		}
	}
	return util.Permit
})
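For reference, that function is registered from the predicates plugin's OnSessionOpen in my patch, roughly like the sketch below. The surrounding plugin scaffolding is paraphrased from memory, so treat the exact receiver and type names as an assumption rather than a copy of the upstream plugin:

// Paraphrased sketch of where the hook is registered in my patch; only the
// enqueue hook is shown, the existing predicate registrations are elided.
func (pp *predicatesPlugin) OnSessionOpen(ssn *framework.Session) {
	// ...existing predicate registrations (node affinity, taints, GPU checks)...

	ssn.AddJobEnqueueableFn(pp.Name(), func(obj interface{}) int {
		// per-GPU-type accounting as shown above
		return util.Permit
	})
}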
I also moved predicates to the first tier because, from what I can tell, only the first tier is used to check enqueueability (my config is below, followed by a rough sketch of the tier model I had in mind). However, with this setup everything just gets pushed into the "Inqueue" state. That does, I suppose, solve my problem of 3090 jobs not being scheduled when they could be (they now start up correctly), but I didn't expect everything to end up Inqueue.

I also can't see my debug message <%s> has available qty <%d> being printed, so I don't think my enqueueable function is even being called. So I'm wondering where I've gone wrong in my thinking/code.
volcano-scheduler.conf: |
  actions: "enqueue, allocate, backfill"
  tiers:
  - plugins:
    - name: priority
    - name: gang
    - name: conformance
    - name: predicates
      arguments:
        predicate.GPUNumberEnable: true
  - plugins:
    - name: overcommit
    - name: drf
    - name: proportion
    - name: nodeorder
    - name: binpack
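For completeness, this is the rough mental model of the enqueue check that I was working from when I moved predicates up a tier. It is a paraphrased sketch with simplified placeholder types, not a copy of the framework source, so it may not match the real JobEnqueueable logic exactly:

package sketch

// Rough paraphrase of how I understand the enqueue check to walk the
// configured tiers. All types and names here are simplified placeholders;
// this is my mental model, not the framework code.

const (
	reject = -1 // mirrors util.Reject
	permit = 1  // mirrors util.Permit
)

type voteFn func(job interface{}) int

type plugin struct{ name string }
type tier struct{ plugins []plugin }

func jobEnqueueable(tiers []tier, registered map[string]voteFn, job interface{}) bool {
	for _, t := range tiers {
		permitted := false
		for _, p := range t.plugins {
			fn, ok := registered[p.name]
			if !ok {
				continue // plugin did not register an enqueueable function
			}
			switch fn(job) {
			case reject:
				// Any reject within the tier blocks the job from Inqueue.
				return false
			case permit:
				permitted = true
			}
		}
		// If a plugin in this tier permitted the job and none rejected it,
		// my read is that later tiers are not consulted -- which is why I
		// moved predicates into the first tier.
		if permitted {
			return true
		}
	}
	// No plugin voted either way: the job is allowed to enqueue.
	return true
}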
/cc @hwdef Can you help with that?
ok, I'll look into this.
Sorry for the late reply, I've been a little busy lately. I had a preliminary look at your code and I think there is probably no problem. I'll do some testing in my environment next. Regarding the missing log output, have you tried increasing the log level? Another thing that puzzles me is why the two queues affect each other at all; they should be isolated.
No worries about the late reply; my patch is working, so nothing is urgent, but it's not working for the reason I expected, which is why I created this issue. Both queues run on the same cluster with the same pool of GPUs (8 3090, 4 1080, 3 1080 Ti). What I think is happening is that if the scheduler is agnostic to GPU type (and only looks at counts), it will move 1080 jobs to Inqueue even when there are no 1080s left, because the available 3090s still contribute to a positive count; those jobs then never run, since they don't match their nodeSelector annotations. This in turn stops the 3090 jobs from reaching the Inqueue state, because the scheduler thinks all GPUs are already spoken for (the 1080 jobs sitting in Inqueue block the 3090 jobs from moving to Inqueue). The one thing that doesn't fit that hypothesis is that I'm not sure why 38 1080 jobs are in the Inqueue state when there are only 15 GPUs in the cluster in total.
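To make that hypothesis concrete, here is a toy sketch of the difference between counting GPUs in aggregate and counting them per type. It is purely illustrative, not Volcano code; the label values and numbers are just examples that mirror my cluster:

package sketch

// Toy illustration of the hypothesis above: an aggregate GPU count lets a
// 1080 job enqueue even when only 3090s are idle, while a per-type count
// (keyed by the nodeSelector label) would reject it.

type jobRequest struct {
	gpuType string // value of the nvidia.com/gpu.product nodeSelector
	gpus    int
}

// aggregateEnqueueable ignores GPU type, like a scheduler that only compares
// total requested GPUs against total idle GPUs.
func aggregateEnqueueable(idleByType map[string]int, req jobRequest) bool {
	total := 0
	for _, n := range idleByType {
		total += n
	}
	return req.gpus <= total
}

// perTypeEnqueueable only counts idle GPUs of the type the job selects.
func perTypeEnqueueable(idleByType map[string]int, req jobRequest) bool {
	return req.gpus <= idleByType[req.gpuType]
}

func example() (bool, bool) {
	// Example state: only 3090s are idle, all 1080s are busy.
	idle := map[string]int{
		"NVIDIA-GeForce-RTX-3090": 8,
		"NVIDIA-GeForce-GTX-1080": 0,
	}
	req := jobRequest{gpuType: "NVIDIA-GeForce-GTX-1080", gpus: 1}

	// aggregate says true (8 idle in total), per-type says false (no idle
	// 1080s) -- which is how 1080 jobs can pile up in Inqueue while never
	// actually being runnable.
	return aggregateEnqueueable(idle, req), perTypeEnqueueable(idle, req)
}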
I see. I think there's a recent discussion in the community that's similar to the scenario you're talking about. https://github.com/volcano-sh/volcano/pull/2227
Certainly, but I think better integration with the existing nodeSelector mechanisms would be cleaner than another plugin and another set of attributes, and that is roughly what I'm using now, mostly successfully.
Yes, you're right, let's continue to troubleshoot the issue.
Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).
Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗