
quick termination of jobs with hdrf queues


What happened:

I create three queues that use hdrf. Two queues share the root/dev hierarchy, and one queue is the only one under root/alt. The queues are set up like this:

side-queue   root/alt/side    1/1/1
big-queue    root/dev/big     1/1/100
little-queue root/dev/little  1/1/1

Twelve identical jobs are submitted to each queue to saturate the resources. Each job requests 1 CPU and 256Mi of memory. On a local cluster this saturates the resources, so some jobs must wait while others finish. To keep the analysis simple, the only plugin enabled is drf with enableHierarchy: true.

During the reclaim action, jobs are terminated, especially those submitted to side-queue. On my setup these jobs eventually fail because they hit maxRetry. None of the jobs run for long before being terminated, and I see jobs terminated from all queues. Pods churn between Running -> Terminating -> Pending -> Running ...

What you expected to happen:

Due to the hierarchy specified, I would expect half of the running jobs to be from side-queue and the other half to be split between big-queue and little-queue. Between big-queue and little-queue, I expect most of the running jobs to be in big-queue, because at the last level it has a weight of 100 while little-queue only has a weight of 1. Since reclaim is enabled, some jobs might be terminated while resources shift between queues, but I would expect this to settle quickly into a steady state.
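Roughly, on the 12-CPU node with 1-CPU jobs, that expectation works out to numbers like these (my own back-of-the-envelope arithmetic):

root/alt : root/dev                 1 : 1    ->  6 CPUs : 6 CPUs
root/dev/big : root/dev/little    100 : 1    ->  ~5.94  : ~0.06 CPUs

side-queue    ~6 running jobs
big-queue     ~5-6 running jobs
little-queue  ~0-1 running jobs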

How to reproduce it (as minimally and precisely as possible):

I am running this on my local machine with 12 CPUs and 8GB of memory dedicated to Kubernetes.

To launch the jobs, I run the following script which creates the three queues and submits the jobs to the queue:

#!/bin/bash

# Three queues, each with a limit
cat <<EOF | kubectl apply -f -
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: side-queue
  annotations:
    "volcano.sh/hierarchy": "root/alt/side"
    "volcano.sh/hierarchy-weights": "1/1/1"
spec:
  weight: 1
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: big-queue
  annotations:
    "volcano.sh/hierarchy": "root/dev/big"
    "volcano.sh/hierarchy-weights": "1/1/100"
spec:
  weight: 1
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: little-queue
  annotations:
    "volcano.sh/hierarchy": "root/dev/little"
    "volcano.sh/hierarchy-weights": "1/1/1"
spec:
  weight: 1
EOF

job_template_file=$(mktemp)
cat <<EOF > ${job_template_file}
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: vcjob-job-<id>
spec:
  minAvailable: 1
  schedulerName: volcano
  policies:
    - event: PodEvicted
      action: RestartJob
  maxRetry: 100
  queue: <queue>
  tasks:
    - replicas: 1
      name: "x"
      template:
        metadata:
          name: core
        spec:
          containers:
            - image: ubuntu
              imagePullPolicy: IfNotPresent
              name: ubuntu
              command: ['sh', '-c', 'sleep 600000']
              resources:
                requests:
                  cpu: "1"
                  memory: "256Mi"
          restartPolicy: OnFailure
EOF

# launch jobs to each queue, to saturate the node
for id in $(seq 12)
do
    sed 's/<queue>/side-queue/g' < ${job_template_file} | \
    sed "s/<id>/side-$id/g" | \
    kubectl apply -f -
done

for id in $(seq 12)
do
    sed 's/<queue>/big-queue/g' < ${job_template_file} | \
    sed "s/<id>/big-$id/g" | \
    kubectl apply -f -
done

for id in $(seq 12)
do
    sed 's/<queue>/little-queue/g' < ${job_template_file} | \
    sed "s/<id>/little-$id/g" | \
    kubectl apply -f -
done

My volcano-scheduler.conf:

actions: "enqueue, allocate, reclaim"
tiers:
- plugins:
  - name: drf
    enableHierarchy: true
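For anyone reproducing this: the scheduler reads this config from a ConfigMap, so on a stock install applying it looks roughly like the following (names assume the default deployment; adjust if yours differs):

kubectl -n volcano-system edit configmap volcano-scheduler-configmap
# restart the scheduler if it does not pick up the change automatically
kubectl -n volcano-system rollout restart deployment volcano-scheduler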

I watch the status of the jobs with something like watch kubectl get pods and can see the pods cycling between statuses as they try to run in the cluster.
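Concretely, something like:

# the churn shows up as pods flipping between Running, Terminating and Pending
watch -n 2 "kubectl get pods --sort-by=.metadata.name"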

Anything else we need to know?:

If I set up the queues so that they only share the root level, I don't see this issue.
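For illustration, by that I mean a layout like the following, where only root is shared (shown just for big-queue; the other queues follow the same pattern):

cat <<EOF | kubectl apply -f -
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: big-queue
  annotations:
    "volcano.sh/hierarchy": "root/big"
    "volcano.sh/hierarchy-weights": "1/100"
spec:
  weight: 1
EOF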

Environment:

  • Volcano Version: master branch (as of Wed Aug 4 17:16:34 2021 +0800)
  • Kubernetes version (use kubectl version):Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.7", GitCommit:"1dd5338295409edcfff11505e7bb246f0d325d15", GitTreeState:"clean", BuildDate:"2021-01-13T13:15:20Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: I am running this on my local machine for testing
  • OS (e.g. from /etc/os-release): I am running this using Docker on my local mac.
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

Robert-Christensen-visa avatar Aug 04 '21 23:08 Robert-Christensen-visa

@Robert-Christensen-visa Thanks for your report. I will try to reproduce the issue with your steps.

william-wang avatar Aug 05 '21 01:08 william-wang

Sorry for the delay, I will try to take a look at that.

shinytang6 avatar Aug 09 '21 10:08 shinytang6

I have looked into this a little more to see what is going on. This is only an issue if little-queue has a job running in it. What happens is that in updateHierarchicalShare none of the second-level hierarchy nodes are saturated, so to determine the resource consumption the dominant share is rescaled using mdr. Since little-queue consumes very few resources because of its small weight, the resource consumption attributed to this queue is small.

https://github.com/volcano-sh/volcano/blob/7e1e6960c61e2536f037de8567178a8e9d5f7cba/pkg/scheduler/plugins/drf/drf.go#L574-L585

The value of mdr is then used to scale how much is being demanded by each of the children:

https://github.com/volcano-sh/volcano/blob/7e1e6960c61e2536f037de8567178a8e9d5f7cba/pkg/scheduler/plugins/drf/drf.go#L600-L601

Because only one job will be running in little-queue, at the root/dev level the resources are scaled by what is running in little-queue. Since the resources used by both little-queue and big-queue are scaled by mdr, the system thinks more should be running in the root/dev hierarchy. The reclaim action then recognizes this and deletes jobs running in the root/alt hierarchy to try to move resources over to root/dev.
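To put rough, illustrative numbers on that (using the expected ~6 CPUs / ~6 CPUs split between root/alt and root/dev on my 12-CPU node):

big-queue     ~5 running jobs  ->  dominant share ~5/12
little-queue  ~1 running job   ->  dominant share ~1/12
mdr gets driven by little-queue's tiny consumption, so after the scaling
root/dev looks like it holds far less than the ~6 CPUs it is actually using,
and reclaim keeps evicting side-queue jobs from root/alt to "correct" that.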

I think the issue here is that the weights on a given level of the hierarchy are not all the same, and using the mdr value in a hierarchy with differing weights is what causes the problem. The Hierarchical Scheduling for Diverse Datacenter Workloads paper does not give clear direction on what must be modified to make this work with differing weights on a given level; it just says "our discussion can be generalized to multiple weights in a straightforward fashion", but researchers often say things like that when they don't want to give additional details. I know I did when I wrote academic publications :)

This is not an issue when all the weights on a level are the same, which is what the unit tests cover. However, I assume queues are expected to support different weights (so that, for example, 80% of resources go to team A and 20% go to team B).
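As a made-up example of what I mean, a top-level 80/20 split between two teams would be expressed with annotations like these:

cat <<EOF | kubectl apply -f -
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: team-a-queue
  annotations:
    "volcano.sh/hierarchy": "root/team-a"
    "volcano.sh/hierarchy-weights": "1/80"
spec:
  weight: 1
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: team-b-queue
  annotations:
    "volcano.sh/hierarchy": "root/team-b"
    "volcano.sh/hierarchy-weights": "1/20"
spec:
  weight: 1
EOF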

I just wanted to write down what I figured out before the weekend hit and I forgot everything I figured out.

Robert-Christensen-visa avatar Aug 14 '21 00:08 Robert-Christensen-visa

I found an easy way to replicate the issue.

In the unit tests, everything neatly fits in the resources provided. For example, in the following test, each request is for exactly 1 CPU or 1G of memory.

https://github.com/volcano-sh/volcano/blob/7e1e6960c61e2536f037de8567178a8e9d5f7cba/pkg/scheduler/plugins/drf/hdrf_test.go#L157-L159

The issue above can be recreated by slightly adjusting the amount of CPU provided by the cluster, for example by changing it to this:

nodes: []*v1.Node{util.BuildNode("n",
	util.BuildResourceList("30001m", "30G"),
	make(map[string]string))},

Because the amount of available CPU is only slightly higher, the way resources are distributed to the queues should not change: each queue requesting CPU should get a third, and each queue requesting memory should get half.

When I make this change the unit test does not pass, giving me something like this in the log:

hdrf_test.go:263: blocking nodes test: job pg4 expected resource cpu 0.00, memory 15000000000.00, nvidia.com/gpu 0.00, but got cpu 0.00, memory 10000000000.00, nvidia.com/gpu 0.00
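Running the drf plugin tests after making that change is enough to see the failure; from the repository root, something like:

go test -v ./pkg/scheduler/plugins/drf/...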

I think this is because of the same problem as originally posted.

Robert-Christensen-visa avatar Aug 18 '21 15:08 Robert-Christensen-visa

@Robert-Christensen-visa BTW, have you pasted the wrong YAML? The weight of big-queue should be 100 instead of 1. The YAML you provided is as follows:

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: big-queue
  annotations:
    "volcano.sh/hierarchy": "root/dev/big"
    "volcano.sh/hierarchy-weights": "1/1/100"
spec:
  weight: 1

As you pointed out about the Hierarchical Scheduling for Diverse Datacenter Workloads paper, I went through the whole example and the discussion only covers equal weights at the same level. As far as I can see, queues at the same level with the same parent but different weights should be allocated dominant resources in proportion to their weights. I'm trying to reproduce the behavior you described. Another question: does your cluster contain only one node with 12 CPUs and 8G of memory, or more than one node?

Thor-wl avatar Nov 11 '21 02:11 Thor-wl

@Thor-wl Have you reproduced this issue in your environment?

william-wang avatar Nov 11 '21 06:11 william-wang

@Thor-wl I am testing locally to make sure I understand how it works and whether it fulfills my needs before deploying on a multi-node cluster. The 12 CPUs and 8G of memory are a single-node Kubernetes cluster.

If I change the weight of big-queue to 100, the results are the same. Also, I thought hdrf ignores spec["weight"] and only uses the value in metadata["annotations"]["volcano.sh/hierarchy-weights"]. Either way, setting the following for big-queue behaves the same as the one I provided:

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: big-queue
  annotations:
    "volcano.sh/hierarchy": "root/dev/big"
    "volcano.sh/hierarchy-weights": "1/1/100"
spec:
  weight: 100

I think the easiest way to replicate the problem is to adjust the unit test slightly.

https://github.com/volcano-sh/volcano/blob/7e1e6960c61e2536f037de8567178a8e9d5f7cba/pkg/scheduler/plugins/drf/hdrf_test.go#L157-L159

Change line 158 to util.BuildResourceList("30001m", "30G"). The expected result should not change: the resources should still be distributed evenly between all queues, with an extra "wasted" 1m of CPU. However, with this change the unit test fails because some queues are allocated more resources than they deserve.

Robert-Christensen-visa avatar Nov 12 '21 16:11 Robert-Christensen-visa

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] avatar Feb 11 '22 05:02 stale[bot]

@Thor-wl Im facing same problem as above. Any workaround for now?

Sharathmk99 avatar May 22 '22 22:05 Sharathmk99

@Thor-wl Im facing same problem as above. Any workaround for now?

I'm sorry, not yet. I will do that after v1.6 is released.

Thor-wl avatar May 24 '22 01:05 Thor-wl

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] avatar Sep 08 '22 22:09 stale[bot]