Quick termination of jobs with hdrf queues
What happened:
I create three queues that use hdrf. Two queues share the root/dev hierarchy, and one queue is the only one under root/alt. The queues are set up like this:
side-queue root/alt/side 1/1/1
big-queue root/dev/big 1/1/100
little-queue root/dev/little 1/1/1
12 identical jobs are submitted to each queue, each requesting 1 CPU and 256Mi of memory. On my local cluster this saturates the resources, so some jobs must queue while other jobs finish. To reduce the complexity of figuring out what is causing what, the only plugin used is drf with enableHierarchy: true.
During the reclaim action, jobs are terminated, especially work submitted to side-queue. On my setup, these jobs eventually fail because they hit maxRetry. None of the jobs is able to run for long before being terminated, and I see jobs terminated from all queues. Pods churn between Running -> Terminating -> Pending -> Running ...
What you expected to happen:
Due to the hierarchy specified, I would expect half of the running jobs to be from side-queue and the other half to be split between big-queue and little-queue. Between big-queue and little-queue, I expect most of the running jobs to be in big-queue, because at the last level it has a weight of 100 while little-queue only has a weight of 1. Since reclaim is enabled, I expect some jobs might be terminated if resources need to be shifted between queues, but I would expect this to settle quickly to a steady state.
How to reproduce it (as minimally and precisely as possible):
I am running this on my local machine with 12 CPUs and 8GB of memory dedicated to kubernetes.
To launch the jobs, I run the following script, which creates the three queues and submits the jobs to them:
#!/bin/bash
# Three queues, each with a limit
cat <<EOF | kubectl apply -f -
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: side-queue
  annotations:
    "volcano.sh/hierarchy": "root/alt/side"
    "volcano.sh/hierarchy-weights": "1/1/1"
spec:
  weight: 1
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: big-queue
  annotations:
    "volcano.sh/hierarchy": "root/dev/big"
    "volcano.sh/hierarchy-weights": "1/1/100"
spec:
  weight: 1
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: little-queue
  annotations:
    "volcano.sh/hierarchy": "root/dev/little"
    "volcano.sh/hierarchy-weights": "1/1/1"
spec:
  weight: 1
EOF
job_template_file=$(mktemp)
cat <<EOF > ${job_template_file}
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: vcjob-job-<id>
spec:
  minAvailable: 1
  schedulerName: volcano
  policies:
    - event: PodEvicted
      action: RestartJob
  maxRetry: 100
  queue: <queue>
  tasks:
    - replicas: 1
      name: "x"
      template:
        metadata:
          name: core
        spec:
          containers:
            - image: ubuntu
              imagePullPolicy: IfNotPresent
              name: ubuntu
              command: ['sh', '-c', 'sleep 600000']
              resources:
                requests:
                  cpu: "1"
                  memory: "256Mi"
          restartPolicy: OnFailure
EOF
# launch jobs to each queue, to saturate the node
for id in $(seq 12)
do
  sed 's/<queue>/side-queue/g' < ${job_template_file} | \
    sed "s/<id>/side-$id/g" | \
    kubectl apply -f -
done
for id in $(seq 12)
do
  sed 's/<queue>/big-queue/g' < ${job_template_file} | \
    sed "s/<id>/big-$id/g" | \
    kubectl apply -f -
done
for id in $(seq 12)
do
  sed 's/<queue>/little-queue/g' < ${job_template_file} | \
    sed "s/<id>/little-$id/g" | \
    kubectl apply -f -
done
My volcano-scheduler.conf:
actions: "enqueue, allocate, reclaim"
tiers:
- plugins:
  - name: drf
    enableHierarchy: true
I watch the status of the jobs using something like watch kubectl get pods and notice the changes in status of the different pods trying to run in the cluster.
Anything else we need to know?:
If I set up the queues so that all queues share only the root, I don't see this issue.
Environment:
- Volcano Version: master branch (as of Wed Aug 4 17:16:34 2021 +0800)
- Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.7", GitCommit:"1dd5338295409edcfff11505e7bb246f0d325d15", GitTreeState:"clean", BuildDate:"2021-01-13T13:15:20Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: I am running this on my local machine for testing
- OS (e.g. from /etc/os-release): I am running this using Docker on my local Mac.
- Kernel (e.g. uname -a):
- Install tools:
- Others:
@Robert-Christensen-visa Thanks for your report. I will try to reproduce the issue with your steps.
Sorry for the delay, I will try to take a look at that :)
I have looked into this a little more to see what is going on. This is only an issue if little-queue has a job running in it. What happens is that in updateHierarchicalShare none of the second-level hierarchy nodes is saturated, which means the dominant share is rescaled using mdr to determine resource consumption. Since little-queue is consuming very few resources because of its small weight, the resource consumption of this queue is small.
https://github.com/volcano-sh/volcano/blob/7e1e6960c61e2536f037de8567178a8e9d5f7cba/pkg/scheduler/plugins/drf/drf.go#L574-L585
The value of mdr is used to scale how much is being demanded by each of the children:
https://github.com/volcano-sh/volcano/blob/7e1e6960c61e2536f037de8567178a8e9d5f7cba/pkg/scheduler/plugins/drf/drf.go#L600-L601
Because only 1 job will be running in little-queue, at the root/dev level the resources will be scaled by what is running in little-queue. This makes the system think more should be running in the root/dev hierarchy, because it scales the resources used by both little-queue and big-queue by mdr. The reclaim action recognizes this and deletes jobs running in the root/alt hierarchy to try to move them to the root/dev hierarchy.
I think the issue here is that the weights on a given level of the hierarchy are not all the same, so using the mdr value in a hierarchy with different weights causes a problem. The Hierarchical Scheduling for Diverse Datacenter Workloads paper does not give clear direction on what must be modified to make this work with differing weights on a given level; it just says "our discussion can be generalized to multiple weights in a straightforward fashion", but researchers often like to say things like that when they just don't want to give additional details. I know I did when I wrote academic publications :)
This is not an issue when all the weights on a level are the same, such as what is tested in the unit tests. However, I would assume queues are expected to support different weights (so that, for example, 80% of resources go to team A and 20% go to team B).
I just wanted to write down what I figured out before the weekend hit and I forget everything.
I found an easy way to replicate the issue.
In the unit tests, everything fits neatly in the resources provided. For example, in the following test, each request is for exactly 1 CPU or 1G of memory.
https://github.com/volcano-sh/volcano/blob/7e1e6960c61e2536f037de8567178a8e9d5f7cba/pkg/scheduler/plugins/drf/hdrf_test.go#L157-L159
The issue above can be recreated by slightly adjusting the amount of CPU provided by the cluster, such as by changing it to this:
nodes: []*v1.Node{util.BuildNode("n",
	util.BuildResourceList("30001m", "30G"),
	make(map[string]string))},
Because the amount of CPU available is only slightly higher, the resources should still be distributed to the queues the same way: each queue requesting CPU should get a third, and each queue requesting memory should get half.
When I make this change, the unit test does not pass, giving me something like this in the log:
hdrf_test.go:263: blocking nodes test: job pg4 expected resource cpu 0.00, memory 15000000000.00, nvidia.com/gpu 0.00, but got cpu 0.00, memory 10000000000.00, nvidia.com/gpu 0.00
I think this is because of the same problem as originally posted.
@Robert-Christensen-visa BTW, have you copied the incorrect yaml? The weight of big-queue should be 100 instead of 1. The yaml you provided is as follows:
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: big-queue
  annotations:
    "volcano.sh/hierarchy": "root/dev/big"
    "volcano.sh/hierarchy-weights": "1/1/100"
spec:
  weight: 1
Just as you pointed out about the paper Hierarchical Scheduling for Diverse Datacenter Workloads, I went through the whole example and the discussion is all around the same weight at the same level. As far as I can see, queues with different weights at the same level under the same parent should be allocated dominant resources in the proportions their weights indicate. I'm trying to reproduce the phenomenon you described. Another question: does your cluster contain only one node with 12 CPUs and 8G of memory, or more than one node?
@Thor-wl Have you reproduced this issue in your environment?
@Thor-wl I am testing locally to make sure I understand how it works and whether it fulfills my needs before deploying on a multi-node cluster. The 12 CPUs and 8G of memory are a single-node kubernetes cluster.
If I change the weight of big-queue to 100, the results are the same. Also, I thought hdrf ignores spec["weight"] and only uses the value in metadata["annotations"]["volcano.sh/hierarchy-weights"]. In either case, setting the following for big-queue is the same as the one I provided:
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: big-queue
  annotations:
    "volcano.sh/hierarchy": "root/dev/big"
    "volcano.sh/hierarchy-weights": "1/1/100"
spec:
  weight: 100
I think the easiest way to replicate the problem is to adjust the unit test slightly.
https://github.com/volcano-sh/volcano/blob/7e1e6960c61e2536f037de8567178a8e9d5f7cba/pkg/scheduler/plugins/drf/hdrf_test.go#L157-L159
Change line 158 to util.BuildResourceList("30001m", "30G"). The result of the test should be the same: it should evenly distribute the resources between all queues, with an extra "wasted" 1m of CPU. However, when this is changed the unit test fails because it allocates some queues more resources than they deserve.
Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).
@Thor-wl Im facing same problem as above. Any workaround for now?
I'm sorry, not yet. I will do that after v1.6 is released.