Extremely poor performance (client-side throttling)?
What happened:
A barebones install of the latest Volcano cannot keep up with a reasonable amount of load. Submitting a relatively small number of jobs/pods (~20*100), which kube-scheduler handles easily, causes Volcano to lock up and stop working properly.
The root cause seems to be the job/pod admission validation webhooks timing out. The admission controller emits many log lines like the one below, which point to client-side throttling:
Waited for 19.550979481s due to client-side throttling, not priority and fairness, request: GET:https://<ip>:443/apis/scheduling.volcano.sh/v1beta1/namespaces/<namespace>/podgroups/0-test--108-3bdcc5aa-dfb7-4150-a472-deb9f3b35c99
What you expected to happen: Volcano can schedule ~ 2000+ pods, without locking up.
How to reproduce it (as minimally and precisely as possible):
admission_resources:
  requests:
    cpu: 2000m
    memory: 8G
  limits:
    cpu: 2000m
    memory: 8G
scheduler_resources:
  requests:
    cpu: 2000m
    memory: 8G
  limits:
    cpu: 2000m
    memory: 8G
controller_resources:
  requests:
    cpu: 2000m
    memory: 8G
  limits:
    cpu: 2000m
    memory: 8G
Create a new job spec like:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: test
spec:
  minAvailable: 0
  schedulerName: volcano
  maxRetry: 0
  tasks:
    - replicas: 1
      minAvailable: 1
      name: "ubuntu"
      template:
        metadata:
          name: web
        spec:
          schedulerName: volcano
          containers:
            - image: ubuntu
              imagePullPolicy: IfNotPresent
              name: ubuntu
              resources:
                limits:
                  memory: "1G"
                  cpu: "0.2"
              command:
                - "sleep"
                - "604800"
          restartPolicy: OnFailure
          terminationGracePeriodSeconds: 1
Then submit this job more than 20 times (with different names) over a few minutes. The scheduler starts to lock up and fails to schedule the majority of the pods (or even move them to Pending), and the validation webhooks sometimes time out.
Anything else we need to know?:
Environment:
- Volcano Version: 1.8.2
- Kubernetes version (use kubectl version): v1.29.0-eks-c417bb3
- Cloud provider or hardware configuration: AWS EKS
- OS (e.g. from /etc/os-release):
- Kernel (e.g. uname -a):
- Install tools:
- Others:
It looks like there are many places in the admission validation webhooks where Volcano issues Kubernetes API GETs/POSTs directly, rather than using a more performant mechanism such as informers. This is likely causing the slowdown. Does Volcano have any plans to migrate to informers in these areas?
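To make the distinction concrete, here is a toy, stdlib-only model of the informer idea. It is not client-go and not Volcano code: `PodGroup`, `Event` and `Store` are hypothetical stand-ins. The point it illustrates is that a single watch stream keeps a local cache current, so each webhook lookup is a map read instead of a rate-limited GET.

```go
package main

import (
	"fmt"
	"sync"
)

// PodGroup is a simplified stand-in for the real PodGroup object.
type PodGroup struct {
	Namespace, Name, Phase string
}

// Event is a simplified watch event (hypothetical, not the client-go type).
type Event struct {
	Deleted bool
	Obj     PodGroup
}

// Store is a local cache kept in sync by one stream of watch events.
type Store struct {
	mu    sync.RWMutex
	items map[string]PodGroup
}

func NewStore() *Store { return &Store{items: map[string]PodGroup{}} }

func key(ns, name string) string { return ns + "/" + name }

// Apply folds one add/update/delete event into the cache.
func (s *Store) Apply(e Event) {
	s.mu.Lock()
	defer s.mu.Unlock()
	k := key(e.Obj.Namespace, e.Obj.Name)
	if e.Deleted {
		delete(s.items, k)
		return
	}
	s.items[k] = e.Obj
}

// Get answers from memory: no API request is made, so the client-side
// rate limiter is never involved in the lookup.
func (s *Store) Get(ns, name string) (PodGroup, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	pg, ok := s.items[key(ns, name)]
	return pg, ok
}

func main() {
	store := NewStore()
	// In a real informer these events would arrive from a watch.
	store.Apply(Event{Obj: PodGroup{"default", "pg-a", "Pending"}})
	store.Apply(Event{Obj: PodGroup{"default", "pg-a", "Running"}})
	if pg, ok := store.Get("default", "pg-a"); ok {
		fmt.Println(pg.Name, pg.Phase)
	}
}
```

The trade-off is that cached reads can be slightly stale, which matters for validation logic that needs the latest state of an object.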
Confirmed that main performance issues were resolved by removing the parts of the webhooks that call the k8s api directly.
Hi, you can try increasing the --kube-api-qps and --kube-api-burst parameters of the admission component to get better throughput against kube-apiserver.
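As a rough illustration of why these two parameters matter, the sketch below models the client-side limiter as a simple token bucket. This is a simplified model, not the actual client-go implementation, and `waitSeconds` is a hypothetical helper. The 5/10 pair is the client-go library default for QPS and burst and is used only as an example; the admission component may be configured differently. The other pairs are arbitrary.

```go
package main

import "fmt"

// waitSeconds estimates how long the n-th request (1-based) of a sudden
// backlog waits behind a token bucket that starts full with `burst`
// tokens and refills at `qps` tokens per second.
func waitSeconds(n int, qps float64, burst int) float64 {
	if n <= burst {
		return 0 // served immediately from the initial burst
	}
	return float64(n-burst) / qps
}

func main() {
	configs := []struct {
		qps   float64
		burst int
	}{
		{5, 10},    // client-go library defaults (illustrative)
		{50, 100},  // arbitrary higher setting
		{200, 400}, // arbitrary higher setting
	}
	for _, c := range configs {
		fmt.Printf("qps=%v burst=%d: request #110 waits ~%.1fs\n",
			c.qps, c.burst, waitSeconds(110, c.qps, c.burst))
	}
}
```

Under this model, a backlog of about 110 requests at 5 QPS already produces waits of around 20 seconds, the same order as the 19.55s in the log above. Raising the limits shortens the queue but moves the load to kube-apiserver.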