Extremely poor performance (client-side throttling)?

Open SebMuir-Smith opened this issue 2 years ago • 3 comments

What happened: A barebones install of the latest Volcano cannot keep up with a reasonable amount of load. Submitting a relatively small number of jobs/pods (~20*100), which kube-scheduler handles easily, causes Volcano to lock up and stop working properly.

The root cause seems to be the job/pod admission validation webhooks timing out. The admission controller emits many log lines like the one below, which may explain the timeouts:

Waited for 19.550979481s due to client-side throttling, not priority and fairness, request: GET:https://<ip>:443/apis/scheduling.volcano.sh/v1beta1/namespaces/<namespace>/podgroups/0-test--108-3bdcc5aa-dfb7-4150-a472-deb9f3b35c99

What you expected to happen: Volcano can schedule 2000+ pods without locking up.

How to reproduce it (as minimally and precisely as possible):

Install Volcano with the following resources for each component:

  admission_resources:
    requests:
      cpu: 2000m
      memory: 8G
    limits:
      cpu: 2000m
      memory: 8G
  scheduler_resources:
    requests:
      cpu: 2000m
      memory: 8G
    limits:
      cpu: 2000m
      memory: 8G
  controller_resources:
    requests:
      cpu: 2000m
      memory: 8G
    limits:
      cpu: 2000m
      memory: 8G

Create a new job spec like:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: test
spec:
  minAvailable: 0
  schedulerName: volcano
  maxRetry: 0
  tasks:
    - replicas: 1
      minAvailable: 1
      name: "ubuntu"
      template:
        metadata:
          name: web
        spec:
          schedulerName: volcano
          containers:
            - image: ubuntu
              imagePullPolicy: IfNotPresent
              name: ubuntu
              resources:
                limits:
                  memory: "1G"
                  cpu: "0.2"
              command:
                - "sleep"
                - "604800"
          restartPolicy: OnFailure
          terminationGracePeriodSeconds: 1

Then submit this job more than 20 times (with different names) over a few minutes. The scheduler starts to lock up, fails to schedule the majority of the pods (or even move them to Pending), and validation webhook calls sometimes time out.
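For anyone trying to reproduce this, the repeated submission could be scripted along the following lines. This is only a sketch: `render_jobs`, `submit`, and the `test-<n>` naming scheme are hypothetical helpers that were not part of the original report, the template is an abbreviated copy of the job spec above, and `submit` assumes `kubectl` is on the PATH and pointed at the test cluster.

```python
import subprocess

# Abbreviated copy of the job spec above; only metadata.name is templated.
JOB_TEMPLATE = """\
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: {name}
spec:
  minAvailable: 0
  schedulerName: volcano
  maxRetry: 0
  tasks:
    - replicas: 1
      minAvailable: 1
      name: "ubuntu"
      template:
        spec:
          schedulerName: volcano
          containers:
            - image: ubuntu
              imagePullPolicy: IfNotPresent
              name: ubuntu
              resources:
                limits:
                  memory: "1G"
                  cpu: "0.2"
              command: ["sleep", "604800"]
          restartPolicy: OnFailure
          terminationGracePeriodSeconds: 1
"""


def render_jobs(count, prefix="test"):
    """Return `count` manifests that differ only in metadata.name."""
    return [JOB_TEMPLATE.format(name=f"{prefix}-{i}") for i in range(count)]


def submit(manifest):
    """Pipe one manifest to `kubectl apply -f -` (assumes kubectl is configured)."""
    subprocess.run(
        ["kubectl", "apply", "-f", "-"],
        input=manifest,
        text=True,
        check=True,
    )
```

Calling `submit(m)` for each `m` in `render_jobs(25)`, with a short pause between calls, approximates the "more than 20 times over a few minutes" pattern described above.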

Anything else we need to know?:

Environment:

  • Volcano Version: 1.8.2
  • Kubernetes version (use kubectl version): v1.29.0-eks-c417bb3
  • Cloud provider or hardware configuration: AWS EKS
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

SebMuir-Smith avatar Feb 06 '24 05:02 SebMuir-Smith

It looks like there are many places in the admission validation webhooks where Volcano issues Kubernetes API GETs/POSTs directly, rather than using a more performant mechanism such as informers. This is likely causing the slowdown. Does Volcano have any plans to migrate to informers in these areas?

SebMuir-Smith avatar Feb 06 '24 22:02 SebMuir-Smith
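To illustrate the difference being described, here is a minimal, self-contained sketch of the informer idea: list once, keep a local store up to date from watch events, and serve lookups from memory. It is not Volcano or client-go code. `FakeAPIServer` and `PodGroupCache` are hypothetical stand-ins used only to count round trips.

```python
class FakeAPIServer:
    """Hypothetical stand-in for kube-apiserver that counts round trips."""

    def __init__(self):
        self.podgroups = {}
        self.get_calls = 0
        self.list_calls = 0

    def get(self, name):
        # Direct lookup: one API round trip per call.
        self.get_calls += 1
        return self.podgroups.get(name)

    def list(self):
        # Full listing: one API round trip, returns a snapshot.
        self.list_calls += 1
        return dict(self.podgroups)


class PodGroupCache:
    """Informer-style local store: one initial LIST, then watch events."""

    def __init__(self, api):
        self.store = api.list()

    def on_event(self, kind, name, obj=None):
        # Apply a watch event (ADDED, MODIFIED, or DELETED) to the local store.
        if kind == "DELETED":
            self.store.pop(name, None)
        else:
            self.store[name] = obj

    def get(self, name):
        # Served from memory; no API round trip.
        return self.store.get(name)
```

With direct calls, validating 2000 pods costs 2000 GETs, all subject to the client-side rate limiter. With the cache, the same lookups cost a single LIST plus whatever watch traffic keeps the store current.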

Confirmed that the main performance issues were resolved by removing the parts of the webhooks that call the k8s API directly.

SebMuir-Smith avatar Feb 07 '24 00:02 SebMuir-Smith

Hi, you can try increasing the --kube-api-qps and --kube-api-burst parameters of the admission component to get better performance against kube-apiserver.

Monokaix avatar Feb 19 '24 07:02 Monokaix
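As a rough way to see why these two parameters matter, the sketch below models a token-bucket rate limiter of the kind client-go uses for client-side throttling. It assumes all requests arrive at the same instant, and `throttle_waits` is a hypothetical helper. The QPS and burst values mentioned afterwards are illustrative only and are not claimed to be Volcano's defaults.

```python
def throttle_waits(num_requests, qps, burst):
    """Return the wait in seconds each request sees if all arrive at t=0.

    The bucket starts with `burst` tokens and refills at `qps` tokens
    per second; each request consumes one token.
    """
    waits = []
    for k in range(num_requests):
        # The first `burst` requests use the initial tokens and do not wait.
        # Each later request waits for one more refilled token.
        waits.append(max(0.0, (k - burst + 1) / qps))
    return waits
```

Under this model, 100 simultaneous requests with a QPS of 5 and a burst of 10 leave the last request waiting 18 seconds, which is the same order of magnitude as the 19.55s in the log line above. Raising both values shortens the queue, although it does not reduce the number of API calls the webhooks make.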