
Volcano TorchX jobs on GKE do not trigger GPU node pool scale-up

Open · obeyda opened this issue 2 years ago · 1 comment

We are trying to use Volcano to run TorchX jobs in a GKE cluster, but none of the jobs trigger a scale-up on our GPU node pools.

If we manually scale up the targeted GPU node pool, the jobs are scheduled as desired. However, we can't afford to keep the nodes up all the time, so we need the Volcano jobs to trigger the automatic scale-up themselves.

The default queue
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  generation: 11
  name: default
status:
  allocated:
    cpu: '0'
    memory: '0'
  pending: 1
  reservation: {}
  state: Open
spec:
  capability:
    cpu: '500'
    memory: 600Gi
    nvidia.com/gpu: '40'
  guarantee: {}
  reclaimable: true
  weight: 1
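
For reference, the queue does not appear to be the limiting factor here: its capability is far above what a single replica of the job requests. A rough sanity check (plain Python, not Volcano code; the quantities are copied from the YAML in this issue):

```python
# Rough sanity check (not Volcano code): compare the queue's capability
# with what one replica of the job below requests. Values are copied from
# the YAML in this issue; quantity parsing is simplified for illustration.

def to_millicores(cpu: str) -> int:
    """Convert a Kubernetes CPU quantity ('500' or '43900m') to millicores."""
    return int(cpu[:-1]) if cpu.endswith("m") else int(cpu) * 1000

queue_capability = {"cpu": "500", "nvidia.com/gpu": "40"}
job_requests = {"cpu": "43900m", "nvidia.com/gpu": "4"}

assert to_millicores(job_requests["cpu"]) <= to_millicores(queue_capability["cpu"])
assert int(job_requests["nvidia.com/gpu"]) <= int(queue_capability["nvidia.com/gpu"])
print("queue has headroom for one replica")
```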

The Volcano job
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: myvcjob-jl7rjr9wqc35x
  namespace: myapp-2681
status:
  conditions:
    - lastTransitionTime: '2024-01-23T19:59:03Z'
      status: Pending
    - lastTransitionTime: '2024-01-23T20:00:56Z'
      status: Running
    - lastTransitionTime: '2024-01-23T20:27:44Z'
      status: Failed
  minAvailable: 1
  runningDuration: 42m13.924661549s
  state:
    lastTransitionTime: '2024-01-23T20:27:44Z'
    phase: Failed
  version: 4
spec:
  maxRetry: 3
  minAvailable: 1
  plugins:
    env: []
    svc:
      - '--publish-not-ready-addresses'
  queue: default
  schedulerName: volcano
  tasks:
    - maxRetry: 3
      minAvailable: 1
      name: myvcjob-0
      replicas: 1
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: 'false'
          labels:
            app.kubernetes.io/instance: myvcjob-jl7rjr9wqc35x
            app.kubernetes.io/managed-by: torchx.pytorch.org
            app.kubernetes.io/name: app
            beta.kubernetes.io/instance-type: g4-standard-48
            cloud.google.com/gke-gpu: 'true'
            cloud.google.com/gke-spot: 'false'
            node.kubernetes.io/instance-type: g4-standard-48
            nvidia.com/gpu: present
            provisioning-model: on-demand
            sku: L4-Quad
            torchx.pytorch.org/app-name: app
            torchx.pytorch.org/replica-id: '0'
            torchx.pytorch.org/role-index: '0'
            torchx.pytorch.org/role-name: app
            torchx.pytorch.org/version: 0.6.0
            volcano.sh/gpu-memory: '40000'
        spec:
          affinity: {}
          containers:
            - command:
                - bash
                - '-c'
                - >-
                  newrelic-admin run-program torchrun --rdzv_backend c10d
                  --rdzv_endpoint localhost:0 --rdzv_id
                  'myvcjob-jl7rjr9wqc35x' --nnodes 1 --nproc_per_node 4
                  --tee 3 --role '' -m
                  myapp.app.components.app --job_id
                  f5b3d845-e111-49bf-92fd-c935e50fa0da
              env:
                - name: TORCHX_TRACKING_EXPERIMENT_NAME
                  value: default-experiment
                - name: LOGLEVEL
                  value: WARNING
                - name: TORCHX_JOB_ID
                  value: kubernetes://torchx/myvcjob-jl7rjr9wqc35x
                - name: TORCHX_RANK0_HOST
                  value: localhost
              image: >-
                myimage:mytag
              name: myvcjob-0
              ports:
                - containerPort: 29500
                  name: c10d
                  protocol: TCP
              resources:
                limits:
                  cpu: '44'
                  memory: 45G
                  nvidia.com/gpu: '4'
                requests:
                  cpu: 43900m
                  memory: 43976M
                  nvidia.com/gpu: '4'
              securityContext: {}
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
          restartPolicy: Never
          tolerations:
            - effect: NoSchedule
              key: sku
              operator: Equal
              value: L4-Quad
            - effect: NoSchedule
              key: cloud.google.com/gke-spot
              operator: Equal
              value: 'false'
            - effect: NoSchedule
              key: nvidia.com/gpu
              operator: Equal
              value: present
          volumes:
            - emptyDir:
                medium: Memory
              name: dshm
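
As far as we know, the cluster autoscaler only scales a node pool up when it simulates that the pending pod would fit on a fresh node from that pool. A minimal sketch of that fit check, using hypothetical allocatable figures for a g4-standard-48 node (these figures are assumptions, not measured values; `kubectl describe node` shows the real allocatable resources):

```python
# Simplified sketch of the cluster autoscaler's fit simulation: a pending pod
# triggers a scale-up only if its requests fit within the allocatable
# resources of a template node for some node pool. The node figures below are
# ASSUMPTIONS for illustration; check `kubectl describe node` for real values.

pod_requests = {                     # taken from the job spec above
    "cpu_m": 43900,                  # 43900m
    "memory_bytes": 43976 * 10**6,   # 43976M
    "gpus": 4,                       # nvidia.com/gpu: '4'
}

assumed_node_allocatable = {         # hypothetical g4-standard-48 figures
    "cpu_m": 47810,                  # a bit under 48 vCPU after reservations
    "memory_bytes": 190 * 2**30,
    "gpus": 4,
}

def fits(pod: dict, node: dict) -> bool:
    """True if every pod request fits within the node's allocatable."""
    return all(pod[k] <= node[k] for k in pod)

print("would trigger scale-up:", fits(pod_requests, assumed_node_allocatable))
```

If a request exceeds the template node's allocatable (e.g. the CPU request sitting too close to the machine's full vCPU count), the autoscaler silently declines to scale that pool.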

Scheduler Config
actions: "enqueue,allocate,reclaim,backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
    enablePreemptable: false
  - name: conformance
- plugins:
  - name: overcommit
  - name: drf
    enablePreemptable: false
  - name: predicates
    arguments:
      predicate.VGPUEnable: true
  - name: proportion
  - name: nodeorder
  - name: binpack

This only happens with Volcano jobs; any other pod we create triggers the scale-up without issues.

obeyda avatar Jan 24 '24 13:01 obeyda

Hi, can you paste the pod YAML output? I see that the job entered a failed state:

    - lastTransitionTime: '2024-01-23T20:27:44Z'
      status: Failed

Monokaix avatar Jan 25 '24 02:01 Monokaix