Volcano Torchx job scaling issue on GKE
We are trying to use Volcano to run Torchx jobs in a GKE cluster, but none of the jobs trigger a scale-up of our GPU node pools.
If we manually scale up the targeted GPU node pool, the jobs are scheduled as expected. However, we can't afford to keep the nodes up all the time, so we need the Volcano jobs themselves to trigger the automatic scale-up.
The Default Queue
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  generation: 11
  name: default
status:
  allocated:
    cpu: '0'
    memory: '0'
  pending: 1
  reservation: {}
  state: Open
spec:
  capability:
    cpu: '500'
    memory: 600Gi
    nvidia.com/gpu: '40'
  guarantee: {}
  reclaimable: true
  weight: 1
The Volcano job
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: myvcjob-jl7rjr9wqc35x
  namespace: myapp-2681
status:
  conditions:
    - lastTransitionTime: '2024-01-23T19:59:03Z'
      status: Pending
    - lastTransitionTime: '2024-01-23T20:00:56Z'
      status: Running
    - lastTransitionTime: '2024-01-23T20:27:44Z'
      status: Failed
  minAvailable: 1
  runningDuration: 42m13.924661549s
  state:
    lastTransitionTime: '2024-01-23T20:27:44Z'
    phase: Failed
  version: 4
spec:
  maxRetry: 3
  minAvailable: 1
  plugins:
    env: []
    svc:
      - '--publish-not-ready-addresses'
  queue: default
  schedulerName: volcano
  tasks:
    - maxRetry: 3
      minAvailable: 1
      name: myvcjob-0
      replicas: 1
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: 'false'
          labels:
            app.kubernetes.io/instance: myvcjob-jl7rjr9wqc35x
            app.kubernetes.io/managed-by: torchx.pytorch.org
            app.kubernetes.io/name: app
            beta.kubernetes.io/instance-type: g4-standard-48
            cloud.google.com/gke-gpu: 'true'
            cloud.google.com/gke-spot: 'false'
            node.kubernetes.io/instance-type: g4-standard-48
            nvidia.com/gpu: present
            provisioning-model: on-demand
            sku: L4-Quad
            torchx.pytorch.org/app-name: app
            torchx.pytorch.org/replica-id: '0'
            torchx.pytorch.org/role-index: '0'
            torchx.pytorch.org/role-name: app
            torchx.pytorch.org/version: 0.6.0
            volcano.sh/gpu-memory: '40000'
        spec:
          affinity: {}
          containers:
            - command:
                - bash
                - '-c'
                - >-
                  newrelic-admin run-program torchrun --rdzv_backend c10d
                  --rdzv_endpoint localhost:0 --rdzv_id
                  'myvcjob-jl7rjr9wqc35x' --nnodes 1 --nproc_per_node 4
                  --tee 3 --role '' -m
                  myapp.app.components.app --job_id
                  f5b3d845-e111-49bf-92fd-c935e50fa0da
              env:
                - name: TORCHX_TRACKING_EXPERIMENT_NAME
                  value: default-experiment
                - name: LOGLEVEL
                  value: WARNING
                - name: TORCHX_JOB_ID
                  value: kubernetes://torchx/myvcjob-jl7rjr9wqc35x
                - name: TORCHX_RANK0_HOST
                  value: localhost
              image: >-
                myimage:mytag
              name: myvcjob-0
              ports:
                - containerPort: 29500
                  name: c10d
                  protocol: TCP
              resources:
                limits:
                  cpu: '44'
                  memory: 45G
                  nvidia.com/gpu: '4'
                requests:
                  cpu: 43900m
                  memory: 43976M
                  nvidia.com/gpu: '4'
              securityContext: {}
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
          restartPolicy: Never
          tolerations:
            - effect: NoSchedule
              key: sku
              operator: Equal
              value: L4-Quad
            - effect: NoSchedule
              key: cloud.google.com/gke-spot
              operator: Equal
              value: 'false'
            - effect: NoSchedule
              key: nvidia.com/gpu
              operator: Equal
              value: present
          volumes:
            - emptyDir:
                medium: Memory
              name: dshm
Scheduler Config
actions: "enqueue,allocate,reclaim,backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
    enablePreemptable: false
  - name: conformance
- plugins:
  - name: overcommit
  - name: drf
    enablePreemptable: false
  - name: predicates
    arguments:
      predicate.VGPUEnable: true
  - name: proportion
  - name: nodeorder
  - name: binpack
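For reference, this config is loaded from the Volcano scheduler ConfigMap. A minimal sketch of how ours is applied, assuming a default Volcano install (the ConfigMap name, namespace, and data key below are the install defaults, not anything specific to our setup):

apiVersion: v1
kind: ConfigMap
metadata:
  # default names from the standard Volcano install; adjust if customized
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue,allocate,reclaim,backfill"
    # tiers identical to the config shown above
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
        enablePreemptable: false
      - name: predicates
        arguments:
          predicate.VGPUEnable: true
      - name: proportion
      - name: nodeorder
      - name: binpack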
This only happens with Volcano jobs; any other pod we create triggers the scale-up without issues.
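For comparison, a plain GPU pod along these lines does trigger the node pool scale-up. This is a simplified sketch with placeholder names and a generic image, not our exact manifest:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-scaleup-test   # placeholder name
  namespace: myapp-2681
spec:
  # no schedulerName set, so the default kube-scheduler handles it
  containers:
    - name: cuda-test
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ['sleep', 'infinity']
      resources:
        requests:
          nvidia.com/gpu: '4'
        limits:
          nvidia.com/gpu: '4'
  tolerations:
    - effect: NoSchedule
      key: sku
      operator: Equal
      value: L4-Quad
    - effect: NoSchedule
      key: nvidia.com/gpu
      operator: Equal
      value: present
  restartPolicy: Never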
Hi, can you paste the pod YAML output? I see that the job entered a Failed state.
- lastTransitionTime: '2024-01-23T20:27:44Z'
  status: Failed