Volcano restarts completed pods after node is garbage collected
What happened: Suppose a job's task pods run on multiple nodes, say node A and node B. If all pods on node A completed and node A is then removed by the cluster autoscaler (in my case Karpenter) while some pods on node B are still running, so the job is still in the Running state, Volcano loses track of the completed pods on node A and reschedules them.
Below is the history of vcctl get job output showing this behavior: Succeeded decreased from 393 to 181, while Pending increased accordingly.
Timestamp: Tue Jan 9 08:50:27 CST 2024
Name Creation Phase JobType Replicas Min Pending Running Succeeded Failed Unknown RetryCount
perform-reco-tlyyiv 2024-01-09 Running Batch 574 1 132 30 389 0 23 0
--------------------------
Timestamp: Tue Jan 9 08:50:43 CST 2024
Name Creation Phase JobType Replicas Min Pending Running Succeeded Failed Unknown RetryCount
perform-reco-tlyyiv 2024-01-09 Running Batch 574 1 128 43 393 0 10 0
--------------------------
Timestamp: Tue Jan 9 08:51:00 CST 2024
Name Creation Phase JobType Replicas Min Pending Running Succeeded Failed Unknown RetryCount
perform-reco-tlyyiv 2024-01-09 Running Batch 574 1 173 53 347 0 1 0
--------------------------
Timestamp: Tue Jan 9 08:51:16 CST 2024
Name Creation Phase JobType Replicas Min Pending Running Succeeded Failed Unknown RetryCount
perform-reco-tlyyiv 2024-01-09 Running Batch 574 1 206 58 309 0 1 0
--------------------------
Timestamp: Tue Jan 9 08:51:32 CST 2024
Name Creation Phase JobType Replicas Min Pending Running Succeeded Failed Unknown RetryCount
perform-reco-tlyyiv 2024-01-09 Running Batch 574 1 259 70 245 0 0 0
--------------------------
Timestamp: Tue Jan 9 08:51:49 CST 2024
Name Creation Phase JobType Replicas Min Pending Running Succeeded Failed Unknown RetryCount
perform-reco-tlyyiv 2024-01-09 Running Batch 574 1 244 69 261 0 0 0
--------------------------
Timestamp: Tue Jan 9 08:52:05 CST 2024
Name Creation Phase JobType Replicas Min Pending Running Succeeded Failed Unknown RetryCount
perform-reco-tlyyiv 2024-01-09 Running Batch 574 1 328 65 181 0 0 0
--------------------------
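For concreteness, the regression can be read straight off the snapshots (a quick arithmetic check; values copied from the table above):

```python
# Deltas between the 08:50:43 peak and the 08:52:05 snapshot above
succeeded_drop = 393 - 181  # 212 previously-succeeded pods no longer counted
pending_rise = 328 - 128    # 200 of them show up back in Pending
running_rise = 65 - 43      # and Running grew as some were already rescheduled
print(succeeded_drop, pending_rise, running_rise)  # -> 212 200 22
```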
What you expected to happen: Volcano shouldn't rerun succeeded pods.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
        enablePreemptable: false
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
kind: ConfigMap
Environment:
- Volcano Version: 1.8.1
- Kubernetes version (use kubectl version): v1.28.4-eks-8cb36c9
- Cloud provider or hardware configuration: EKS
- OS (e.g. from /etc/os-release):
- Kernel (e.g. uname -a):
- Install tools:
- Others:
Hi, can you paste your vcjob yaml?
When a node is garbage collected, the pods on that node are evicted by the cloud platform or kube-controller-manager, and the job controller creates a new pod once the pod-delete event is received. This is the standard controller/operator mechanism at work; I think if we add special handling here, it would break that rule.
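The mechanism described above can be sketched as a level-triggered reconcile loop (a minimal illustration in Python, an assumption for clarity rather than Volcano's actual code): the controller compares desired replicas only against pod objects that still exist, so a succeeded pod deleted along with its garbage-collected node is indistinguishable from a pod that was never created.

```python
# Illustrative sketch (NOT Volcano's real implementation) of why a job
# controller recreates pods whose node was garbage collected: reconcile
# only sees pods that currently exist in the API server.

def reconcile(desired_replicas, existing_pods):
    """Return the task indices that need a (new) pod created."""
    existing = {pod["index"] for pod in existing_pods}
    return [i for i in range(desired_replicas) if i not in existing]

# Pods 0 and 1 succeeded on node A; node A is garbage collected and their
# pod objects are deleted. Only node B's running pod remains visible:
remaining = [{"index": 2, "phase": "Running"}]
print(reconcile(3, remaining))  # -> [0, 1]: the succeeded pods are recreated
```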
Are there any workarounds?
Hi, can you paste your vcjob yaml?
Yeah, something like this:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: test-vj-d18igu
spec:
  maxRetry: 3
  minAvailable: 1
  minSuccess: 1000
  plugins:
    env: []
    ssh: []
    svc: []
  policies: []
  priorityClassName: high-priority
  queue: default
  schedulerName: volcano
  tasks:
  - maxRetry: 1
    minAvailable: 1
    name: test-vj
    replicas: 1000
    template:
      metadata:
        name: test-vj
      spec:
        containers:
        - command:
          - bash
          - -c
          - |
            sleep $((RANDOM % 60 + 1))
          env: []
          image: python
          imagePullPolicy: IfNotPresent
          name: test-vj
          resources:
            requests:
              cpu: 1
        restartPolicy: Never