
Volcano restarts completed pods after a node is garbage collected

Open Nan2018 opened this issue 1 year ago • 4 comments

What happened: When a job's task pods run on multiple nodes, say node A and node B, all pods on node A may complete while some pods on node B are still running, so the job stays in the Running state. If node A is then removed by the cluster autoscaler (in my case Karpenter), Volcano loses track of the completed pods that were on node A and reschedules them.

Below is the history of vcctl get job output showing this behavior: Succeeded dropped from 393 to 181 and Pending increased accordingly.

Timestamp: Tue Jan  9 08:50:27 CST 2024
Name                  Creation       Phase       JobType     Replicas    Min   Pending   Running   Succeeded   Failed    Unknown     RetryCount
perform-reco-tlyyiv   2024-01-09     Running     Batch       574         1     132       30        389         0         23          0         
--------------------------
Timestamp: Tue Jan  9 08:50:43 CST 2024
Name                  Creation       Phase       JobType     Replicas    Min   Pending   Running   Succeeded   Failed    Unknown     RetryCount
perform-reco-tlyyiv   2024-01-09     Running     Batch       574         1     128       43        393         0         10          0         
--------------------------
Timestamp: Tue Jan  9 08:51:00 CST 2024
Name                  Creation       Phase       JobType     Replicas    Min   Pending   Running   Succeeded   Failed    Unknown     RetryCount
perform-reco-tlyyiv   2024-01-09     Running     Batch       574         1     173       53        347         0         1           0         
--------------------------
Timestamp: Tue Jan  9 08:51:16 CST 2024
Name                  Creation       Phase       JobType     Replicas    Min   Pending   Running   Succeeded   Failed    Unknown     RetryCount
perform-reco-tlyyiv   2024-01-09     Running     Batch       574         1     206       58        309         0         1           0         
--------------------------
Timestamp: Tue Jan  9 08:51:32 CST 2024
Name                  Creation       Phase       JobType     Replicas    Min   Pending   Running   Succeeded   Failed    Unknown     RetryCount
perform-reco-tlyyiv   2024-01-09     Running     Batch       574         1     259       70        245         0         0           0         
--------------------------
Timestamp: Tue Jan  9 08:51:49 CST 2024
Name                  Creation       Phase       JobType     Replicas    Min   Pending   Running   Succeeded   Failed    Unknown     RetryCount
perform-reco-tlyyiv   2024-01-09     Running     Batch       574         1     244       69        261         0         0           0         
--------------------------
Timestamp: Tue Jan  9 08:52:05 CST 2024
Name                  Creation       Phase       JobType     Replicas    Min   Pending   Running   Succeeded   Failed    Unknown     RetryCount
perform-reco-tlyyiv   2024-01-09     Running     Batch       574         1     328       65        181         0         0           0         
--------------------------
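For context, the snapshots above were taken roughly every 15 seconds with a simple polling loop like the one below. The exact subcommand is an assumption; current vcctl releases list jobs with vcctl job list, which prints the same columns.

# Poll the job status every ~15s, mirroring the snapshots above.
while true; do
  echo "Timestamp: $(date)"
  vcctl job list
  echo "--------------------------"
  sleep 15
done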

What you expected to happen: Volcano shouldn't re-run pods that have already succeeded.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
        enablePreemptable: false
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
kind: ConfigMap

Environment:

  • Volcano Version: 1.8.1
  • Kubernetes version (use kubectl version): v1.28.4-eks-8cb36c9
  • Cloud provider or hardware configuration: eks
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

Nan2018 • Jan 10 '24 00:01

Hi, can you paste your vcjob yaml?

Monokaix • Jan 10 '24 03:01

When a node is garbage collected, the pods on that node are evicted by the cloud platform or kube-controller-manager, and the job controller creates a new pod once the pod delete event is received. So this is the standard controller/operator mechanism; I think if we added special handling for this case, it would break that rule.
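For illustration only, the delete-event path described here looks roughly like the client-go sketch below. It just logs the point where a job-style controller would re-sync and recreate the pod; the names and kubeconfig path are assumptions, not Volcano's actual job controller code.

package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (assumed default path).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) {
			pod, ok := obj.(*corev1.Pod)
			if !ok {
				// Tombstone (DeletedFinalStateUnknown); ignored in this sketch.
				return
			}
			// When a node is removed, its pods disappear and the controller
			// only sees this delete event. A controller that re-syncs toward
			// the declared replica count would recreate the pod at this point,
			// even if its last known phase was Succeeded.
			fmt.Printf("pod %s/%s deleted (last phase: %s)\n",
				pod.Namespace, pod.Name, pod.Status.Phase)
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	cache.WaitForCacheSync(stop, podInformer.HasSynced)
	select {} // run until interrupted
}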

Monokaix • Jan 10 '24 07:01

Are there any workarounds?

Nan2018 • Jan 10 '24 17:01

Hi, can you paste your vcjob yaml?

Yeah, something like this:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: test-vj-d18igu
spec:
  maxRetry: 3
  minAvailable: 1
  minSuccess: 1000
  plugins:
    env: []
    ssh: []
    svc: []
  policies: []
  priorityClassName: high-priority
  queue: default
  schedulerName: volcano
  tasks:
  - maxRetry: 1
    minAvailable: 1
    name: test-vj
    replicas: 1000
    template:
      metadata:
        name: test-vj
      spec:
        containers:
        - command:
          - bash
          - -c
          - |
            sleep $((RANDOM % 60 + 1))
          env: []
          image: python
          imagePullPolicy: IfNotPresent
          name: test-vj
          resources:
            requests:
              cpu: 1
        restartPolicy: Never
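
For reference, a manifest like this can be applied and watched with standard commands; the file name below is a placeholder, and vcjob is the short name of the Volcano Job CRD.

kubectl apply -f test-vj.yaml
kubectl get vcjob test-vj-d18igu -w
vcctl job list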

Nan2018 • Jan 10 '24 21:01