
Volcano scheduler repeatedly adds and removes volume.kubernetes.io/selected-node annotation

Open rooty0 opened this issue 7 months ago • 14 comments

Hi,

I'm running into a strange issue with the Volcano scheduler 1.12.1.

I have a PVC that's been stuck in Pending while trying to provision a local LVM volume. That part is expected and I've identified the root cause.

However, what's unexpected is that the scheduler appears to be stuck in a loop: it adds the volume.kubernetes.io/selected-node annotation with the correct node name, waits about 2 seconds, removes the annotation, waits around 8 seconds, adds the annotation again, and then repeats this over and over.

Is this behavior expected?

Thanks

rooty0 avatar Jun 12 '25 03:06 rooty0

@rooty0 Hi, this is the expected behavior. Before binding the pod to the node, the scheduler first checks whether the PV has been created and whether the PVC has been successfully bound to it; that binding is done by the pv-controller. Since the PVC binding failed in your scenario, the pv-controller removes the volume.kubernetes.io/selected-node annotation. Once the scheduler sees that the volume is not bound, the pod effectively fails to be scheduled, and the scheduler tries to reschedule it. The volume.kubernetes.io/selected-node annotation is updated by the scheduler before it checks whether the PVC is bound, so every time the binding fails there is one removal and one re-add of the annotation. Only when the pv.kubernetes.io/bind-completed annotation is set does it indicate that the PVC has been successfully bound.
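
For example, you can watch the PVC and see the selected-node annotation come and go, and the bind-completed annotation appear only once binding succeeds (just a rough sketch; the PVC name and namespace are placeholders):

$ kubectl get pvc <pvc-name> -n <namespace> -o yaml --watch \
    | grep -E 'volume.kubernetes.io/selected-node|pv.kubernetes.io/bind-completed'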

JesseStutler avatar Jun 12 '25 13:06 JesseStutler

Thanks so much for the detailed explanation, @JesseStutler !

Just to clarify - the PVC is currently in the Pending state. So you're saying it's the Persistent Volume Controller that's actually removing the annotation in this case? That's really helpful to know, thank you!

I had assumed it was the scheduler itself removing the volume.kubernetes.io/selected-node annotation about 1-2 seconds after setting it, which could be confusing for the CSI driver.

rooty0 avatar Jun 12 '25 17:06 rooty0

I just checked the logs, and it turns out it's actually the CSI controller that deletes the annotation - not the PV controller. You were probably referring to the CSI controller earlier, but I missed that and thought you meant the PV controller in the core kube-controller-manager.

Quick follow-up: is it expected for the Volcano scheduler to eventually stop setting the annotation? Starting from yesterday, I saw it repeatedly adding the annotation after it gets removed, but today there haven’t been any changes. The Volcano job is still pending, the PVC is still pending, but the scheduler no longer sets the annotation. Just wondering if that's normal behavior.

rooty0 avatar Jun 12 '25 20:06 rooty0

You can see here: https://github.com/kubernetes/kubernetes/blob/ad4cc125c92246b756186edcded475873fca796f/pkg/controller/volume/persistentvolume/pv_controller.go#L1865-L1885. If the pv controller fails to provision the volume, the AnnSelectedNode annotation will be deleted. You can check whether the Pod has been scheduled; if it has, the PVC annotation will not be removed. @rooty0
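
A rough way to check this (pod name and namespace are placeholders): if spec.nodeName is set and the PodScheduled condition is True, the pod has been bound and the annotation should stop being removed.

$ kubectl get pod <pod-name> -n <namespace> \
    -o jsonpath='{.spec.nodeName}{"  "}{.status.conditions[?(@.type=="PodScheduled")].status}{"\n"}'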

JesseStutler avatar Jun 13 '25 02:06 JesseStutler

Regarding why the scheduler eventually stops setting the annotation, it may be that the pod has been scheduled. Which binding mode did you use, WaitForFirstConsumer or Immediate? I'd also like to know whether the pod, while still pending, has already been scheduled to a node.

JesseStutler avatar Jun 13 '25 02:06 JesseStutler

If the pv controller fails to provision the volume, the AnnSelectedNode annotation will be deleted.

So, correct me if I'm wrong - but the reason I believe the annotation is being removed by the CSI driver and not the core PV controller (based on the code you linked) is what I see in the audit logs: [image]

These are all the logs I'm getting for the PVC object that's currently stuck in Pending. From what I can tell, the annotation is initially set by serviceaccount:volcano-system:volcano-scheduler, and then it's removed by system:serviceaccount:openebs:openebs-lvm-controller-sa.
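
(A rough way to pull these events out of an API-server audit log, assuming the log is available as a local file; the fields used here are the standard audit event fields:)

$ jq 'select(.objectRef.resource=="persistentvolumeclaims" and (.verb=="update" or .verb=="patch"))
      | {user: .user.username, time: .requestReceivedTimestamp}' audit.log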

I haven't yet tracked down exactly where this happens in the OpenEBS CSI provisioner codebase, but the Helm deployment for OpenEBS CSI installs the following RBAC (openebs-lvm-provisioner-role):

- apiGroups: [""]
  resources: ["persistentvolumeclaims"]
  verbs: ["get", "list", "watch", "update"]
- apiGroups: [""]
  resources: ["persistentvolumeclaims/status"]
  verbs: ["update", "patch"]

So this kind of supports my idea that it's performing an update (not a patch) on the PVC and, in doing so, ends up overwriting the object's metadata - which removes the volume.kubernetes.io/selected-node annotation.
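
To illustrate what I mean (hypothetical annotation key, not OpenEBS code): a merge patch only touches the fields it names, so other annotations survive, while a full update submits the entire object, and any annotation missing from the client's in-memory copy gets dropped.

# Merge patch: only example.com/marker is touched, selected-node survives
$ kubectl patch pvc <pvc-name> -n <namespace> --type merge \
    -p '{"metadata":{"annotations":{"example.com/marker":"demo"}}}'
# An update (PUT) instead sends the whole object; if the client built it from a copy
# read before the scheduler added selected-node, that annotation is overwritten away.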


When the Volcano scheduler stops updating the volume.kubernetes.io/selected-node annotation, I can confidently say that none of the following conditions have changed:

  • The Volcano Job is still Pending
  • It only has a single Pod, and that Pod is still Pending
  • One of the Pod’s PVCs is still Pending (out of 4 total - the other 3 are already Bound)

The problematic PVC is using a storage class backed by the local.csi.openebs.io provisioner with WaitForFirstConsumer binding mode.
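
(For reference, the provisioner and binding mode can be confirmed with something like the following; lvm-nvme is the storage class name used here.)

$ kubectl get sc lvm-nvme -o jsonpath='{.provisioner}{"  "}{.volumeBindingMode}{"\n"}'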

Here's an event graph that was showing the annotation being repeatedly set and removed on the PVC. As of now, nothing has changed in terms of the pending resources, but the annotation activity has stopped: [image] The following error has stopped appearing as well:

E0612 15:31:05.387757 1 cache.go:1292] execute preBind failed: binding volumes: provisioning failed for PVC "test-default0-0-data", resync the task

Just to clarify - I do understand why the PVC is stuck in Pending. What I'm trying to figure out is why the Volcano scheduler stops setting the volume.kubernetes.io/selected-node annotation and stops reporting the error above.

I'm not completely sure, but if I remember correctly, we've been seeing this behavior since Volcano 1.10, maybe even starting with 1.9.

rooty0 avatar Jun 13 '25 22:06 rooty0

The only difference I’ve noticed in the volcano scheduler logs is:

  1. Before the scheduler stopped setting the annotation:
I0612 15:30:41.946518       1 session_plugins.go:557] JobOrderFn:using creationTimestamp to order job priority notebook-testing-407ba657-aa1b-4e42-b963-ca6f11a67a0a: 2025-05-05 17:48:23 +0000 UTC -- test-a895933d-d863-4dfe-90c9-4dfa10d056d7: 2025-05-14 16:49:06 +0000 UTC
...
I0612 15:30:41.947869       1 allocate.go:182] Try to allocate resource to 1 tasks of Job <mltraining-dev/test-a895933d-d863-4dfe-90c9-4dfa10d056d7>
"I0612 15:30:41.947880       1 allocate.go:376] There are <216> nodes for Job <mltraining-dev/test-a895933d-d863-4dfe-90c9-4dfa10d056d7>
I0612 15:30:41.949051       1 allocate.go:551] Binding Task <mltraining-dev/test-default0-0> to node <ip-10-4-16-2.us-west-2.compute.internal>
I0612 15:30:41.949067       1 statement.go:284] After allocated Task <mltraining-dev/test-default0-0> to Node <ip-10-4-16-2.us-west-2.compute.internal>: idle <cpu 125680.00, memory 1169155264512.00, hugepages-2Mi 44293947392000.00, attachable-volumes-csi-local.csi.openebs.io 2147483643.00, nvidia.com/gpu 0.00, hugepages-1Gi 0.00, vpc.amazonaws.com/efa 0.00, attachable-volumes-csi-ebs.csi.aws.com 127.00, attachable-volumes-csi-fsx.csi.aws.com 2147483647.00, ephemeral-storage 493695146804000.00, pods 181.00>, used <cpu 65770.00, memory 931342057472.00, pods 17.00, nvidia.com/gpu 8000.00, vpc.amazonaws.com/efa 16000.00, attachable-volumes-csi-local.csi.openebs.io 4.00>, releasing <cpu 0.00, memory 0.00>
...
I0612 15:30:41.949156       1 allocate.go:443] "Job ready, return statement" jobName="mltraining-dev/test-a895933d-d863-4dfe-90c9-4dfa10d056d7"
...
E0612 15:30:44.985646       1 cache.go:1292] execute preBind failed: binding volumes: provisioning failed for PVC "test-default0-0-data", resync the task
...
I0612 15:30:52.095041       1 session_plugins.go:557] JobOrderFn:using creationTimestamp to order job priority test-a895933d-d863-4dfe-90c9-4dfa10d056d7: 2025-05-14 16:49:06 +0000 UTC -- notebook-testing-407ba657-aa1b-4e42-b963-ca6f11a67a0a: 2025-05-05 17:48:23 +0000 UTC
...
I0612 15:30:52.096542       1 allocate.go:182] Try to allocate resource to 1 tasks of Job <mltraining-dev/test-a895933d-d863-4dfe-90c9-4dfa10d056d7>
I0612 15:30:52.096554       1 allocate.go:376] There are <218> nodes for Job <mltraining-dev/test-a895933d-d863-4dfe-90c9-4dfa10d056d7>
I0612 15:30:52.098115       1 allocate.go:551] Binding Task <mltraining-dev/test-default0-0> to node <ip-10-4-16-2.us-west-2.compute.internal>
I0612 15:30:52.098136       1 statement.go:284] After allocated Task <mltraining-dev/test-default0-0> to Node <ip-10-4-16-2.us-west-2.compute.internal>: idle <cpu 125680.00, memory 1169155264512.00, nvidia.com/gpu 0.00, attachable-volumes-csi-ebs.csi.aws.com 127.00, attachable-volumes-csi-fsx.csi.aws.com 2147483647.00, ephemeral-storage 493695146804000.00, hugepages-2Mi 44293947392000.00, pods 181.00, vpc.amazonaws.com/efa 0.00, attachable-volumes-csi-local.csi.openebs.io 2147483643.00, hugepages-1Gi 0.00>, used <cpu 65770.00, memory 931342057472.00, pods 17.00, vpc.amazonaws.com/efa 16000.00, attachable-volumes-csi-local.csi.openebs.io 4.00, nvidia.com/gpu 8000.00>, releasing <cpu 0.00, memory 0.00>
...
I0612 15:30:52.098242       1 allocate.go:443] "Job ready, return statement" jobName="mltraining-dev/test-a895933d-d863-4dfe-90c9-4dfa10d056d7"
...
E0612 15:30:55.184803       1 cache.go:1292] execute preBind failed: binding volumes: provisioning failed for PVC "test-default0-0-data", resync the task

... repeats all over again ...
  2. After the scheduler stopped setting the annotation:
I0612 15:31:12.507372       1 session_plugins.go:557] JobOrderFn:using creationTimestamp to order job priority test-a895933d-d863-4dfe-90c9-4dfa10d056d7: 2025-05-14 16:49:06 +0000 UTC -- notebook-testing-407ba657-aa1b-4e42-b963-ca6f11a67a0a: 2025-05-05 17:48:23 +0000 UTC
...
I0612 15:31:12.508954       1 allocate.go:182] Try to allocate resource to 1 tasks of Job <mltraining-dev/test-a895933d-d863-4dfe-90c9-4dfa10d056d7>
I0612 15:31:12.508966       1 allocate.go:376] There are <218> nodes for Job <mltraining-dev/test-a895933d-d863-4dfe-90c9-4dfa10d056d7>
...
I0612 15:31:12.519554       1 preempt.go:260] Considering Task <mltraining-dev/test-default0-0> on Node <ip-10-4-16-2.us-west-2.compute.internal>.
...
I0612 15:31:12.519586       1 preempt.go:275] No validated victims on Node <ip-10-4-16-2.us-west-2.compute.internal>: not enough resources: requested <cpu 64000.00, memory 927712935936.00, vpc.amazonaws.com/efa 16000.00, pods 1.00, attachable-volumes-csi-local.csi.openebs.io 4.00, nvidia.com/gpu 8000.00>, but future idle <cpu 101680.00, memory 1130500558848.00, nvidia.com/gpu 0.00, vpc.amazonaws.com/efa 0.00, ephemeral-storage 493695146804000.00, pods 181.00, attachable-volumes-csi-ebs.csi.aws.com 127.00, hugepages-1Gi 0.00, attachable-volumes-csi-local.csi.openebs.io 2147483646.00, attachable-volumes-csi-fsx.csi.aws.com 2147483647.00, hugepages-2Mi 44293947392000.00>
...
I0612 15:31:12.521859       1 preempt.go:260] Considering Task <mltraining-dev/test-default0-0> on Node <ip-10-4-16-2.us-west-2.compute.internal>.
...
I0612 15:31:12.521897       1 preempt.go:275] No validated victims on Node <ip-10-4-16-2.us-west-2.compute.internal>: not enough resources: requested <cpu 64000.00, memory 927712935936.00, attachable-volumes-csi-local.csi.openebs.io 4.00, nvidia.com/gpu 8000.00, vpc.amazonaws.com/efa 16000.00, pods 1.00>, but future idle <cpu 101680.00, memory 1130500558848.00, nvidia.com/gpu 0.00, hugepages-2Mi 44293947392000.00, attachable-volumes-csi-ebs.csi.aws.com 127.00, vpc.amazonaws.com/efa 0.00, attachable-volumes-csi-fsx.csi.aws.com 2147483647.00, hugepages-1Gi 0.00, attachable-volumes-csi-local.csi.openebs.io 2147483646.00, ephemeral-storage 493695146804000.00, pods 181.00>
...
I0612 15:31:12.524059       1 preempt.go:260] Considering Task <mltraining-dev/test-default0-0> on Node <ip-10-4-16-2.us-west-2.compute.internal>.
...
I0612 15:31:12.524094       1 preempt.go:275] No validated victims on Node <ip-10-4-16-2.us-west-2.compute.internal>: not enough resources: requested <cpu 64000.00, memory 927712935936.00, pods 1.00, attachable-volumes-csi-local.csi.openebs.io 4.00, nvidia.com/gpu 8000.00, vpc.amazonaws.com/efa 16000.00>, but future idle <cpu 101680.00, memory 1130500558848.00, ephemeral-storage 493695146804000.00, nvidia.com/gpu 0.00, pods 181.00, attachable-volumes-csi-ebs.csi.aws.com 127.00, attachable-volumes-csi-fsx.csi.aws.com 2147483647.00, hugepages-2Mi 44293947392000.00, attachable-volumes-csi-local.csi.openebs.io 2147483646.00, hugepages-1Gi 0.00, vpc.amazonaws.com/efa 0.00>
...
I0612 15:31:12.526147       1 preempt.go:260] Considering Task <mltraining-dev/test-default0-0> on Node <ip-10-4-16-2.us-west-2.compute.internal>.
...
I0612 15:31:12.526178       1 preempt.go:275] No validated victims on Node <ip-10-4-16-2.us-west-2.compute.internal>: not enough resources: requested <cpu 64000.00, memory 927712935936.00, nvidia.com/gpu 8000.00, vpc.amazonaws.com/efa 16000.00, pods 1.00, attachable-volumes-csi-local.csi.openebs.io 4.00>, but future idle <cpu 101680.00, memory 1130500558848.00, pods 181.00, attachable-volumes-csi-ebs.csi.aws.com 127.00, hugepages-2Mi 44293947392000.00, attachable-volumes-csi-local.csi.openebs.io 2147483646.00, ephemeral-storage 493695146804000.00, nvidia.com/gpu 0.00, vpc.amazonaws.com/efa 0.00, attachable-volumes-csi-fsx.csi.aws.com 2147483647.00, hugepages-1Gi 0.00>

... repeats all over again ...

What's really strange is that ip-10-4-16-2.us-west-2.compute.internal is totally free - nothing is running on that node - but the scheduler reports nvidia.com/gpu available: 0.00.

rooty0 avatar Jun 14 '25 10:06 rooty0

If the pv controller fails to provision the volume, the AnnSelectedNode annotation will be deleted.

So, correct me if I'm wrong - but the reason I believe the annotation is being removed by the CSI driver and not the core PV controller (based on the code you linked) is what I see in the audit logs: [image]

These are all the logs I'm getting for the PVC object that's currently stuck in Pending. From what I can tell, the annotation is initially set by serviceaccount:volcano-system:volcano-scheduler, and then it's removed by system:serviceaccount:openebs:openebs-lvm-controller-sa.

I haven't yet tracked down exactly where this happens in the OpenEBS CSI provisioner codebase, but the Helm deployment for OpenEBS CSI installs the following RBAC (openebs-lvm-provisioner-role):

  • apiGroups: [""] resources: ["persistentvolumeclaims"] verbs: ["get", "list", "watch", "update"]
  • apiGroups: [""] resources: ["persistentvolumeclaims/status"] verbs: ["update", "patch"] So this kind of supports my idea that it's performing an update (not patch) on the PVC, and in doing so, it seems like it ends up overwriting the whole part of the object - which removes the volume.kubernetes.io/selected-node annotation.

The OpenEBS CSI provisioner might just need to update other fields of the PVC, but it's not responsible for removing the AnnSelectedNode annotation; actually the external-provisioner is responsible for removing the annotation when provisioning fails: https://github.com/kubernetes-csi/external-provisioner/blob/1a7e9381439295969ad0336f1e21791f7dc3abe8/vendor/sigs.k8s.io/sig-storage-lib-external-provisioner/v11/controller/controller.go#L1379-L1406
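
You should also be able to see this in the csi-provisioner sidecar's logs, something along these lines (the deployment and container names are guesses based on the openebs-lvm-controller-sa service account you mentioned; adjust to your install):

$ kubectl logs -n openebs deploy/openebs-lvm-controller -c csi-provisioner --tail=100 \
    | grep -iE 'selected-node|provisioning failed'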

When the Volcano scheduler stops updating the volume.kubernetes.io/selected-node annotation, I can confidently say that none of the following conditions have changed:

  • The Volcano Job is still Pending
  • It only has a single Pod, and that Pod is still Pending
  • One of the Pod’s PVCs is still Pending (out of 4 total - the other 3 are already Bound)

The problematic PVC is using a storage class backed by the local.csi.openebs.io provisioner with WaitForFirstConsumer binding mode.

Here's an event graph that was showing the annotation being repeatedly set and removed on the PVC. As of now, nothing has changed in terms of the pending resources, but the annotation activity has stopped: [image] The following error has stopped appearing as well:

E0612 15:31:05.387757 1 cache.go:1292] execute preBind failed: binding volumes: provisioning failed for PVC "test-default0-0-data", resync the task

From the picture, did the Volcano scheduler stop updating the annotations after about a day (from 6/11 until 6/12)?

Just to clarify - I do understand why the PVC is stuck in Pending. What I'm trying to figure out is why the Volcano scheduler stops setting the volume.kubernetes.io/selected-node annotation and stops reporting the error above.

I'm not completely sure, but if I remember correctly, we've been seeing this behavior since Volcano 1.10, maybe even starting with 1.9.

Actually, we refactored the volume-binding-related code in v1.12; I don't know whether you could also hit this in earlier versions.

JesseStutler avatar Jun 16 '25 12:06 JesseStutler

https://github.com/volcano-sh/volcano/issues/4369#issuecomment-2972612067 - strange behavior... How many resources did your Pod request? I want to confirm whether the scheduler failed to clean up the pod from the node in its cache after the resync. @rooty0

JesseStutler avatar Jun 16 '25 12:06 JesseStutler

actually the external-provisioner is responsible for removing the annotation when provisioning fails

Ah, that makes sense - thanks! I learned today that the external-provisioner isn't actually part of the OpenEBS codebase. It's just a sidecar container that vendors reuse to point to their actual CSI Controller Plugin. Pretty cool :)
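
(A quick way to see the sidecar arrangement is to list each controller pod and its containers; the namespace is the one from the audit logs above:)

$ kubectl get pods -n openebs \
    -o jsonpath='{range .items[*]}{.metadata.name}{": "}{range .spec.containers[*]}{.name}{" "}{end}{"\n"}{end}'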

From the picture, did the Volcano scheduler stop updating the annotations after about a day (from 6/11 until 6/12)?

Yeah, it stopped updating the annotations after about a day. The pod from the job has actually been stuck in Pending for 42 days now. Here's the same graph but for 45 days: [image]

How many resources did your Pod request? I want to confirm whether the scheduler failed to clean up the pod from the node in its cache after the resync

Here's the live pod manifest:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduling.k8s.io/group-name: test-a895933d-d863-4dfe-90c9-4dfa10d056d7
    volcano.sh/job-name: test
    volcano.sh/job-retry-count: "0"
    volcano.sh/job-version: "0"
    volcano.sh/queue-name: research
    volcano.sh/task-index: "0"
    volcano.sh/task-spec: default0
    volcano.sh/template-uid: test-default0
  creationTimestamp: "2025-06-03T17:27:29Z"
  labels:
    app: test
    component: test
    volcano.sh/job-name: test
    volcano.sh/job-namespace: mltraining-dev
    volcano.sh/queue-name: research
    volcano.sh/task-index: "0"
    volcano.sh/task-spec: default0
  name: test-default0-0
  namespace: mltraining-dev
  ownerReferences:
  - apiVersion: batch.volcano.sh/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: test
    uid: a895933d-d863-4dfe-90c9-4dfa10d056d7
  resourceVersion: "1036111339"
  uid: cd386367-87da-49ce-850b-21629db47d06
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: aaa.com/c-bucket
            operator: In
            values:
            - stb
  containers:
  - command:
    - /bin/bash
    - -c
    - |
      set -euxo pipefail

      sleep infinity
    env:
    - name: MY_POD_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    image: ...omitted...
    imagePullPolicy: IfNotPresent
    name: main
    resources:
      limits:
        cpu: "64"
        memory: 864Gi
        nvidia.com/gpu: "8"
        vpc.amazonaws.com/efa: "16"
      requests:
        cpu: "64"
        memory: 864Gi
        nvidia.com/gpu: "8"
        vpc.amazonaws.com/efa: "16"
    securityContext:
      allowPrivilegeEscalation: true
      capabilities:
        add:
        - NET_RAW
      privileged: true
      readOnlyRootFilesystem: false
      runAsGroup: 1000
      runAsNonRoot: false
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /dev/shm
      name: shm
      subPath: shm
    - mountPath: /tmp
      name: local
      subPath: tmp
    - mountPath: /home/teex/.cache
      name: local
      subPath: home/teex/.cache
    - mountPath: /home/teex/.triton
      name: local
      subPath: home/teex/.triton
    - mountPath: /venv
      name: local
      subPath: venv
    - mountPath: /data
      name: nvme
      subPath: data
    - mountPath: /data-fast
      name: data
    - mountPath: /code
      name: fsx-research
    - mountPath: /checkpoint
      name: fsx-research-checkpoints
    - mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
      name: aws-iam-token
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-n6b7d
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: volcano
  securityContext:
    fsGroup: 1000
    fsGroupChangePolicy: OnRootMismatch
  serviceAccount: mltraining-dev-sa
  serviceAccountName: mltraining-dev-sa
  subdomain: test
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: workergroup
    value: s-p-h200
  - effect: NoSchedule
    key: aaa.com/c-bucket
    value: stb
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
  - effect: NoSchedule
    key: vpc.amazonaws.com/efa
    operator: Exists
  volumes:
  - name: aws-iam-token
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          audience: sts.amazonaws.com
          expirationSeconds: 86400
          path: token
  - ephemeral:
      volumeClaimTemplate:
        metadata:
          creationTimestamp: null
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 16Ti
          storageClassName: lvm-nvme
          volumeMode: Filesystem
    name: data
  - ephemeral:
      volumeClaimTemplate:
        metadata:
          creationTimestamp: null
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 1Ti
          storageClassName: lvm-nvme
          volumeMode: Filesystem
    name: local
  - emptyDir:
      medium: Memory
      sizeLimit: 16Gi
    name: shm
  - ephemeral:
      volumeClaimTemplate:
        metadata:
          creationTimestamp: null
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 2Ti
          storageClassName: lvm-nvme
          volumeMode: Filesystem
    name: nvme
  - name: fsx-research
    persistentVolumeClaim:
      claimName: fsx-research
  - ephemeral:
      volumeClaimTemplate:
        metadata:
          creationTimestamp: null
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 20Ti
          storageClassName: lvm-nvme
          volumeMode: Filesystem
    name: data-nvme
  - name: fsx-research-checkpoints
    persistentVolumeClaim:
      claimName: fsx-research-checkpoints
  - name: kube-api-access-n6b7d
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2025-06-09T17:49:47Z"
    message: '0/208 nodes are unavailable: 10 Insufficient memory, 12 node(s) had
      volume node affinity conflict, 168 Insufficient cpu, 2 node(s) didn''t match
      Pod''s node affinity/selector, 7 Insufficient vpc.amazonaws.com/efa, 9 Insufficient
      nvidia.com/gpu.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Guaranteed

rooty0 avatar Jun 17 '25 03:06 rooty0

Thanks, I'll try to reproduce the prebind failure scenario and test it. I can see that your pod also requests CPU and memory; after the prebind failure, did the scheduler give the CPU and memory back to the node? I also want to confirm how many resources the ip-10-4-16-2.us-west-2.compute.internal node has.

BTW, have you fixed the LVM bind failure and schedule pods normally now? If there are still some errors, we can have a meeting to discuss :)

JesseStutler avatar Jun 17 '25 11:06 JesseStutler

Here's a current snapshot of the node's resources:

$ k describe node ip-10-4-16-2.us-west-2.compute.internal
............
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource               Requests     Limits
  --------               --------     ------
  cpu                    1770m (0%)   10770m (5%)
  memory                 3461Mi (0%)  10335Mi (0%)
  ephemeral-storage      0 (0%)       0 (0%)
  hugepages-1Gi          0 (0%)       0 (0%)
  hugepages-2Mi          0 (0%)       0 (0%)
  nvidia.com/gpu         0            0
  vpc.amazonaws.com/efa  0            0
Events:                  <none>
# k get node ip-10-4-16-2.us-west-2.compute.internal -oyaml
#....
  allocatable:
    cpu: 191450m
    ephemeral-storage: "493695146804"
    hugepages-1Gi: "0"
    hugepages-2Mi: 42242Mi
    memory: 2051266916Ki
    nvidia.com/gpu: "8"
    pods: "198"
    vpc.amazonaws.com/efa: "16"
  capacity:
    cpu: "192"
    ephemeral-storage: 536858604Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: 42242Mi
    memory: 2097116516Ki
    nvidia.com/gpu: "8"
    pods: "198"
    vpc.amazonaws.com/efa: "16"

I'm not quite sure how to get Volcano's "assumed pods" - the ones it has tentatively scheduled - for a given node. If there's an easy way to do that, I'd really appreciate any pointers.

BTW, have you fixed the LVM bind failure and schedule pods normally now?

I haven't made any changes to that part yet since we're still troubleshooting. I'm trying to leave everything related to this workload untouched so the environment stays in its original state - just to make sure we're not overlooking anything that might help explain what's going on.

The pending PVC is easy to fix, but I'm really focused on figuring out the root cause of this side issue.

we can have a meeting to discuss

I'd be happy to connect and troubleshoot the volcano scheduler! Please feel free to send me an email to github[at]rooty.name with your availability, and I'll do my best to accommodate your schedule.

rooty0 avatar Jun 17 '25 19:06 rooty0

Hi @rooty0, I have sent you an email; you can reply to me at any time or just ping me on Slack :)

I simulated the process of the CSI provisioner constantly removing the annotation and the scheduler constantly adding it back on my machine (using a KIND cluster), but I still can't reproduce your situation (I don't know if it's because I only simulated it for a short duration). I have also verified that after a PreBind failure, resyncTask gives the resources back to the node in the scheduler cache, so I don't know why you found that there are no pods on the ip-10-4-16-2.us-west-2.compute.internal node but the task still can't be scheduled. (As you can see, ip-10-4-16-2.us-west-2.compute.internal has 192 CPUs, but in your logs the future idle shows only about 101 CPUs; why is so much CPU missing? Is it because a lot of resources were allocated during the allocate action?) We need more details; we can clarify them in a meeting or through Slack.

JesseStutler avatar Jun 18 '25 08:06 JesseStutler

I have run the mock process for a long time, but I still can't reproduce this situation (the Volcano scheduler still keeps adding the annotation back).

I simulated the process of the CSI provisioner constantly removing the annotation and the scheduler constantly adding it back on my machine (using a KIND cluster)

JesseStutler avatar Jun 19 '25 01:06 JesseStutler