Volcano scheduler repeatedly adds and removes volume.kubernetes.io/selected-node annotation
Hi,
I'm running into a strange issue with the Volcano scheduler 1.12.1.
I have a PVC that's been stuck in Pending while trying to provision a local LVM volume. That part is expected and I've identified the root cause.
However, what's unexpected is that the scheduler appears to be stuck in a loop: it keeps adding the volume.kubernetes.io/selected-node annotation with the correct node name, waits about 2 seconds, removes the annotation, waits around 8 seconds, adds the annotation again, and then repeats the cycle.
Is this behavior expected?
Thanks
@rooty0 Hi, this is expected behavior. Before binding the pod to the node, the scheduler first checks whether the PV has been created and whether the PVC has been successfully bound to the PV; the provisioning and binding themselves are handled by the pv-controller. Since the PVC binding failed in your scenario, the pv-controller removes the volume.kubernetes.io/selected-node annotation. Once the scheduler sees that the volume is not bound, the pod effectively fails to be scheduled, and the scheduler tries to schedule it again. Because the scheduler updates volume.kubernetes.io/selected-node before checking whether the PVC is bound, every failed binding produces one removal and one re-add of the annotation. Only once the pv.kubernetes.io/bind-completed annotation is set does it indicate that the PVC has been successfully bound.
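If you want to watch this cycle directly, something along these lines should work (the PVC name and namespace below are placeholders):
# prints the PVC on every change; volume.kubernetes.io/selected-node comes and goes with each retry,
# while pv.kubernetes.io/bind-completed only appears after a successful bind
$ kubectl get pvc <pvc-name> -n <namespace> -o yaml --watch | grep -E 'selected-node|bind-completed'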
Thanks so much for the detailed explanation, @JesseStutler !
Just to clarify - the PVC is currently in the Pending state. So you're saying it's the Persistent Volume Controller that's actually removing the annotation in this case? That's really helpful to know, thank you!
I had assumed it was the scheduler removing the volume.kubernetes.io/selected-node annotation about 1-2 seconds after setting it, which technically could be confusing for the CSI driver
I just checked the logs and it turns out it's actually the CSI controller that deletes the annotation - not the PV controller. You were probably referring to the CSI controller earlier, but I missed that and thought you meant the core ControllerManager with the PV controller.
Quick follow-up: is it expected for the Volcano scheduler to eventually stop setting the annotation? Starting yesterday, I saw it repeatedly re-adding the annotation after it got removed, but today there haven't been any changes. The Volcano job is still pending, the PVC is still pending, but the scheduler no longer sets the annotation. Just wondering if that's normal behavior.
You can see here: https://github.com/kubernetes/kubernetes/blob/ad4cc125c92246b756186edcded475873fca796f/pkg/controller/volume/persistentvolume/pv_controller.go#L1865-L1885. If the pv controller fails to provision the volume, the Annotation of AnnSelectedNode will be deleted. You can check whether the Pod has been scheduled; if the Pod has been scheduled, the PVC annotation will not be removed. @rooty0
Regarding why it eventually stops setting the annotation: it may be that the pod has been scheduled. Which binding mode did you use, WaitForFirstConsumer or Immediate? I also want to know whether the pod is still pending but has already been scheduled.
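A quick way to check that (pod name and namespace are placeholders): spec.nodeName stays empty and the PodScheduled condition stays False while the pod is unscheduled.
$ kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.nodeName}{"\n"}'
$ kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.conditions[?(@.type=="PodScheduled")].status}{"\n"}'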
If the pv controller fails to provision the volume, the Annotation of AnnSelectedNode will be deleted.
So, correct me if I'm wrong - but the reason I believe the annotation is being removed by the CSI driver and not the core PV controller (based on the code you linked) is because of what I see in the audit logs:
These are all the logs I'm getting for the PVC object that's currently stuck in Pending. From what I can tell, the annotation is initially set by serviceaccount:volcano-system:volcano-scheduler, and then it's removed by system:serviceaccount:openebs:openebs-lvm-controller-sa.
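(For context, these entries come from the API server audit log. Assuming your audit policy writes JSON lines, a filter along these lines pulls out every request that touched the PVC - the log path and PVC name below are placeholders:)
# show who issued each create/update/patch against the PVC, and when
$ jq -c 'select(.objectRef.resource == "persistentvolumeclaims" and .objectRef.name == "<pvc-name>")
        | {time: .requestReceivedTimestamp, verb: .verb, user: .user.username}' /path/to/audit.log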
I haven't yet tracked down exactly where this happens in the OpenEBS CSI provisioner codebase, but the Helm deployment for OpenEBS CSI installs the following RBAC (openebs-lvm-provisioner-role):
- apiGroups: [""]
resources: ["persistentvolumeclaims"]
verbs: ["get", "list", "watch", "update"]
- apiGroups: [""]
resources: ["persistentvolumeclaims/status"]
verbs: ["update", "patch"]
So this kind of supports my idea that it's performing an update (not a patch) on the PVC, and in doing so it seems to overwrite part of the object - which removes the volume.kubernetes.io/selected-node annotation.
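(One more thing that can help pin down who last touched the annotation: the PVC's managedFields record which field manager currently owns metadata.annotations and when it last updated them - placeholder names below:)
# look at metadata.managedFields in the output for the manager names and timestamps
$ kubectl get pvc <pvc-name> -n <namespace> --show-managed-fields -o yaml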
When the Volcano scheduler stops updating the volume.kubernetes.io/selected-node annotation, I can confidently say that none of the following conditions have changed:
- The Volcano Job is still Pending
- It only has a single Pod, and that Pod is still Pending
- One of the Pod's PVCs is still Pending (out of 4 total - the other 3 are already Bound)
The problematic PVC is using a storage class backed by the local.csi.openebs.io provisioner with WaitForFirstConsumer binding mode.
Here's an event graph that was showing the annotation being repeatedly set and removed on the PVC. As of now, nothing has changed in terms of the pending resources - but the annotation activity has stopped:
As well as the following error:
E0612 15:31:05.387757 1 cache.go:1292] execute preBind failed: binding volumes: provisioning failed for PVC "test-default0-0-data", resync the task
Just to clarify - I do understand why the PVC is stuck in Pending. What I'm trying to figure out is why the Volcano scheduler stops setting the volume.kubernetes.io/selected-node annotation and stops reporting the error above.
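(For reference, the error above can be tracked in the scheduler logs like this - the deployment name below is what a default install uses, so adjust it if yours differs:)
# check whether the scheduler is still hitting the preBind failure
$ kubectl -n volcano-system logs deploy/volcano-scheduler --since=24h | grep 'execute preBind failed'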
I'm not completely sure, but if I remember correctly, we've been seeing this behavior since Volcano 1.10, maybe even starting with 1.9.
The only difference I’ve noticed in the volcano scheduler logs is:
- Before the scheduler stopped setting the annotation:
I0612 15:30:41.946518 1 session_plugins.go:557] JobOrderFn:using creationTimestamp to order job priority notebook-testing-407ba657-aa1b-4e42-b963-ca6f11a67a0a: 2025-05-05 17:48:23 +0000 UTC -- test-a895933d-d863-4dfe-90c9-4dfa10d056d7: 2025-05-14 16:49:06 +0000 UTC
...
I0612 15:30:41.947869 1 allocate.go:182] Try to allocate resource to 1 tasks of Job <mltraining-dev/test-a895933d-d863-4dfe-90c9-4dfa10d056d7>
"I0612 15:30:41.947880 1 allocate.go:376] There are <216> nodes for Job <mltraining-dev/test-a895933d-d863-4dfe-90c9-4dfa10d056d7>
I0612 15:30:41.949051 1 allocate.go:551] Binding Task <mltraining-dev/test-default0-0> to node <ip-10-4-16-2.us-west-2.compute.internal>
I0612 15:30:41.949067 1 statement.go:284] After allocated Task <mltraining-dev/test-default0-0> to Node <ip-10-4-16-2.us-west-2.compute.internal>: idle <cpu 125680.00, memory 1169155264512.00, hugepages-2Mi 44293947392000.00, attachable-volumes-csi-local.csi.openebs.io 2147483643.00, nvidia.com/gpu 0.00, hugepages-1Gi 0.00, vpc.amazonaws.com/efa 0.00, attachable-volumes-csi-ebs.csi.aws.com 127.00, attachable-volumes-csi-fsx.csi.aws.com 2147483647.00, ephemeral-storage 493695146804000.00, pods 181.00>, used <cpu 65770.00, memory 931342057472.00, pods 17.00, nvidia.com/gpu 8000.00, vpc.amazonaws.com/efa 16000.00, attachable-volumes-csi-local.csi.openebs.io 4.00>, releasing <cpu 0.00, memory 0.00>
...
I0612 15:30:41.949156 1 allocate.go:443] "Job ready, return statement" jobName="mltraining-dev/test-a895933d-d863-4dfe-90c9-4dfa10d056d7"
...
E0612 15:30:44.985646 1 cache.go:1292] execute preBind failed: binding volumes: provisioning failed for PVC "test-default0-0-data", resync the task
...
I0612 15:30:52.095041 1 session_plugins.go:557] JobOrderFn:using creationTimestamp to order job priority test-a895933d-d863-4dfe-90c9-4dfa10d056d7: 2025-05-14 16:49:06 +0000 UTC -- notebook-testing-407ba657-aa1b-4e42-b963-ca6f11a67a0a: 2025-05-05 17:48:23 +0000 UTC
...
I0612 15:30:52.096542 1 allocate.go:182] Try to allocate resource to 1 tasks of Job <mltraining-dev/test-a895933d-d863-4dfe-90c9-4dfa10d056d7>
I0612 15:30:52.096554 1 allocate.go:376] There are <218> nodes for Job <mltraining-dev/test-a895933d-d863-4dfe-90c9-4dfa10d056d7>
I0612 15:30:52.098115 1 allocate.go:551] Binding Task <mltraining-dev/test-default0-0> to node <ip-10-4-16-2.us-west-2.compute.internal>
I0612 15:30:52.098136 1 statement.go:284] After allocated Task <mltraining-dev/test-default0-0> to Node <ip-10-4-16-2.us-west-2.compute.internal>: idle <cpu 125680.00, memory 1169155264512.00, nvidia.com/gpu 0.00, attachable-volumes-csi-ebs.csi.aws.com 127.00, attachable-volumes-csi-fsx.csi.aws.com 2147483647.00, ephemeral-storage 493695146804000.00, hugepages-2Mi 44293947392000.00, pods 181.00, vpc.amazonaws.com/efa 0.00, attachable-volumes-csi-local.csi.openebs.io 2147483643.00, hugepages-1Gi 0.00>, used <cpu 65770.00, memory 931342057472.00, pods 17.00, vpc.amazonaws.com/efa 16000.00, attachable-volumes-csi-local.csi.openebs.io 4.00, nvidia.com/gpu 8000.00>, releasing <cpu 0.00, memory 0.00>
...
I0612 15:30:52.098242 1 allocate.go:443] "Job ready, return statement" jobName="mltraining-dev/test-a895933d-d863-4dfe-90c9-4dfa10d056d7"
...
E0612 15:30:55.184803 1 cache.go:1292] execute preBind failed: binding volumes: provisioning failed for PVC "test-default0-0-data", resync the task
... repeats all over again ...
- After the scheduler stopped setting the annotation:
I0612 15:31:12.507372 1 session_plugins.go:557] JobOrderFn:using creationTimestamp to order job priority test-a895933d-d863-4dfe-90c9-4dfa10d056d7: 2025-05-14 16:49:06 +0000 UTC -- notebook-testing-407ba657-aa1b-4e42-b963-ca6f11a67a0a: 2025-05-05 17:48:23 +0000 UTC
...
I0612 15:31:12.508954 1 allocate.go:182] Try to allocate resource to 1 tasks of Job <mltraining-dev/test-a895933d-d863-4dfe-90c9-4dfa10d056d7>
I0612 15:31:12.508966 1 allocate.go:376] There are <218> nodes for Job <mltraining-dev/test-a895933d-d863-4dfe-90c9-4dfa10d056d7>
...
I0612 15:31:12.519554 1 preempt.go:260] Considering Task <mltraining-dev/test-default0-0> on Node <ip-10-4-16-2.us-west-2.compute.internal>.
...
I0612 15:31:12.519586 1 preempt.go:275] No validated victims on Node <ip-10-4-16-2.us-west-2.compute.internal>: not enough resources: requested <cpu 64000.00, memory 927712935936.00, vpc.amazonaws.com/efa 16000.00, pods 1.00, attachable-volumes-csi-local.csi.openebs.io 4.00, nvidia.com/gpu 8000.00>, but future idle <cpu 101680.00, memory 1130500558848.00, nvidia.com/gpu 0.00, vpc.amazonaws.com/efa 0.00, ephemeral-storage 493695146804000.00, pods 181.00, attachable-volumes-csi-ebs.csi.aws.com 127.00, hugepages-1Gi 0.00, attachable-volumes-csi-local.csi.openebs.io 2147483646.00, attachable-volumes-csi-fsx.csi.aws.com 2147483647.00, hugepages-2Mi 44293947392000.00>
...
I0612 15:31:12.521859 1 preempt.go:260] Considering Task <mltraining-dev/test-default0-0> on Node <ip-10-4-16-2.us-west-2.compute.internal>.
...
I0612 15:31:12.521897 1 preempt.go:275] No validated victims on Node <ip-10-4-16-2.us-west-2.compute.internal>: not enough resources: requested <cpu 64000.00, memory 927712935936.00, attachable-volumes-csi-local.csi.openebs.io 4.00, nvidia.com/gpu 8000.00, vpc.amazonaws.com/efa 16000.00, pods 1.00>, but future idle <cpu 101680.00, memory 1130500558848.00, nvidia.com/gpu 0.00, hugepages-2Mi 44293947392000.00, attachable-volumes-csi-ebs.csi.aws.com 127.00, vpc.amazonaws.com/efa 0.00, attachable-volumes-csi-fsx.csi.aws.com 2147483647.00, hugepages-1Gi 0.00, attachable-volumes-csi-local.csi.openebs.io 2147483646.00, ephemeral-storage 493695146804000.00, pods 181.00>
...
I0612 15:31:12.524059 1 preempt.go:260] Considering Task <mltraining-dev/test-default0-0> on Node <ip-10-4-16-2.us-west-2.compute.internal>.
...
I0612 15:31:12.524094 1 preempt.go:275] No validated victims on Node <ip-10-4-16-2.us-west-2.compute.internal>: not enough resources: requested <cpu 64000.00, memory 927712935936.00, pods 1.00, attachable-volumes-csi-local.csi.openebs.io 4.00, nvidia.com/gpu 8000.00, vpc.amazonaws.com/efa 16000.00>, but future idle <cpu 101680.00, memory 1130500558848.00, ephemeral-storage 493695146804000.00, nvidia.com/gpu 0.00, pods 181.00, attachable-volumes-csi-ebs.csi.aws.com 127.00, attachable-volumes-csi-fsx.csi.aws.com 2147483647.00, hugepages-2Mi 44293947392000.00, attachable-volumes-csi-local.csi.openebs.io 2147483646.00, hugepages-1Gi 0.00, vpc.amazonaws.com/efa 0.00>
...
I0612 15:31:12.526147 1 preempt.go:260] Considering Task <mltraining-dev/test-default0-0> on Node <ip-10-4-16-2.us-west-2.compute.internal>.
...
I0612 15:31:12.526178 1 preempt.go:275] No validated victims on Node <ip-10-4-16-2.us-west-2.compute.internal>: not enough resources: requested <cpu 64000.00, memory 927712935936.00, nvidia.com/gpu 8000.00, vpc.amazonaws.com/efa 16000.00, pods 1.00, attachable-volumes-csi-local.csi.openebs.io 4.00>, but future idle <cpu 101680.00, memory 1130500558848.00, pods 181.00, attachable-volumes-csi-ebs.csi.aws.com 127.00, hugepages-2Mi 44293947392000.00, attachable-volumes-csi-local.csi.openebs.io 2147483646.00, ephemeral-storage 493695146804000.00, nvidia.com/gpu 0.00, vpc.amazonaws.com/efa 0.00, attachable-volumes-csi-fsx.csi.aws.com 2147483647.00, hugepages-1Gi 0.00>
... repeats all over again ...
What's really strange is that ip-10-4-16-2.us-west-2.compute.internal is totally free - nothing is running on that node - but the scheduler reports nvidia.com/gpu available: 0.00.
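(For completeness, this is the check for actual pods on that node - it only reflects what the API server knows, not Volcano's in-memory assumed pods:)
$ kubectl get pods -A --field-selector spec.nodeName=ip-10-4-16-2.us-west-2.compute.internal -o wide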
So, correct me if I'm wrong - but the reason I believe the annotation is being removed by the CSI driver and not the core PV controller (based on the code you linked) is because of what I see in the audit logs: the annotation is initially set by serviceaccount:volcano-system:volcano-scheduler, and then it's removed by system:serviceaccount:openebs:openebs-lvm-controller-sa ... So this kind of supports my idea that it's performing an update (not a patch) on the PVC, and in doing so it seems to overwrite part of the object - which removes the volume.kubernetes.io/selected-node annotation.
The OpenEBS CSI provisioner might just need to update other fields of the PVC, but it's not responsible for removing the AnnSelectedNode annotation. It's actually the external-provisioner that removes the annotation when provisioning hits errors: https://github.com/kubernetes-csi/external-provisioner/blob/1a7e9381439295969ad0336f1e21791f7dc3abe8/vendor/sigs.k8s.io/sig-storage-lib-external-provisioner/v11/controller/controller.go#L1379-L1406
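The provisioning failures (and the retries that follow) also show up as events on the PVC, for example (placeholder names):
$ kubectl get events -n <namespace> --field-selector involvedObject.name=<pvc-name> --sort-by=.lastTimestamp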
When the Volcano scheduler stops updating the volume.kubernetes.io/selected-node annotation, I can confidently say that none of the following conditions have changed ... Here's an event graph that was showing the annotation being repeatedly set and removed on the PVC. As of now, nothing has changed in terms of the pending resources - but the annotation activity has stopped.
From the pic, did the Volcano scheduler stop updating annotations after a day? (From 6.11 until 6.12)
Just to clarify - I do understand why the PVC is stuck in Pending. What I'm trying to figure out is why the Volcano scheduler stops setting the volume.kubernetes.io/selected-node annotation and stops reporting the error above. I'm not completely sure, but if I remember correctly, we've been seeing this behavior since Volcano 1.10, maybe even starting with 1.9.
Actually, we refactored the volume-binding-related code in v1.12; I'm not sure whether you would also hit this in earlier versions.
https://github.com/volcano-sh/volcano/issues/4369#issuecomment-2972612067 - strange behavior... How many resources did your Pod request? I want to confirm whether it is because the scheduler did not clean up the pod from the node after the resync. @rooty0
actually the External-Provisioner is responsible for removing the annotation if the provision met errors
Ah, that makes sense - thanks! I learned today that the external-provisioner isn't actually part of the OpenEBS codebase. It's just a sidecar container that vendors reuse to point to their actual CSI Controller Plugin. Pretty cool :)
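(For anyone else reading along, you can see that sidecar layout by listing the containers of the controller pods - the openebs namespace comes from the service account above:)
# prints each pod in the namespace followed by its container names
$ kubectl -n openebs get pods -o jsonpath='{range .items[*]}{.metadata.name}{": "}{range .spec.containers[*]}{.name}{" "}{end}{"\n"}{end}'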
From the pic, did the volcano scheduler stop updating annotations after a day? ( From 6.11 until 6.12)
Yeah, it stopped updating the annotations after about a day. The pod from the job has actually been stuck in Pending for 42 days now. Here's the same graph but for 45 days:
How many resources did your Pod request? I want to confirm whether it is because the scheduler did not clean up the pod on the node after resync
Here's the live pod manifest:
apiVersion: v1
kind: Pod
metadata:
annotations:
scheduling.k8s.io/group-name: test-a895933d-d863-4dfe-90c9-4dfa10d056d7
volcano.sh/job-name: test
volcano.sh/job-retry-count: "0"
volcano.sh/job-version: "0"
volcano.sh/queue-name: research
volcano.sh/task-index: "0"
volcano.sh/task-spec: default0
volcano.sh/template-uid: test-default0
creationTimestamp: "2025-06-03T17:27:29Z"
labels:
app: test
component: test
volcano.sh/job-name: test
volcano.sh/job-namespace: mltraining-dev
volcano.sh/queue-name: research
volcano.sh/task-index: "0"
volcano.sh/task-spec: default0
name: test-default0-0
namespace: mltraining-dev
ownerReferences:
- apiVersion: batch.volcano.sh/v1alpha1
blockOwnerDeletion: true
controller: true
kind: Job
name: test
uid: a895933d-d863-4dfe-90c9-4dfa10d056d7
resourceVersion: "1036111339"
uid: cd386367-87da-49ce-850b-21629db47d06
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: aaa.com/c-bucket
operator: In
values:
- stb
containers:
- command:
- /bin/bash
- -c
- |
set -euxo pipefail
sleep infinity
env:
- name: MY_POD_IP
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: status.podIP
image: ...omitted...
imagePullPolicy: IfNotPresent
name: main
resources:
limits:
cpu: "64"
memory: 864Gi
nvidia.com/gpu: "8"
vpc.amazonaws.com/efa: "16"
requests:
cpu: "64"
memory: 864Gi
nvidia.com/gpu: "8"
vpc.amazonaws.com/efa: "16"
securityContext:
allowPrivilegeEscalation: true
capabilities:
add:
- NET_RAW
privileged: true
readOnlyRootFilesystem: false
runAsGroup: 1000
runAsNonRoot: false
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /dev/shm
name: shm
subPath: shm
- mountPath: /tmp
name: local
subPath: tmp
- mountPath: /home/teex/.cache
name: local
subPath: home/teex/.cache
- mountPath: /home/teex/.triton
name: local
subPath: home/teex/.triton
- mountPath: /venv
name: local
subPath: venv
- mountPath: /data
name: nvme
subPath: data
- mountPath: /data-fast
name: data
- mountPath: /code
name: fsx-research
- mountPath: /checkpoint
name: fsx-research-checkpoints
- mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
name: aws-iam-token
readOnly: true
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-n6b7d
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: Never
schedulerName: volcano
securityContext:
fsGroup: 1000
fsGroupChangePolicy: OnRootMismatch
serviceAccount: mltraining-dev-sa
serviceAccountName: mltraining-dev-sa
subdomain: test
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoSchedule
key: workergroup
value: s-p-h200
- effect: NoSchedule
key: aaa.com/c-bucket
value: stb
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
- effect: NoSchedule
key: nvidia.com/gpu
operator: Exists
- effect: NoSchedule
key: vpc.amazonaws.com/efa
operator: Exists
volumes:
- name: aws-iam-token
projected:
defaultMode: 420
sources:
- serviceAccountToken:
audience: sts.amazonaws.com
expirationSeconds: 86400
path: token
- ephemeral:
volumeClaimTemplate:
metadata:
creationTimestamp: null
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 16Ti
storageClassName: lvm-nvme
volumeMode: Filesystem
name: data
- ephemeral:
volumeClaimTemplate:
metadata:
creationTimestamp: null
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 1Ti
storageClassName: lvm-nvme
volumeMode: Filesystem
name: local
- emptyDir:
medium: Memory
sizeLimit: 16Gi
name: shm
- ephemeral:
volumeClaimTemplate:
metadata:
creationTimestamp: null
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 2Ti
storageClassName: lvm-nvme
volumeMode: Filesystem
name: nvme
- name: fsx-research
persistentVolumeClaim:
claimName: fsx-research
- ephemeral:
volumeClaimTemplate:
metadata:
creationTimestamp: null
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 20Ti
storageClassName: lvm-nvme
volumeMode: Filesystem
name: data-nvme
- name: fsx-research-checkpoints
persistentVolumeClaim:
claimName: fsx-research-checkpoints
- name: kube-api-access-n6b7d
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2025-06-09T17:49:47Z"
message: '0/208 nodes are unavailable: 10 Insufficient memory, 12 node(s) had
volume node affinity conflict, 168 Insufficient cpu, 2 node(s) didn''t match
Pod''s node affinity/selector, 7 Insufficient vpc.amazonaws.com/efa, 9 Insufficient
nvidia.com/gpu.'
reason: Unschedulable
status: "False"
type: PodScheduled
phase: Pending
qosClass: Guaranteed
Thanks, I'll try to reproduce the prebind failure scenario and test it. I can see that your pod also requests CPU and memory - after the prebind failure, did the scheduler give the CPU and memory back to the node? I also want to confirm how many resources the ip-10-4-16-2.us-west-2.compute.internal node has.
BTW, have you fixed the LVM bind failure and are pods scheduling normally now? If there are still some errors, we can have a meeting to discuss :)
Here's a current snapshot of the node's resources:
$ k describe node ip-10-4-16-2.us-west-2.compute.internal
............
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1770m (0%) 10770m (5%)
memory 3461Mi (0%) 10335Mi (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
nvidia.com/gpu 0 0
vpc.amazonaws.com/efa 0 0
Events: <none>
# k get node ip-10-4-16-2.us-west-2.compute.internal -oyaml
#....
allocatable:
cpu: 191450m
ephemeral-storage: "493695146804"
hugepages-1Gi: "0"
hugepages-2Mi: 42242Mi
memory: 2051266916Ki
nvidia.com/gpu: "8"
pods: "198"
vpc.amazonaws.com/efa: "16"
capacity:
cpu: "192"
ephemeral-storage: 536858604Ki
hugepages-1Gi: "0"
hugepages-2Mi: 42242Mi
memory: 2097116516Ki
nvidia.com/gpu: "8"
pods: "198"
vpc.amazonaws.com/efa: "16"
I'm not quite sure how to get Volcano's "assumed pods" - the ones it has tentatively scheduled - for a given node. If there's an easy way to do that, I'd really appreciate any pointers.
BTW, have you fixed the LVM bind failure and schedule pods normally now?
I haven't made any changes to that part yet since we're still troubleshooting. I'm trying to leave everything related to this workload untouched so the environment stays in its original state - just to make sure we're not overlooking anything that might help explain what's going on.
The pending PVC is easy to fix, but I’m really focused on figuring out the "side" root cause.
we can have a meeting to discuss
I'd be happy to connect and troubleshoot the volcano scheduler! Please feel free to send me an email to github[at]rooty.name with your availability, and I'll do my best to accommodate your schedule.
Hi @rooty0, I have sent you an email; you can reply to me at any time or just ping me on Slack :)
I simulated the process of the CSI provisioner constantly removing annotations and the scheduler constantly adding them back on my machine (using a KIND cluster), but I still can't reproduce your situation (I don't know if it's because I only simulated it for a short duration). I have also tested that after a PreBind failure, resyncTask gives the resources back to the node in the scheduler cache, so I don't know why you found that there are no pods on the ip-10-4-16-2.us-west-2.compute.internal node but the task still can't be scheduled. (As you can see, ip-10-4-16-2.us-west-2.compute.internal has 192 CPUs, but in your logs there are only about 101 CPUs in future idle - why is there so much less CPU? Is it because a lot of resources have been allocated in the allocate action?) We need more details to clarify in the meeting or through Slack.
I have run the mock process for a long time, but I still can't reproduce this situation (the Volcano scheduler still keeps adding the annotation back).
I simulated the process of CSI provisioner constantly removing annotations and scheduler constantly adding annotations on my machine (using KIND cluster)