No topology key found on hw nodes
In clusters with hardware nodes, a new PVC and its workload can get stuck in `Pending` state if they are scheduled without a `nodeAffinity`.
Steps to reproduce:
- run a cluster that includes a hardware worker, and label the hw node with `instance.hetzner.cloud/is-root-server=true` as mentioned in the README
- install the CSI driver according to the instructions
- apply the test PVC and pod mentioned in the README (see the sketch below), using the default storageClass with `WaitForFirstConsumer` volumeBindingMode
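For reference, the test PVC and pod from the README look roughly like this (a sketch, not verbatim; names, image, and size may differ):

```yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: csi-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: hcloud-volumes
---
kind: Pod
apiVersion: v1
metadata:
  name: my-csi-app
spec:
  containers:
  - name: my-frontend
    image: busybox
    command: ["sleep", "1000000"]
    volumeMounts:
    - mountPath: /data
      name: my-csi-volume
  volumes:
  - name: my-csi-volume
    persistentVolumeClaim:
      claimName: csi-pvc
```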
Expected Behaviour:
hcloud-csi-controller should provide the desired/required topology constraints to the k8s scheduler, which then schedules the pod on a node fulfilling the topology requirements. As the hardware node does not run the csi-driver and cannot mount Hetzner Cloud volumes, the workload should not be scheduled there.
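For illustration, when provisioning succeeds, the PV created by the driver carries a node affinity derived from its topology key, roughly like this (location value illustrative); this is what keeps the scheduler away from nodes that cannot mount the volume:

```yaml
# Node affinity as it appears on a provisioned hcloud PV (illustrative values):
# only nodes labeled with the volume's location are eligible.
spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: csi.hetzner.cloud/location
          operator: In
          values:
          - fsn1
```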
Observed Behaviour:
- Both PVC and pod are stuck in `Pending` state.
- the container `csi-provisioner` of the CSI Controller deployment logs this error:

```
'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "hcloud-volumes": error generating accessibility requirements: no topology key found on CSINode hardwarenode.testcluster
```
More Info: Tested with csi-driver 2.1.1 as well as 2.2.0, together with csi-provisioner 3.4.0
- the DaemonSet for `hcloud-csi-node` does not run on the hw node
- because of this, the `csinode` object for the node lists no driver:

```
kubectl get csinode

NAME                       DRIVERS   AGE
virtualnode.testcluster    1         1d
hardwarenode.testcluster   0         1d
```
- the `csinode` object of the virtual node looks ok:

```
kubectl get csinode virtualnode.testcluster -oyaml

apiVersion: storage.k8s.io/v1
kind: CSINode
...
spec:
  drivers:
  - allocatable:
      count: 16
    name: csi.hetzner.cloud
    nodeID: "12769030"
    topologyKeys:
    - csi.hetzner.cloud/location
```
- the `csinode` object of the hardware node has no driver and therefore no topology key, as the node intentionally runs no `hcloud-csi-node` pod due to the `nodeAffinity`:

```
kubectl get csinode hardwarenode.testcluster -oyaml

apiVersion: storage.k8s.io/v1
kind: CSINode
...
spec:
  drivers: null
```
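For context, the `hcloud-csi-node` DaemonSet keeps itself off root servers with a node affinity along these lines (a sketch; consult the shipped deployment manifests for the authoritative version):

```yaml
# Sketch of the affinity on the hcloud-csi-node DaemonSet pod template:
# it excludes nodes labeled as Hetzner root (hardware) servers, which is
# why those nodes never get a csi.hetzner.cloud entry in their CSINode object.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: instance.hetzner.cloud/is-root-server
          operator: NotIn
          values:
          - "true"
```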
Theory
It seems we are hitting this issue in csi-provisioner.
As the hardware node has no csi-driver pod and therefore lists no driver or topology key, the csi-provisioner breaks: while building the preferred topology to hand to the k8s scheduler, it encounters the hardware node's missing topology key and fails. Pod and PVC can never finish scheduling and remain in `Pending` state forever.
Workaround
This issue can be avoided by making sure the object that uses the PVC (StatefulSet, Pod, etc.) cannot be scheduled on the hardware node in the first place. This can be done by specifying a `nodeAffinity`:
```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: instance.hetzner.cloud/is-root-server
            operator: NotIn
            values:
            - "true"
```
Proposed Solution
The external-provisioner issue lists a few possible solutions on the csi-driver side, such as running the csi-driver on all nodes, including hardware nodes. The CSI controller would then need to be aware of which nodes are virtual and which are hardware when providing the topology preferences to the k8s scheduler.
Having the same issue; it seems that the wrong node is selected:
```yaml
- apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    annotations:
      volume.beta.kubernetes.io/storage-provisioner: csi.hetzner.cloud
      volume.kubernetes.io/selected-node: production-agent-large-srd
      volume.kubernetes.io/storage-provisioner: csi.hetzner.cloud
    creationTimestamp: "2023-03-24T05:54:30Z"
    finalizers:
    - kubernetes.io/pvc-protection
    labels:
      app.kubernetes.io/component: primary
      app.kubernetes.io/instance: pcf-app
      app.kubernetes.io/name: postgresql
    name: data-pcf-app-postgresql-0
    namespace: pen-testing
    resourceVersion: "4164426"
    uid: 0c39bdac-5540-4a34-b274-151a6409cdbf
  spec:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 8Gi
    storageClassName: hcloud-volumes
    volumeMode: Filesystem
  status:
    phase: Pending
- apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    annotations:
      meta.helm.sh/release-name: reconmap-app
      meta.helm.sh/release-namespace: pen-testing
      pv.kubernetes.io/bind-completed: "yes"
      pv.kubernetes.io/bound-by-controller: "yes"
      volume.beta.kubernetes.io/storage-provisioner: csi.hetzner.cloud
      volume.kubernetes.io/selected-node: production-storage-yhq
      volume.kubernetes.io/storage-provisioner: csi.hetzner.cloud
    creationTimestamp: "2023-03-22T08:09:04Z"
    finalizers:
    - kubernetes.io/pvc-protection
    labels:
      app: mysql
      app.kubernetes.io/managed-by: Helm
    name: reconmap-app-mysql-pv-claim
    namespace: pen-testing
    resourceVersion: "3367563"
    uid: e355ac30-2136-4193-8264-04e33bc335c8
  spec:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 20Gi
    storageClassName: hcloud-volumes
    volumeMode: Filesystem
    volumeName: pvc-e355ac30-2136-4193-8264-04e33bc335c8
  status:
    accessModes:
    - ReadWriteOnce
    capacity:
      storage: 20Gi
    phase: Bound
```
Seeing these logs:
```
17m  Normal   WaitForFirstConsumer  persistentvolumeclaim/data-pcf-app-postgresql-0  waiting for first consumer to be created before binding
12m  Normal   ExternalProvisioning  persistentvolumeclaim/data-pcf-app-postgresql-0  waiting for a volume to be created, either by external provisioner "csi.hetzner.cloud" or manually created by system administrator
12m  Normal   Provisioning          persistentvolumeclaim/data-pcf-app-postgresql-0  External provisioner is provisioning volume for claim "pen-testing/data-pcf-app-postgresql-0"
12m  Warning  ProvisioningFailed    persistentvolumeclaim/data-pcf-app-postgresql-0  failed to provision volume with StorageClass "hcloud-volumes": error generating accessibility requirements: no topology key found on CSINode production-agent-large-srd
10m  Normal   WaitForFirstConsumer  persistentvolumeclaim/data-pcf-app-postgresql-0  waiting for first consumer to be created before binding
6s   Normal   ExternalProvisioning  persistentvolumeclaim/data-pcf-app-postgresql-0  waiting for a volume to be created, either by external provisioner "csi.hetzner.cloud" or manually created by system administrator
61s  Normal   Provisioning          persistentvolumeclaim/data-pcf-app-postgresql-0  External provisioner is provisioning volume for claim "pen-testing/data-pcf-app-postgresql-0"
61s  Warning  ProvisioningFailed    persistentvolumeclaim/data-pcf-app-postgresql-0  failed to provision volume with StorageClass "hcloud-volumes": error generating accessibility requirements: no topology key found on CSINode production-agent-large-srd
```
When I manually update the `volume.kubernetes.io/selected-node` annotation to `production-storage-yhq`, it works.
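For the record, that manual update can be done with something like this (PVC and node names taken from the dump above):

```
kubectl -n pen-testing annotate pvc data-pcf-app-postgresql-0 \
  volume.kubernetes.io/selected-node=production-storage-yhq --overwrite
```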
As per the hint in the linked issue, perhaps this can be easily solved by setting allowed topologies on the StorageClass? That is, assuming the StorageClass has an `allowedTopologies` selector that accurately matches hcloud nodes only, then we can be sure the Kubernetes scheduler won't try to schedule a Pod with hcloud PVC attachment(s) on non-hcloud nodes.
This only solves the issue for Kube; I have no idea about Swarm/Nomad.
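A sketch of what that could look like, assuming every node that can mount hcloud volumes carries the `csi.hetzner.cloud/location` label and all volumes live in one location (`fsn1` here, purely illustrative):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hcloud-volumes-fsn1   # hypothetical name; other fields omitted
provisioner: csi.hetzner.cloud
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: csi.hetzner.cloud/location
    values:
    - fsn1  # illustrative; list the location(s) your volumes can be created in
```

Untested whether this actually avoids the provisioner error on mixed clusters.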
This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.