csi-driver
CSI Mount Error on Hetzner
Hi all,
I hope somebody has a hint for me, as I have already invested hours without finding a solution. In the beginning, everything with my KubeOne cluster on Hetzner worked as expected: auto-provisioning of block storage over the CSI interface behaved exactly as it should. For some days now, errors have been occurring across different namespaces when pods first start after provisioning via Helm.
The error message at pod startup:

```
MountVolume.MountDevice failed for volume "pvc-a0a20263-9462-4c3e-9ebb-be45b92da7f4" : rpc error: code = Internal desc = failed to stage volume: format of disk "/dev/disk/by-id/scsi-0HC_Volume_16943979" failed: type:("ext4") target:("/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-a0a20263-9462-4c3e-9ebb-be45b92da7f4/globalmount") options:("defaults") errcode:(exit status 1) output:(mke2fs 1.45.7 (28-Jan-2021) The file /dev/disk/by-id/scsi-0HC_Volume_16943979 does not exist and no size was specified. )
Unable to attach or mount volumes: unmounted volumes=[redis-pvc], unattached volumes=[redis-pvc default-token-g7mjq]: timed out waiting for the condition
```
Parts of the deployment file:

```yaml
spec:
  volumes:
    - name: redis-pvc
      persistentVolumeClaim:
        claimName: redis-pvc-chirpstack-redis-0
    - name: default-token-g7mjq
      secret:
        secretName: default-token-g7mjq
        defaultMode: 420
# ... and in the container spec:
      volumeMounts:
        - name: redis-pvc
          mountPath: /data
        - name: default-token-g7mjq
          readOnly: true
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
```
The PVC looks like this:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: redis-pvc-chirpstack-redis-0
  namespace: chirpstack
  selfLink: >-
    /api/v1/namespaces/chirpstack/persistentvolumeclaims/redis-pvc-chirpstack-redis-0
  uid: 70520ae2-b99f-4d7f-a625-5e97d1748dd9
  resourceVersion: '2367972'
  creationTimestamp: '2022-02-15T17:44:04Z'
  labels:
    app: redis
    release: chirpstack
  annotations:
    pv.kubernetes.io/bind-completed: 'yes'
    pv.kubernetes.io/bound-by-controller: 'yes'
    volume.beta.kubernetes.io/storage-provisioner: csi.hetzner.cloud
    volume.kubernetes.io/selected-node: oc4-pool1-779fc8f494-whkcs
  finalizers:
    - kubernetes.io/pvc-protection
  managedFields:
    - manager: kube-scheduler
      operation: Update
      apiVersion: v1
      time: '2022-02-15T17:44:04Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:volume.kubernetes.io/selected-node: {}
    - manager: kube-controller-manager
      operation: Update
      apiVersion: v1
      time: '2022-02-15T17:44:07Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            f:pv.kubernetes.io/bind-completed: {}
            f:pv.kubernetes.io/bound-by-controller: {}
            f:volume.beta.kubernetes.io/storage-provisioner: {}
          f:labels:
            .: {}
            f:app: {}
            f:release: {}
        f:spec:
          f:accessModes: {}
          f:resources:
            f:requests:
              .: {}
              f:storage: {}
          f:storageClassName: {}
          f:volumeMode: {}
          f:volumeName: {}
        f:status:
          f:accessModes: {}
          f:capacity:
            .: {}
            f:storage: {}
          f:phase: {}
status:
  phase: Bound
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 10Gi
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500M
  volumeName: pvc-70520ae2-b99f-4d7f-a625-5e97d1748dd9
  storageClassName: hcloud-volumes
  volumeMode: Filesystem
```
The PV:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-70520ae2-b99f-4d7f-a625-5e97d1748dd9
  selfLink: /api/v1/persistentvolumes/pvc-70520ae2-b99f-4d7f-a625-5e97d1748dd9
  uid: cc6f8d63-a75f-4954-88d3-4ad5c68ffbc1
  resourceVersion: '2367987'
  creationTimestamp: '2022-02-15T17:44:07Z'
  annotations:
    pv.kubernetes.io/provisioned-by: csi.hetzner.cloud
  finalizers:
    - kubernetes.io/pv-protection
    - external-attacher/csi-hetzner-cloud
  managedFields:
    - manager: csi-provisioner
      operation: Update
      apiVersion: v1
      time: '2022-02-15T17:44:07Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:pv.kubernetes.io/provisioned-by: {}
        f:spec:
          f:accessModes: {}
          f:capacity:
            .: {}
            f:storage: {}
          f:claimRef:
            .: {}
            f:apiVersion: {}
            f:kind: {}
            f:name: {}
            f:namespace: {}
            f:resourceVersion: {}
            f:uid: {}
          f:csi:
            .: {}
            f:driver: {}
            f:fsType: {}
            f:volumeAttributes:
              .: {}
              f:storage.kubernetes.io/csiProvisionerIdentity: {}
            f:volumeHandle: {}
          f:nodeAffinity:
            .: {}
            f:required:
              .: {}
              f:nodeSelectorTerms: {}
          f:persistentVolumeReclaimPolicy: {}
          f:storageClassName: {}
          f:volumeMode: {}
    - manager: kube-controller-manager
      operation: Update
      apiVersion: v1
      time: '2022-02-15T17:44:07Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          f:phase: {}
    - manager: csi-attacher
      operation: Update
      apiVersion: v1
      time: '2022-02-15T17:44:08Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:finalizers:
            v:"external-attacher/csi-hetzner-cloud": {}
status:
  phase: Bound
spec:
  capacity:
    storage: 10Gi
  csi:
    driver: csi.hetzner.cloud
    volumeHandle: '16943978'
    fsType: ext4
    volumeAttributes:
      storage.kubernetes.io/csiProvisionerIdentity: 1644479434542-8081-csi.hetzner.cloud
  accessModes:
    - ReadWriteOnce
  claimRef:
    kind: PersistentVolumeClaim
    namespace: chirpstack
    name: redis-pvc-chirpstack-redis-0
    uid: 70520ae2-b99f-4d7f-a625-5e97d1748dd9
    apiVersion: v1
    resourceVersion: '2367918'
  persistentVolumeReclaimPolicy: Delete
  storageClassName: hcloud-volumes
  volumeMode: Filesystem
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: csi.hetzner.cloud/location
              operator: In
              values:
                - nbg1
```
These messages happen across different deployments with different applications and services.
Thanks a lot, Martin
Are there any updates on this topic?
I've run into this issue as well. Restarting the node where the issue appeared fixed it, except for one volume which is still not working.
Any updates on this? I'm having the same issues
Do you have any updates?
We are also facing this issue. It occurs whenever a server has crashed after we try to add a new PVC. We then restart the server in the Hetzner Cloud Console. Once the server has started again, the new PVC is listed correctly in the Hetzner Cloud Console and in Kubernetes, but it does not exist in /dev/disk/by-id/.
Same issue; it happens only for large volumes. Is there a way to increase the timeout?
Same here. This driver is shockingly bad, and that's one thing, but these threads are mostly ignored as far as I can tell.
Same here on K3s v1.23.9+k3s1. Because I cannot check many things at the moment (like a node restart), I tried to help myself and created the mount manually in the right folder (don't repeat this without knowing the risk of data loss): `/var/lib/kubelet/pods/(POD-UUID)/volumes/kubernetes.io~csi/(PVC-Name)/mount`. The pod picked this up and started correctly. This doesn't replace a node restart, which I will check later.
For me the PVC was attached to the host and available in /dev/disk/by-id/ and /dev/. There were no errors in the hetzner-csi pods or controller, besides the same one as in the pod:

```
level=error ts=2022-09-29T09:31:10.425207836Z component=grpc-server msg="handler failed" err="rpc error: code = Internal desc = failed to publish volume: exit status 1\nmke2fs 1.45.7 (28-Jan-2021)\nThe file  does not exist and no size was specified.
```
If I can help, I will. We use `hetznercloud/hcloud-csi-driver:latest`, and the csi-driver-node binary was last updated/pulled on 2022-08-10.
Update: I'm asking Hetzner Support to look into this ticket. Maybe this helps.
@swarnat I left a similar message on another issue related to this problem. The problem is RKE or k3s: they just won't play nice with the Hetzner CSI. I've since switched to vanilla deployments (via KubeOne), with no issues whatsoever.
Every day an interesting project. Thanks for mentioning KubeOne; I had never heard of that tool and will check what we need to adjust to set up vanilla k8s ourselves.
This will probably help somebody: KubeOne works as a solution, but I cannot let go of this problem, because the vitobotta/hetzner-k3s project was my first step into the world of Kubernetes. So I investigated the problem a little more.

I tried node restarts, but that did not help. Then I manually removed the VolumeAttachment entry of the PVC, and this resolved the situation automatically. I suppose the problem lies in the code that checks whether a VolumeAttachment must be removed/reinitialized. I will try to debug the CSI driver a little, but this is probably beyond my knowledge at the moment.
So the manual workaround for me was:

```shell
# example: kubectl get volumeattachment | grep pvc-9a627590-1234-5678-905a-fb6089af008f
kubectl get volumeattachment | grep <pvc-id>

# example: kubectl delete volumeattachment csi-8e250e6be1c12345678aacf437308cf6650f431a90db43b1a4ce559f46da7ad6
kubectl delete volumeattachment <volumeattachment-id>
```
Thanks @swarnat, that workaround works.
```
hcloud-csi-driver level=error ts=2022-12-09T06:24:54.932343015Z component=grpc-server msg="handler failed" err="rpc error: code = Internal desc = failed to publish volume: exit status 1\nmke2fs 1.46.4 (18-Aug-2021)\nThe file  does not exist and no size was specified.\n"
```

Looking at the error, we can see there is an extra space after "file" in `The file  does not exist...`. Additionally, the log line `component=linux-mount-service msg="formatting disk" disk= fstype=ext4` reveals the `disk` variable is empty. It seems the CSI driver isn't getting the device path.

I had a quick look at the CSI spec; it seems the `NodePublishVolume()` request is the thing that is failing. Specifically, it looks like the path isn't in the map here: https://github.com/hetznercloud/csi-driver/blob/main/driver/node.go#L60. That `map[string]string` is apparently set by `ControllerPublishVolume()`:
```proto
message NodePublishVolumeRequest {
  // The ID of the volume to publish. This field is REQUIRED.
  string volume_id = 1;

  // The CO SHALL set this field to the value returned by
  // `ControllerPublishVolume` if the corresponding Controller Plugin
  // has `PUBLISH_UNPUBLISH_VOLUME` controller capability, and SHALL be
  // left unset if the corresponding Controller Plugin does not have
  // this capability. This is an OPTIONAL field.
  map<string, string> publish_context = 2;
```
This leads to https://github.com/hetznercloud/csi-driver/blob/main/driver/controller.go#L176-L190, which in turn calls https://github.com/hetznercloud/csi-driver/blob/6beac22cec3523a338241ead8a9f6f394c15c666/api/volume.go#L83-L101.
I'd guess the issue is in there somewhere; perhaps the hcloud API doesn't always return a `LinuxDevice`?
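One way to make that failure mode visible would be a guard along these lines (a sketch with made-up names, not the driver's actual code): if the API response carries no Linux device path, fail with a descriptive error instead of handing an empty path down to mkfs.

```go
package main

import "fmt"

// volume mirrors only the fields relevant here; in the hcloud API response
// the Linux device path could plausibly be empty while attaching is still
// in progress.
type volume struct {
	ID          int
	LinuxDevice string
}

// devicePathOrError is a hypothetical guard: fail fast with a clear message
// rather than passing an empty device path on to formatting/mounting.
func devicePathOrError(v *volume) (string, error) {
	if v.LinuxDevice == "" {
		return "", fmt.Errorf("volume %d: API returned no Linux device path (attach still in progress?)", v.ID)
	}
	return v.LinuxDevice, nil
}

func main() {
	if _, err := devicePathOrError(&volume{ID: 16943978}); err != nil {
		fmt.Println("error:", err)
	}
	if p, err := devicePathOrError(&volume{ID: 16943978, LinuxDevice: "/dev/disk/by-id/scsi-0HC_Volume_16943978"}); err == nil {
		fmt.Println("device:", p)
	}
}
```

With a check like this, the failure would surface as a clear gRPC error from the controller instead of a confusing mke2fs message on the node.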
https://github.com/hetznercloud/csi-driver/pull/343

I'll run my branch for a while and let people know if this resolves the issue. I am a bit skeptical, though: the `Attach` API should already be checking whether we are attached (as we'd expect it to): https://github.com/hetznercloud/csi-driver/blob/6beac22cec3523a338241ead8a9f6f394c15c666/api/volume.go#L216-L223

I guess there could be a race condition where the hcloud API reports that the device has been attached, but the underlying Linux server has not yet set up a mount path? Not sure.
It seems that if the `devicePath` is empty, we should still see the `attachmentMetadata` set in the `VolumeAttachment.status`: https://go.dev/play/p/4wBZ3naDYJT

Other than that, I'm not sure why the `attachmentMetadata` would be getting blanked out.
Thanks for the in-depth research @alam0rt!

Your fix should work and I am going to merge it soon, but I would still like to find out whether there is an underlying issue in our API.

It would be great to get some debug logs of the calls happening while the broken `VolumeAttachment` is created. For this, you can set the environment variable `HCLOUD_DEBUG=true` on the container `hcloud-csi-driver` in the Deployment `hcloud-csi-controller`.

If you happen to get some logs of this, you can send them to me via email: julian.toelle <at> hetzner-cloud.de
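Concretely, that env-var change amounts to a Deployment fragment along these lines (a sketch only; the container name matches the comment above, but verify it against your actual manifests and namespace):

```yaml
# Fragment of the hcloud-csi-controller Deployment spec
spec:
  template:
    spec:
      containers:
        - name: hcloud-csi-driver
          env:
            - name: HCLOUD_DEBUG
              value: "true"
```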
I'll set it to debug soon. The cluster I am using is just for personal use, so I don't bother keeping logs, but I'll see if I can capture it.
I spent some more time investigating the problems in this issue. I think it contains two different problems:

1. The original problem, where the volume was supposed to be attached according to the API and `NodePublishVolume` fails with:

```
MountVolume.MountDevice failed for volume "$PV_NAME" : rpc error: code = Internal desc = failed to stage volume: format of disk "/dev/disk/by-id/scsi-0HC_Volume_$VOLUME_ID" failed: type:("ext4") target:("/var/lib/kubelet/plugins/kubernetes.io/csi/pv/$PV_NAME/globalmount") options:("defaults") errcode:(exit status 1) output:(mke2fs 1.45.7 (28-Jan-2021) The file /dev/disk/by-id/scsi-0HC_Volume_$VOLUME_ID does not exist and no size was specified. )
```

Unfortunately I could not find a cause for this problem (yet); it would be great if all of the affected users could upgrade to the latest version of the driver and then send me debug logs (see here for debug logs).

2. The second problem, where publishing fails because the `NodePublishVolume` call is missing a path to the device. This fails with the message:

```
level=error ts=2022-09-29T09:31:10.425207836Z component=grpc-server msg="handler failed" err="rpc error: code = Internal desc = failed to publish volume: exit status 1\nmke2fs 1.45.7 (28-Jan-2021)\nThe file  does not exist and no size was specified.
```
As far as I could tell, this is what happened:

- The cluster was using csi-driver `v1.6.0`
- A volume was created and attached to a node
- The csi-driver running in the cluster was upgraded to `v2.0.0`+ (or `latest` at some point after 2022-02-15)
- The problem occurred
We made a change (#264) in the way we get the Linux device path of the volume (which is used in `NodePublishVolume`). Prior to the linked PR, we retrieved the volume from the API in `NodeStageVolume` and used that path to mount the volume. In an effort to remove any API calls from the `Node` part, we changed this mechanism to instead pass the device path from `ControllerPublishVolume` to `NodePublishVolume` using `PublishContext`. This `PublishContext` is saved on the `VolumeAttachment` object in Kubernetes.

As we only started setting the `PublishContext` for VolumeAttachments created after `v2.0.0` (or `latest` at some point after 2022-02-15), the fields are missing for older VolumeAttachments.
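The handover described above can be sketched roughly like this (simplified stand-in types and key names, not the driver's real code): the controller side puts the device path into the publish context, the node side reads it back, and an attachment created before the controller started setting the key simply has nothing for the node side to read.

```go
package main

import "fmt"

// devicePathKey is a simplified stand-in for the key the controller would
// use in the CSI publish context.
const devicePathKey = "devicePath"

// controllerPublish returns the publish context that the external-attacher
// persists on the VolumeAttachment object.
func controllerPublish(linuxDevice string) map[string]string {
	return map[string]string{devicePathKey: linuxDevice}
}

// nodeStage reads the device path back out; for attachments created by an
// older driver version the key is simply absent.
func nodeStage(publishContext map[string]string) (string, error) {
	path, ok := publishContext[devicePathKey]
	if !ok || path == "" {
		return "", fmt.Errorf("publish context has no %q: VolumeAttachment predates the new mechanism?", devicePathKey)
	}
	return path, nil
}

func main() {
	// New attachment: the context round-trips the device path.
	ctx := controllerPublish("/dev/disk/by-id/scsi-0HC_Volume_16943978")
	path, err := nodeStage(ctx)
	fmt.Println(path, err)

	// Old attachment: the context was never populated, so staging fails.
	_, err = nodeStage(map[string]string{})
	fmt.Println(err)
}
```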
You can run the following command to find any lingering VolumeAttachments in your clusters that might still be affected by this:

```shell
$ kubectl get volumeattachment -o custom-columns=NAME:.metadata.name,PVC:.spec.source.persistentVolumeName,ATTACHER:.spec.attacher,DEVICEPATH:.status.attachmentMetadata.devicePath | grep -E 'ATTACHER|csi.hetzner.cloud' --color=never
NAME                                                                   PVC                                        ATTACHER            DEVICEPATH
csi-052551d52da355ad5159b15ce67803b77154dd33c60d136ec504d59e1c0d1eac   pvc-e42ffe07-3eb5-4b24-865a-15a40c24b82b   csi.hetzner.cloud   /dev/disk/by-id/scsi-0HC_Volume_1234567890
csi-1ee89e51ea7f012153b3d2f117cfa37871286e870a5b2e0c808252968362e3ef   pvc-2cca061d-dc80-44f6-87bc-aab963c220d2   csi.hetzner.cloud   <none>
```

If you see any lines with `<none>` in the DEVICEPATH column, you need to recreate the VolumeAttachment; the workaround from @swarnat works for this.
The same issue came up for me today after upgrading the csi-driver from `1.6.0` to `2.1.0` according to the upgrade guide.

@apricote's command to find affected attachments was very helpful. It is important to note, though, that the corresponding pod(s) should be restarted (deleted) immediately afterwards, as PV access is lost directly after deleting the faulty VolumeAttachments, as described by @swarnat.

Thanks everyone for investigating this!
I can reproduce this constantly on Kubernetes 1.19. It no longer reproduces after upgrading to 1.20+. Big volumes (1 TB+) mount successfully.
I am running rook-ceph on top of hcloud volumes to make cross-region PVCs. I just updated our dev cluster from k3s v1.22.7 to v1.23.15 and saw that hcloud-csi should now also be upgraded… which I just did. To test it I recreated one OSD pod and ran into the issue mentioned here. Deleting the VolumeAttachment manually solves the first issue… but now I'm stuck at:

```
level=error ts=2023-01-20T08:11:53.57769856Z component=grpc-server msg="handler failed" err="rpc error: code = Internal desc = failed to publish volume: requested ext4 volume, but disk /dev/disk/by-id/scsi-0HC_Volume_13925705 already is formatted with ceph_bluestore"
```

Was there any other change regarding the filesystem in the migration from v1 to v2?

Edit 1: I remember there once was https://github.com/hetznercloud/csi-driver/issues/66, which enabled usage of rook-ceph in the first place… but something makes hcloud-csi think it should now mount this volume as ext4?

Edit 2: Uff, I guess it's #353? When's the release for this? :D Asking for a dev cluster.

Edit 3: FYI, downgrading to v1.6.0 even in a v1.23 cluster restored functionality, and no data on the rook-ceph OSDs was lost. So it did not reformat them… but it should just mount them as block devices. Maybe the next release fixes this… hopefully… until then I'll stay on v1.6.0.
Update: upgrading to v2.1.1 fixes the formatting issue.
But during the upgrade from v1.6 to v2.1.1, the VolumeAttachments have to be manually deleted (at least when a pod using the volume restarts (on the same node?)).
It's not nice, but I hope the next updates will not be so bumpy :D
This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.
This is still happening to me?
> This is still happening to me?

Yeah, same for me, I'm afraid. So in my dev cluster, I just cordon off the metal box when I need to do volume stuff :) Not ideal, but hey...
@aleksasiriski @cristianrat Please check my previous comment https://github.com/hetznercloud/csi-driver/issues/278#issuecomment-1351652816 and find out which exact issue you are experiencing. We still need some debug logs for the original issue, but no one has sent us any yet.
@apricote Happy to send you logs, even do a screen sharing session with you and go through the thing if it helps. Just let me know what you need
This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.
remove stale
Hey @korewaChino,
Please check my previous comment https://github.com/hetznercloud/csi-driver/issues/278#issuecomment-1351652816 and find out which exact issue you are experiencing. We still need some debug logs for the original issue, but no one has sent us any so far.
This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.