
CSI Mount Error on Hetzner

Open martinbe1io opened this issue 2 years ago • 27 comments

Hi all,

I hope somebody has a hint for me, as I have already invested hours without finding a solution. In the beginning, everything with my KubeOne cluster on Hetzner worked as expected: auto-provisioning of block storage over the CSI interface behaved exactly as it should. For a few days now, errors have been occurring across different namespaces at initial pod startup after provisioning via Helm.

The error message at pod startup:

MountVolume.MountDevice failed for volume "pvc-a0a20263-9462-4c3e-9ebb-be45b92da7f4" : rpc error: code = Internal desc = failed to stage volume: format of disk "/dev/disk/by-id/scsi-0HC_Volume_16943979" failed: type:("ext4") target:("/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-a0a20263-9462-4c3e-9ebb-be45b92da7f4/globalmount") options:("defaults") errcode:(exit status 1) output:(mke2fs 1.45.7 (28-Jan-2021) The file /dev/disk/by-id/scsi-0HC_Volume_16943979 does not exist and no size was specified. )
Unable to attach or mount volumes: unmounted volumes=[redis-pvc], unattached volumes=[redis-pvc default-token-g7mjq]: timed out waiting for the condition

Parts of the deployment file:

spec:
  volumes:
    - name: redis-pvc
      persistentVolumeClaim:
        claimName: redis-pvc-chirpstack-redis-0
    - name: default-token-g7mjq
      secret:
        secretName: default-token-g7mjq
        defaultMode: 420


  volumeMounts:
    - name: redis-pvc
      mountPath: /data
    - name: default-token-g7mjq
      readOnly: true
      mountPath: /var/run/secrets/kubernetes.io/serviceaccount

The PVC looks like this:


apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: redis-pvc-chirpstack-redis-0
  namespace: chirpstack
  selfLink: >-
    /api/v1/namespaces/chirpstack/persistentvolumeclaims/redis-pvc-chirpstack-redis-0
  uid: 70520ae2-b99f-4d7f-a625-5e97d1748dd9
  resourceVersion: '2367972'
  creationTimestamp: '2022-02-15T17:44:04Z'
  labels:
    app: redis
    release: chirpstack
  annotations:
    pv.kubernetes.io/bind-completed: 'yes'
    pv.kubernetes.io/bound-by-controller: 'yes'
    volume.beta.kubernetes.io/storage-provisioner: csi.hetzner.cloud
    volume.kubernetes.io/selected-node: oc4-pool1-779fc8f494-whkcs
  finalizers:
    - kubernetes.io/pvc-protection
  managedFields:
    - manager: kube-scheduler
      operation: Update
      apiVersion: v1
      time: '2022-02-15T17:44:04Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:volume.kubernetes.io/selected-node: {}
    - manager: kube-controller-manager
      operation: Update
      apiVersion: v1
      time: '2022-02-15T17:44:07Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            f:pv.kubernetes.io/bind-completed: {}
            f:pv.kubernetes.io/bound-by-controller: {}
            f:volume.beta.kubernetes.io/storage-provisioner: {}
          f:labels:
            .: {}
            f:app: {}
            f:release: {}
        f:spec:
          f:accessModes: {}
          f:resources:
            f:requests:
              .: {}
              f:storage: {}
          f:storageClassName: {}
          f:volumeMode: {}
          f:volumeName: {}
        f:status:
          f:accessModes: {}
          f:capacity:
            .: {}
            f:storage: {}
          f:phase: {}
status:
  phase: Bound
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 10Gi
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500M
  volumeName: pvc-70520ae2-b99f-4d7f-a625-5e97d1748dd9
  storageClassName: hcloud-volumes
  volumeMode: Filesystem

The PV:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-70520ae2-b99f-4d7f-a625-5e97d1748dd9
  selfLink: /api/v1/persistentvolumes/pvc-70520ae2-b99f-4d7f-a625-5e97d1748dd9
  uid: cc6f8d63-a75f-4954-88d3-4ad5c68ffbc1
  resourceVersion: '2367987'
  creationTimestamp: '2022-02-15T17:44:07Z'
  annotations:
    pv.kubernetes.io/provisioned-by: csi.hetzner.cloud
  finalizers:
    - kubernetes.io/pv-protection
    - external-attacher/csi-hetzner-cloud
  managedFields:
    - manager: csi-provisioner
      operation: Update
      apiVersion: v1
      time: '2022-02-15T17:44:07Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:pv.kubernetes.io/provisioned-by: {}
        f:spec:
          f:accessModes: {}
          f:capacity:
            .: {}
            f:storage: {}
          f:claimRef:
            .: {}
            f:apiVersion: {}
            f:kind: {}
            f:name: {}
            f:namespace: {}
            f:resourceVersion: {}
            f:uid: {}
          f:csi:
            .: {}
            f:driver: {}
            f:fsType: {}
            f:volumeAttributes:
              .: {}
              f:storage.kubernetes.io/csiProvisionerIdentity: {}
            f:volumeHandle: {}
          f:nodeAffinity:
            .: {}
            f:required:
              .: {}
              f:nodeSelectorTerms: {}
          f:persistentVolumeReclaimPolicy: {}
          f:storageClassName: {}
          f:volumeMode: {}
    - manager: kube-controller-manager
      operation: Update
      apiVersion: v1
      time: '2022-02-15T17:44:07Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          f:phase: {}
    - manager: csi-attacher
      operation: Update
      apiVersion: v1
      time: '2022-02-15T17:44:08Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:finalizers:
            v:"external-attacher/csi-hetzner-cloud": {}
status:
  phase: Bound
spec:
  capacity:
    storage: 10Gi
  csi:
    driver: csi.hetzner.cloud
    volumeHandle: '16943978'
    fsType: ext4
    volumeAttributes:
      storage.kubernetes.io/csiProvisionerIdentity: 1644479434542-8081-csi.hetzner.cloud
  accessModes:
    - ReadWriteOnce
  claimRef:
    kind: PersistentVolumeClaim
    namespace: chirpstack
    name: redis-pvc-chirpstack-redis-0
    uid: 70520ae2-b99f-4d7f-a625-5e97d1748dd9
    apiVersion: v1
    resourceVersion: '2367918'
  persistentVolumeReclaimPolicy: Delete
  storageClassName: hcloud-volumes
  volumeMode: Filesystem
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: csi.hetzner.cloud/location
              operator: In
              values:
                - nbg1

These messages happen across different deployments with different applications/services.

Thanks a lot, Martin

martinbe1io avatar Feb 18 '22 12:02 martinbe1io

Are there any updates on this topic?

I've run into this issue as well. Restarting the node where the issue appeared fixed it, except for one volume which is still not working.

llama0815 avatar Mar 28 '22 14:03 llama0815

Any updates on this? I'm having the same issues

lukashoffmannhb avatar Jun 02 '22 09:06 lukashoffmannhb

Do you have any updates?

We are also facing this issue. It occurs every time a server crashes after we try to add a new PVC. We then restart the server in the Hetzner Cloud Console. When the server has started again, the new PVC is listed correctly in the Hetzner Cloud Console and in Kubernetes, but it does not exist in /dev/disk/by-id/.

Neneil94 avatar Jun 02 '22 13:06 Neneil94

Same issue; it happens only for large volumes. Is there a way to increase the timeout?

InsOpDe avatar Aug 15 '22 12:08 InsOpDe

Same here. This driver is shockingly bad, and that's one thing, but these threads are mostly ignored as far as I can tell.

cristianrat avatar Aug 18 '22 20:08 cristianrat

Same here on K3s v1.23.9+k3s1. Because I cannot check many things at the moment, like a node restart etc., I tried to help myself and created the mount manually in the right folder (don't repeat this without knowing the risk of data loss): /var/lib/kubelet/pods/(POD-UUID)/volumes/kubernetes.io~csi/(PVC-Name)/mount. The pod picked this up and started correctly. This doesn't replace a node restart, which I will check later.

For me the PVC was attached to the host and available in /dev/disk/by-id/ and /dev/. There were no errors in the hetzner-csi pods or the controller, apart from the same message as in the pod:

level=error ts=2022-09-29T09:31:10.425207836Z component=grpc-server msg="handler failed" err="rpc error: code = Internal desc = failed to publish volume: exit status 1\nmke2fs 1.45.7 (28-Jan-2021)\nThe file does not exist and no size was specified.
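
For anyone hitting this, a quick sketch of how to check whether the device actually exists (run this on the node the pod was scheduled to):

# on the node the pod was scheduled to
ls -l /dev/disk/by-id/ | grep HC_Volume
# a healthy attachment shows a symlink like: scsi-0HC_Volume_16943979 -> ../../sdb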

If I can help, I will.

We use "hetznercloud/hcloud-csi-driver:latest" and csi-driver-node binary was updated/pulled on 10.08.2022

Update: I have asked Hetzner Support to look into this ticket. Maybe that helps.

swarnat avatar Sep 29 '22 09:09 swarnat

@swarnat I left a similar message on another issue related to this problem. The problem is RKE or k3s: it just won't play nice with the Hetzner CSI driver. I have since switched to vanilla Kubernetes deployments (via KubeOne), and have had no issues whatsoever.

cristianrat avatar Sep 29 '22 09:09 cristianrat

Every day an interesting project. Thanks for mentioning KubeOne. I had never heard of that tool and will check what we need to adjust to set up vanilla k8s ourselves.

swarnat avatar Sep 29 '22 11:09 swarnat

Maybe it will help somebody: KubeOne works as a solution, but I cannot let this problem go, because the "vitobotta/hetzner-k3s" project was my first step into the world of Kubernetes. So I investigated the problem a bit more.

I tried node restarts, but they did not help. Then I manually removed the VolumeAttachment entry of the PVC, and this resolved the situation automatically. I suspect the problem lies in the code that checks whether a VolumeAttachment must be removed/reinitialized. I will try to debug the CSI driver a little, but that is probably beyond my knowledge at the moment.

So the manual workaround for me was:

# example: kubectl get volumeattachment | grep pvc-9a627590-1234-5678-905a-fb6089af008f
kubectl get volumeattachment | grep pvc-id

# example: kubectl delete volumeattachment csi-8e250e6be1c12345678aacf437308cf6650f431a90db43b1a4ce559f46da7ad6
kubectl delete volumeattachment csi-id

swarnat avatar Oct 10 '22 22:10 swarnat

Thanks @swarnat, that workaround works.

hcloud-csi-driver level=error ts=2022-12-09T06:24:54.932343015Z component=grpc-server msg="handler failed" err="rpc error: code = Internal desc = failed to publish volume: exit status 1\nmke2fs 1.46.4 (18-Aug-2021)\nThe file  does not exist and no size was specified.\n"

Looking at the error, we can see there is an extra space after "file" in "file does not...". Additionally, the log line component=linux-mount-service msg="formatting disk" disk= fstype=ext4 reveals that the disk variable is empty. It seems the CSI driver isn't getting the device path.
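
If you want to confirm you are hitting this empty-device-path case, here is a rough sketch for pulling that log line from the node plugin; the namespace, DaemonSet, and container names are assumptions based on a default install, so adjust them to your deployment:

# names and namespace are assumptions; adjust to your install
kubectl -n kube-system logs daemonset/hcloud-csi-node -c hcloud-csi-driver | grep 'formatting disk'
# an affected node logs an empty disk= field, e.g.:
#   component=linux-mount-service msg="formatting disk" disk= fstype=ext4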

I had a quick look at the CSI spec; it seems the NodePublishVolume() request is the thing that is failing. Specifically, it looks like the path isn't in the map here: https://github.com/hetznercloud/csi-driver/blob/main/driver/node.go#L60.

That map[string]string is apparently set by ControllerPublishVolume():

message NodePublishVolumeRequest {
  // The ID of the volume to publish. This field is REQUIRED.
  string volume_id = 1;

  // The CO SHALL set this field to the value returned by
  // `ControllerPublishVolume` if the corresponding Controller Plugin
  // has `PUBLISH_UNPUBLISH_VOLUME` controller capability, and SHALL be
  // left unset if the corresponding Controller Plugin does not have
  // this capability. This is an OPTIONAL field.
  map<string, string> publish_context = 2;

This leads to https://github.com/hetznercloud/csi-driver/blob/main/driver/controller.go#L176-L190 which in turn calls https://github.com/hetznercloud/csi-driver/blob/6beac22cec3523a338241ead8a9f6f394c15c666/api/volume.go#L83-L101.

I'd guess the issue is in there somewhere; perhaps the hcloud API doesn't always return a LinuxDevice?

alam0rt avatar Dec 09 '22 21:12 alam0rt

https://github.com/hetznercloud/csi-driver/pull/343

I'll run my branch for a while and let people know if this resolves the issue.

I am a bit skeptical though; the Attach API should already be checking whether we are attached (as we'd expect it to): https://github.com/hetznercloud/csi-driver/blob/6beac22cec3523a338241ead8a9f6f394c15c666/api/volume.go#L216-L223

I guess there could be a race condition where the hcloud API reports that the device has been attached, but the underlying Linux server has not yet set up a mount path? Not sure.

It seems that if the devicePath is empty, we should still see the attachmentMetadata set in the VolumeAttachment .status: https://go.dev/play/p/4wBZ3naDYJT
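
A small sketch for checking that directly on a given attachment (the VolumeAttachment name is a placeholder):

# <volumeattachment-name> is a placeholder
kubectl get volumeattachment <volumeattachment-name> -o jsonpath='{.status.attachmentMetadata}'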

alam0rt avatar Dec 09 '22 23:12 alam0rt

Other than that, I'm not sure why the attachmentMetadata would be getting blanked out.

alam0rt avatar Dec 10 '22 00:12 alam0rt

Thanks for the in-depth research @alam0rt!

Your fix should work and I am going to merge this soon, but I would still like to find out if there is an underlying issue in our API.

It would be great to get some debug logs of the calls happening while the broken VolumeAttachment is created. For this, you can set the environment variable HCLOUD_DEBUG=true on the container hcloud-csi-driver in the Deployment hcloud-csi-controller.
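
A minimal sketch of how to set that with kubectl, assuming the controller Deployment lives in the kube-system namespace (adjust the namespace to wherever the driver is installed):

# the namespace is an assumption; adjust to your install
kubectl -n kube-system set env deployment/hcloud-csi-controller -c hcloud-csi-driver HCLOUD_DEBUG=true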

If you happen to get some logs of this, you can send them to me via email: julian.toelle <at> hetzner-cloud.de

apricote avatar Dec 13 '22 11:12 apricote

I'll set it to debug soon. The cluster I am using is just for personal use, so I don't bother keeping logs, but I will see if I can capture it.

alam0rt avatar Dec 14 '22 00:12 alam0rt

I spent some more time investigating the problems in this issue. I think this issue contains two different problems:


The original problem, where the volume was supposed to be attached according to the API, and NodePublishVolume fails with:

MountVolume.MountDevice failed for volume "$PV_NAME" : rpc error: code = Internal desc = failed to stage volume: format of disk "/dev/disk/by-id/scsi-0HC_Volume_$VOLUME_ID" failed: type:("ext4") target:("/var/lib/kubelet/plugins/kubernetes.io/csi/pv/$PV_NAME/globalmount") options:("defaults") errcode:(exit status 1) output:(mke2fs 1.45.7 (28-Jan-2021) The file /dev/disk/by-id/scsi-0HC_Volume_$VOLUME_ID does not exist and no size was specified. )

Unfortunately, I could not find a cause for this problem (yet). It would be great if all affected users could upgrade to the latest version of the driver and then send me debug logs (see here for debug logs).


The second problem, where the publishing fails because the NodePublishVolume call is missing a path to the device. This fails with the message:

level=error ts=2022-09-29T09:31:10.425207836Z component=grpc-server msg="handler failed" err="rpc error: code = Internal desc = failed to publish volume: exit status 1\nmke2fs 1.45.7 (28-Jan-2021)\nThe file does not exist and no size was specified.

As far as I could tell this happened:

  • The cluster was using csi-driver v1.6.0
  • A volume was created and attached to a node
  • The csi-driver running in the cluster was upgraded to v2.0.0+ (or latest at some point after 2022-02-15)
  • The problem occurred

We made a change (#264) in the way we get the linux device path of the volume (which is used in NodePublishVolume). Prior to the linked PR, we retrieved the volume from the API in NodeStageVolume and used that path to mount the volume. In an effort to remove any API calls from the Node part, we changed this mechanism to instead pass the device path from ControllerPublishVolume to NodePublishVolume using PublishContext. This PublishContext is saved on the VolumeAttachment object in kubernetes.

As we only started setting the PublishContext for VolumeAttachments created after v2.0.0 (or latest at some point after 2022-02-15), the fields are missing for older VolumeAttachments.

You can run the following command to find any lingering VolumeAttachments in your clusters that might still be affected by this:

$ kubectl get volumeattachment -o custom-columns=NAME:.metadata.name,PVC:.spec.source.persistentVolumeName,ATTACHER:.spec.attacher,DEVICEPATH:.status.attachmentMetadata.devicePath | grep -E 'ATTACHER|csi.hetzner.cloud' --color=never
NAME                                                                   PVC                                        ATTACHER            DEVICEPATH
csi-052551d52da355ad5159b15ce67803b77154dd33c60d136ec504d59e1c0d1eac   pvc-e42ffe07-3eb5-4b24-865a-15a40c24b82b   csi.hetzner.cloud   /dev/disk/by-id/scsi-0HC_Volume_1234567890
csi-1ee89e51ea7f012153b3d2f117cfa37871286e870a5b2e0c808252968362e3ef   pvc-2cca061d-dc80-44f6-87bc-aab963c220d2   csi.hetzner.cloud   <none>

If you see any lines with <none> in the DEVICEPATH column, you need to recreate the VolumeAttachment; the workaround from @swarnat works for this (see the sketch below).
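
As a sketch, recreating such an attachment boils down to deleting it and restarting the affected pod so a fresh VolumeAttachment (with a devicePath) is created; the names below are placeholders, and expect a short interruption for the workload using the volume:

# placeholder names; use the VolumeAttachment shown with <none> above
kubectl delete volumeattachment csi-1ee89e51ea7f012153b3d2f117cfa37871286e870a5b2e0c808252968362e3ef
# restart the pod(s) using the PV so a new attachment is created
kubectl -n <namespace> delete pod <pod-using-the-volume>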

apricote avatar Dec 14 '22 15:12 apricote

The same issue came up for me today after upgrading the csi-driver from 1.6.0 to 2.1.0 according to the upgrade guide.

@apricote's command to find affected attachments was very helpful. It is important to note, though, that the corresponding pod(s) should be restarted (deleted) immediately afterwards, as PV access is lost directly after deleting the faulty VolumeAttachments, as described by @swarnat.

Thanks everyone for investigating this!

bvehse avatar Jan 05 '23 17:01 bvehse

I could reproduce this constantly on Kubernetes 1.19. It does not reproduce after upgrading to 1.20+. Big volumes (1TB+) mount successfully.

enterdv avatar Jan 19 '23 09:01 enterdv

I am running rook-ceph on top of hcloud volumes to make cross-region PVCs. I just updated our dev cluster's k3s from v1.22.7 to v1.23.15 and saw that hcloud-csi should now also be upgraded…

Which I just did… and to test it I recreated one OSD pod and ran into the issue mentioned here. Deleting the volumeattachment manually solves the first issue… but now I'm stuck at…

level=error ts=2023-01-20T08:11:53.57769856Z component=grpc-server msg="handler failed" err="rpc error: code = Internal desc = failed to publish volume: requested ext4 volume, but disk /dev/disk/by-id/scsi-0HC_Volume_13925705 already is formatted with ceph_bluestore"

Was there any other change affecting the filesystem in the migration from v1 to v2?

Edit 1: I remember there once was https://github.com/hetznercloud/csi-driver/issues/66, which enabled usage of rook-ceph in the first place… but something now makes hcloud-csi think it should mount this volume as ext4?

Edit 2: Uff, I guess it's this #353 ? When's the release for this? :D Asking for a dev cluster.

Edit 3: FYI, downgrading to v1.6.0, even in a v1.23 cluster, did restore functionality, and no data on the rook-ceph OSDs was lost. So it did not reformat them… but then again, it should just mount them as block devices. Maybe the next release fixes this… hopefully… until then I'll stay on v1.6.0.

toabi avatar Jan 20 '23 08:01 toabi

Update: upgrading to v2.1.1 fixes the formatting issue.

But during the upgrade from v1.6 to v2.1.1, the VolumeAttachments have to be manually deleted (at least when a pod using the volume restarts (on the same node?)).

It's not nice, but I hope the next updates will not be so bumpy :D

toabi avatar Jan 23 '23 15:01 toabi

This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.

github-actions[bot] avatar Apr 24 '23 12:04 github-actions[bot]

This is still happening to me?

aleksasiriski avatar Sep 19 '23 18:09 aleksasiriski

This is still happening to me?

Yeah, same for me, I'm afraid. So in my dev cluster, I just cordon off the metal box when I need to do volume stuff :) Not ideal, but hey...

cristianrat avatar Sep 19 '23 19:09 cristianrat

@aleksasiriski @cristianrat Please check my previous comment https://github.com/hetznercloud/csi-driver/issues/278#issuecomment-1351652816 and find out which exact issue you are experiencing. We still need some debug logs for the original issue, but no one has sent us any yet.

apricote avatar Sep 20 '23 05:09 apricote

@apricote Happy to send you logs, or even do a screen-sharing session with you and go through it if that helps. Just let me know what you need.

cristianrat avatar Sep 20 '23 06:09 cristianrat

This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.

github-actions[bot] avatar Dec 19 '23 12:12 github-actions[bot]

remove stale

korewaChino avatar Dec 19 '23 13:12 korewaChino

Hey @korewaChino,

Please check my previous comment https://github.com/hetznercloud/csi-driver/issues/278#issuecomment-1351652816 and find out which exact issue you are experiencing. We still need some debug logs for the original issue, but no one has sent us any so far.

apricote avatar Jan 15 '24 09:01 apricote

This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.

github-actions[bot] avatar Apr 14 '24 14:04 github-actions[bot]