
Random volume mount error

nmass-betpawa opened this issue 3 years ago • 17 comments

This error has been following me for the last year.

tax-mysql-0                                       0/2     Init:0/1   0          3m5s
Warning  FailedAttachVolume  119s   attachdetach-controller  AttachVolume.Attach failed for volume "pvc-b269c782-6c14-41d8-a043-9b8c890e485a" : rpc error: code = Internal desc = failed to publish volume: Volume assigning step has failed due to an unknown error. (unknown_error)

A pod randomly fails to mount its volume and then stays in the Init state forever. Sometimes the cloud node the pod was running on goes offline, but not always; I then need to go to the Hetzner console and turn the node on manually. The workaround is to delete the pod and wait for another attempt. csi-driver version 1.6.0, Kubernetes 1.19.9.
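For reference, a minimal sketch of that delete-and-retry workaround with kubectl, using the pod name from the event above (adjust the namespace to wherever the StatefulSet runs):

kubectl describe pod tax-mysql-0   # shows the FailedAttachVolume event
kubectl delete pod tax-mysql-0     # the StatefulSet controller recreates the pod and the attach is retried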

nmass-betpawa avatar Feb 21 '22 10:02 nmass-betpawa

I am getting the same error. Restarting the pod does not help.

The volume is being bound and created but cannot be attached. This only affects one pod in a StatefulSet of 4.

ditschedev avatar Mar 01 '22 10:03 ditschedev

@nmass-betpawa Hard to tell. Can you provide logs of the relevant pods (hcloud-csi-node-xxx and hcloud-csi-controller-xxx)?

choffmeister avatar Mar 01 '22 15:03 choffmeister

kubectl -n kube-system logs hcloud-csi-controller-0 -c hcloud-csi-driver
level=info ts=2022-03-04T18:27:07.770537039Z component=api-volume-service msg="attaching volume" volume-id=17427653 server-id=13700044
level=info ts=2022-03-04T18:27:07.771223989Z component=api-volume-service msg="attaching volume" volume-id=17427671 server-id=13700044
level=info ts=2022-03-04T18:27:12.805624266Z component=api-volume-service msg="failed to attach volume" volume-id=17427671 server-id=13700044 err="cannot perform operation because server is locked (locked)"
level=error ts=2022-03-04T18:27:12.863666053Z component=grpc-server msg="handler failed" err="rpc error: code = Unavailable desc = failed to publish volume: server is locked"
level=info ts=2022-03-04T18:27:17.336704814Z component=api-volume-service msg="failed to attach volume" volume-id=17427653 server-id=13700044 err="Volume assigning step has failed due to an unknown error. (unknown_error)"
level=error ts=2022-03-04T18:27:17.390385673Z component=grpc-server msg="handler failed" err="rpc error: code = Internal desc = failed to publish volume: Volume assigning step has failed due to an unknown error. (unknown_error)"

kubectl -n kube-system logs hcloud-csi-controller-0 -c csi-provisioner | grep pvc-61295e91-e439-4423-a91d-f1794c1c3f3c
I0304 17:49:15.441626       1 controller.go:1439] provision "pr-104/admin-managment-database" class "hcloud-volumes": volume "pvc-61295e91-e439-4423-a91d-f1794c1c3f3c" provisioned
I0304 17:49:16.347715       1 controller.go:1341] provision "pr-104/admin-managment-database" class "hcloud-volumes": persistentvolume "pvc-61295e91-e439-4423-a91d-f1794c1c3f3c" already exists, skipping
I0304 17:49:16.359257       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"pr-104", Name:"admin-managment-database", UID:"61295e91-e439-4423-a91d-f1794c1c3f3c", APIVersion:"v1", ResourceVersion:"608162803", FieldPath:""}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-61295e91-e439-4423-a91d-f1794c1c3f3c

no logs here

kubectl -n kube-system logs hcloud-csi-node-q85tq -c hcloud-csi-driver
level=info ts=2022-03-04T18:17:31.125716915Z msg="fetched server" server-name=kube-node-40-qa

kubectl -n kube-system logs hcloud-csi-node-q85tq -c csi-node-driver-registrar
I0304 18:17:29.334885       1 main.go:113] Version: v2.2.0
I0304 18:17:31.242022       1 node_register.go:52] Starting Registration Server at: /registration/csi.hetzner.cloud-reg.sock
I0304 18:17:31.242615       1 node_register.go:61] Registration Server started at: /registration/csi.hetzner.cloud-reg.sock
I0304 18:17:31.242856       1 node_register.go:83] Skipping healthz server because HTTP endpoint is set to: ""
I0304 18:17:31.908776       1 main.go:80] Received GetInfo call: &InfoRequest{}
I0304 18:17:31.958822       1 main.go:90] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}

After rebooting the node through the Hetzner web interface everything works (but a reboot from the Linux console doesn't help).

nmass-betpawa avatar Mar 04 '22 18:03 nmass-betpawa

Hetzner said it's a known error :)

scasei avatar Mar 07 '22 14:03 scasei

Can we assume that this mounting error is not csi-driver related but a general volume attach problem of Hetzner Cloud? Has anyone tried, when the csi-driver is unable to attach the volume, to attach the volume manually via the Hetzner Console or the hcloud CLI tool?

choffmeister avatar Mar 07 '22 14:03 choffmeister

Has anyone tried, when the csi-driver is unable to attach the volume, to attach the volume manually via the Hetzner Console

Yes, tried, works

scasei avatar Mar 07 '22 15:03 scasei

Has anyone tried, when the csi-driver is unable to attach the volume, to attach the volume manually via the Hetzner Console

Yes, tried, works

@scasei Via the hcloud CLI tool as well? The CLI tool (as it also talks to the API) should do what the csi-driver does, while changes via the web console might work differently internally. I just want to find out whether the issue is in the API itself or in how the csi-driver uses the API.
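For anyone who wants to try this, a rough sketch of the manual attach with the hcloud CLI (volume and server names are placeholders):

hcloud volume list                                  # find the volume backing the stuck PVC
hcloud volume detach my-volume                      # only needed if it is still attached somewhere
hcloud volume attach my-volume --server my-server   # attach it to the node the pod is scheduled on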

choffmeister avatar Mar 07 '22 15:03 choffmeister

@choffmeister Only via console.hetzner.cloud. I had to detach all volumes from the VM, restart the server, and attach them manually again.

scasei avatar Mar 07 '22 16:03 scasei

Sometimes the node gets turned off after a failed volume mount; from the logs it looks like a power loss, with no error in syslog at all. So I need to go to the Hetzner web console and turn it on manually. Maybe it is indeed somehow related to virtualization errors. But this volume mount problem happens very often.
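For what it's worth, turning the node back on does not strictly require the web console; a small sketch with the hcloud CLI (the server name is a placeholder):

hcloud server list                   # shows the server status (off/running)
hcloud server poweron <server-name>  # power the node back on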

nmass-betpawa avatar Mar 16 '22 16:03 nmass-betpawa

Seems to be the same problem as #6, which shouldn't have been closed.

ThomasSteinbach avatar Mar 31 '22 21:03 ThomasSteinbach

We experience the same issue on our clusters. Are there any updates to fix this issue? Rebooting the nodes is only a workaround.

dcardellino avatar May 03 '22 08:05 dcardellino

I also get random errors when trying to mount Hetzner volumes. Only reboots work around it. This situation is very critical for us.

vanlueckn avatar May 03 '22 14:05 vanlueckn

Found another workaround (a sketch of the commands follows below):

  • Delete the volumeattachments.storage.k8s.io objects that are not attached and that match the PVC which can't be attached to your pod
  • Delete the related pod and wait for it to be recreated.

This fixed the issue for me temporarily.
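A hedged sketch of those two steps with kubectl (the attachment and pod names are placeholders; use the ones from your cluster):

kubectl get volumeattachments                 # find the attachment referencing the stuck PV (ATTACHED=false)
kubectl delete volumeattachment <attachment>  # the name reported by the previous command
kubectl delete pod <stuck-pod>                # the controller recreates it and retries the attach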

dcardellino avatar May 03 '22 14:05 dcardellino

Or try killing/deleting the csi-node pods, wait until they start again, then kill/delete the pod that has the mount error.
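A sketch of that variant, assuming the node plugin runs as the hcloud-csi-node DaemonSet in kube-system (the pod name below is taken from the logs earlier in this thread and will differ in your cluster):

kubectl -n kube-system get pods -o wide | grep hcloud-csi-node   # find the node plugin pod on the affected node
kubectl -n kube-system delete pod hcloud-csi-node-q85tq          # the DaemonSet recreates it
kubectl delete pod <stuck-pod>                                   # then retry the pod with the mount error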

scasei avatar May 03 '22 14:05 scasei

none of the soft workarounds seem to work for me :(

bpesics avatar Jun 28 '22 09:06 bpesics

Lots of workarounds here, but no fixes. I'm also getting this issue.

cristianrat avatar Aug 18 '22 19:08 cristianrat

My issue was related to using RKE2; after switching to a different flavour (I used KubeOne to set it up for me), I have no more issues.

cristianrat avatar Sep 03 '22 15:09 cristianrat

This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.

github-actions[bot] avatar Dec 03 '22 12:12 github-actions[bot]

I would like to take a look at this, but I will need some debug logs. It would be great if you could activate the debug logs and then send me the sections where this bug happens to my email: julian.toelle <at> hetzner-cloud.de
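A hedged sketch of how the debug logs could be turned on, assuming the driver containers read a LOG_LEVEL environment variable and the workload names match the manifests used earlier in this thread (check the csi-driver documentation for your version):

kubectl -n kube-system set env statefulset/hcloud-csi-controller -c hcloud-csi-driver LOG_LEVEL=debug
kubectl -n kube-system set env daemonset/hcloud-csi-node -c hcloud-csi-driver LOG_LEVEL=debug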

apricote avatar Dec 15 '22 09:12 apricote

This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.

github-actions[bot] avatar Apr 18 '23 12:04 github-actions[bot]

@apricote thanks for offering to help!

We have similar issues. Sometimes volumes take forever to mount to the pod, which results in startup times of 10+ minutes:

  Normal   Scheduled    11m                    default-scheduler  Successfully assigned pod/pod-54dc234c3-t3t54 to node-name
  Warning  FailedMount  5m20s (x2 over 7m34s)  kubelet            Unable to attach or mount volumes: unmounted volumes=[pod-cache-volume], unattached volumes=[pod-config-volume pod-cache-volume kube-api-access-t441t]: timed out waiting for the condition
  Warning  FailedMount  51s (x3 over 9m49s)    kubelet            Unable to attach or mount volumes: unmounted volumes=[pod-cache-volume], unattached volumes=[pod-cache-volume kube-api-access-t441t pod-config-volume]: timed out waiting for the condition
  Normal   Pulling      5s                     kubelet            Pulling image "image"
  Normal   Pulled       4s                     kubelet            Successfully pulled image "image" in 570.451336ms (570.476615ms including waiting)
  Normal   Created      4s                     kubelet            Created container pod
  Normal   Started      4s                     kubelet            Started container pod

The volume cannot be mounted, but according to the Hetzner CSI controller the disk is attached, opened (LUKS), and ready.
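A hedged sketch of how that state can be double-checked directly on the affected node (the volume ID is a placeholder matching the redacted logs below):

ls -l /dev/disk/by-id/ | grep HC_Volume        # the attached Hetzner volume should show up here
cryptsetup status scsi-0HC_Volume_123456789    # the opened LUKS mapping reported by the driver
findmnt /dev/mapper/scsi-0HC_Volume_123456789  # check whether it is actually mounted anywhere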

Sometimes we also get errors stating that the server is locked.

  Warning  FailedAttachVolume      2m59s (x2 over 3m1s)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-233f32ff-78cd-c884-234c-f9898a9f8989" : rpc error: code = Unavailable desc = failed to publish volume: server is locked
  Normal   SuccessfulAttachVolume  2m58s                 attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-233f32ff-78cd-c884-234c-f9898a9f8989"
  Warning  FailedMount             40s (x2 over 2m54s)   kubelet                  Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[data kube-api-access-t441t]: timed out waiting for the condition

This is very annoying, and we would be very happy to help find the issue!

We already had a look into our debug logs and were not able to find any issues there. These logs are generated close to pod scheduling:

hcloud-csi-driver level=debug component=grpc-server msg="handling request" method=/csi.v1.Node/NodeUnpublishVolume req="volume_id:\"123456789\" target_path:\"/var/lib/kubelet/pods/podID/volumes/kubernetes.io~csi/pvcID/mount\" "
hcloud-csi-driver level=info component=linux-mount-service msg="unpublishing volume" target-path=/var/lib/kubelet/pods/podID/volumes/kubernetes.io~csi/pvcID/mount device-path=/dev/mapper/scsi-0HC_Volume_123456789
hcloud-csi-driver 1 mount_helper_common.go:133] Warning: "/var/lib/kubelet/pods/podID/volumes/kubernetes.io~csi/pvcID/mount" is not a mountpoint, deleting
hcloud-csi-driver level=info component=linux-mount-service msg="closing LUKS device" luksDeviceName=scsi-0HC_Volume_123456789
hcloud-csi-driver level=debug component=grpc-server msg="finished handling request" method=/csi.v1.Node/NodeUnpublishVolume err=null
hcloud-csi-driver level=debug grpc-server msg="handling request" method=/csi.v1.Node/NodePublishVolume req="volume_id:\"123456789\" publish_context:<key:\"devicePath\" value:\"/dev/disk/by-id/scsi-0HC_Volume_123456789\" > target_path:\"/var/lib/kubelet/pods/podID/volumes/kubernetes.io~csi/pvcID/mount\" volume_capability:<mount:<fs_type:\"ext4\" > access_mode:<mode:SINGLE_NODE_WRITER > > secrets:<key:\"encryption-passphrase\" value:\"XXXXXXXXXXXXXX\" > volume_context:<key:\"csi.storage.k8s.io/ephemeral\" value:\"false\" > volume_context:<key:\"csi.storage.k8s.io/pod.name\" value:\"podName\" > volume_context:<key:\"csi.storage.k8s.io/pod.namespace\" value:\"namespace\" > volume_context:<key:\"csi.storage.k8s.io/pod.uid\" value:\"podID\" > volume_context:<key:\"csi.storage.k8s.io/serviceAccount.name\" value:\"serviceAccountName\" > volume_context:<key:\"storage.kubernetes.io/csiProvisionerIdentity\" value:\"id-csi.hetzner.cloud\" > "
hcloud-csi-driver level=info linux-mount-service msg="opening LUKS device" devicePath=/dev/disk/by-id/scsi-0HC_Volume_123456789 luksDeviceName=scsi-0HC_Volume_123456789
hcloud-csi-driver level=info linux-mount-service msg="publishing volume" target-path=/var/lib/kubelet/pods/podID/volumes/kubernetes.io~csi/pvcID/mount device-path=/dev/mapper/scsi-0HC_Volume_123456789 fs-type=ext4 block-volume=false readonly=false mount-options= encrypted=true
hcloud-csi-driver level=debug grpc-server msg="finished handling request" method=/csi.v1.Node/NodePublishVolume err=null

Then we see nothing for a long time (that's probably the time during which the pod isn't able to mount the volume). At some point, when the volume was successfully mounted, we see these logs:

hcloud-csi-node-j5vzl hcloud-csi-driver level=debug component=grpc-server msg="handling request" method=/csi.v1.Node/NodeGetVolumeStats req="volume_id:\"0HC_Volume_123456789\" volume_path:\"/var/lib/kubelet/pods/podID/volumes/kubernetes.io~csi/pvcID/mount\" "
hcloud-csi-node-j5vzl hcloud-csi-driver level=debug component=linux-mount-service msg="checking path existence" path=/var/lib/kubelet/pods/podID/volumes/kubernetes.io~csi/pvcID/mount
hcloud-csi-node-j5vzl hcloud-csi-driver level=debug component=grpc-server msg="finished handling request" method=/csi.v1.Node/NodeGetVolumeStats err=null

We don't know what changed or what made the volume suddenly mountable.

ArchdukeNavaron avatar Apr 26 '23 12:04 ArchdukeNavaron

This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.

github-actions[bot] avatar Jul 25 '23 12:07 github-actions[bot]