rbd-nbd: occasionally file system entering into read-only on nodeplugin restart
Requires the below PR from Ceph:
- [ ] https://github.com/ceph/ceph/pull/42609
STEP: restart rbd-nbd process on nodeplugin and continue IO after nodeplugin restart
Jun 23 15:02:27.261: INFO: Running '/usr/bin/kubectl --server=https://192.168.39.63:8443 --kubeconfig=/root/.kube/config --namespace=cephcsi-e2e-aec55366 --namespace=cephcsi-e2e-aec55366 delete -f -'
Jun 23 15:02:27.387: INFO: stderr: "warning: deleting cluster-scoped resources, not scoped to the provided namespace\n"
Jun 23 15:02:27.387: INFO: stdout: "storageclass.storage.k8s.io \"csi-rbd-sc\" deleted\n"
Jun 23 15:02:27.394: INFO: ExecWithOptions {Command:[/bin/sh -c ceph fsid] Namespace:rook-ceph PodName:rook-ceph-tools-7467d8bf8-vjzwh ContainerName:rook-ceph-tools Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Jun 23 15:02:27.394: INFO: >>> kubeConfig: /root/.kube/config
Jun 23 15:02:29.405: INFO: Waiting up to &PersistentVolumeClaim{ObjectMeta:{rbd-pvc rbd-3274 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:PersistentVolumeClaimSpec{AccessModes:[ReadWriteOnce],Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{storage: {{1073741824 0} {<nil>} 1Gi BinarySI},},},VolumeName:,Selector:nil,StorageClassName:*csi-rbd-sc,VolumeMode:nil,DataSource:nil,},Status:PersistentVolumeClaimStatus{Phase:,AccessModes:[],Capacity:ResourceList{},Conditions:[]PersistentVolumeClaimCondition{},},} to be in Bound state
Jun 23 15:02:29.406: INFO: waiting for PVC rbd-pvc (0 seconds elapsed)
Jun 23 15:02:31.409: INFO: waiting for PVC rbd-pvc (2 seconds elapsed)
Jun 23 15:02:31.415: INFO: Waiting for PV pvc-42f62cbf-9e2f-4f7a-99e1-98768ebfb561 to bind to PVC rbd-pvc
Jun 23 15:02:31.415: INFO: Waiting up to 3m0s for PersistentVolumeClaims [rbd-pvc] to have phase Bound
Jun 23 15:02:31.418: INFO: PersistentVolumeClaim rbd-pvc found and phase=Bound (2.553576ms)
Jun 23 15:02:31.418: INFO: Waiting up to 3m0s for PersistentVolume pvc-42f62cbf-9e2f-4f7a-99e1-98768ebfb561 to have phase Bound
Jun 23 15:02:31.421: INFO: PersistentVolume pvc-42f62cbf-9e2f-4f7a-99e1-98768ebfb561 found and phase=Bound (2.800294ms)
Jun 23 15:02:31.434: INFO: Waiting up to csi-rbd-demo-pod to be in Running state
Jun 23 15:02:31.451: INFO: csi-rbd-demo-pod app is in Pending phase expected to be in Running state (0 seconds elapsed)
Jun 23 15:02:33.460: INFO: csi-rbd-demo-pod app is in Pending phase expected to be in Running state (2 seconds elapsed)
Jun 23 15:02:35.455: INFO: csi-rbd-demo-pod app is in Pending phase expected to be in Running state (4 seconds elapsed)
Jun 23 15:02:37.456: INFO: csi-rbd-demo-pod app is in Pending phase expected to be in Running state (6 seconds elapsed)
Jun 23 15:02:39.455: INFO: csi-rbd-demo-pod app is in Pending phase expected to be in Running state (8 seconds elapsed)
Jun 23 15:02:41.456: INFO: csi-rbd-demo-pod app is in Pending phase expected to be in Running state (10 seconds elapsed)
Jun 23 15:02:43.462: INFO: ExecWithOptions {Command:[/bin/sh -c rbd ls --format=json --pool=replicapool] Namespace:rook-ceph PodName:rook-ceph-tools-7467d8bf8-vjzwh ContainerName:rook-ceph-tools Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Jun 23 15:02:43.462: INFO: >>> kubeConfig: /root/.kube/config
Jun 23 15:02:44.297: INFO: LabelSelector for csi-rbdplugin daemonsets in namespace cephcsi-e2e-aec55366: app=csi-rbdplugin
Jun 23 15:02:44.301: INFO: ExecWithOptions {Command:[/bin/sh -c uname -a] Namespace:cephcsi-e2e-aec55366 PodName:csi-rbdplugin-bvszc ContainerName:csi-rbdplugin Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Jun 23 15:02:44.301: INFO: >>> kubeConfig: /root/.kube/config
Jun 23 15:02:44.497: INFO: uname -a: Linux minikube 4.19.182 #1 SMP Wed Jun 9 00:54:54 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Jun 23 15:02:44.504: INFO: ExecWithOptions {Command:[/bin/sh -c rpm -qa | grep rbd-nbd] Namespace:cephcsi-e2e-aec55366 PodName:csi-rbdplugin-bvszc ContainerName:csi-rbdplugin Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Jun 23 15:02:44.504: INFO: >>> kubeConfig: /root/.kube/config
Jun 23 15:02:45.245: INFO: rbd-nbd package version: rbd-nbd-16.2.4-0.el8.x86_64
Jun 23 15:02:45.250: INFO: ExecWithOptions {Command:[/bin/sh -c ps -eo 'cmd' | grep [r]bd-nbd] Namespace:cephcsi-e2e-aec55366 PodName:csi-rbdplugin-bvszc ContainerName:csi-rbdplugin Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Jun 23 15:02:45.250: INFO: >>> kubeConfig: /root/.kube/config
Jun 23 15:02:45.485: INFO: map command running before restart, mapCmd: /usr/bin/rbd-nbd --id cephcsi-rbd-node -m rook-ceph-mon-a.rook-ceph.svc.cluster.local:6789 --keyfile=/tmp/csi/keys/keyfile-113061691 map replicapool/csi-vol-a5bccd9c-d42b-11eb-991e-4664e1c21ce8 --try-netlink --reattach-timeout=180
Jun 23 15:02:45.495: INFO: ExecWithOptions {Command:[/bin/sh -c ceph auth get-key client.cephcsi-rbd-node] Namespace:rook-ceph PodName:rook-ceph-tools-7467d8bf8-vjzwh ContainerName:rook-ceph-tools Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Jun 23 15:02:45.495: INFO: >>> kubeConfig: /root/.kube/config
Jun 23 15:02:47.629: INFO: Running '/usr/bin/kubectl --server=https://192.168.39.63:8443 --kubeconfig=/root/.kube/config --namespace=cephcsi-e2e-aec55366 delete po -l app=csi-rbdplugin --ignore-not-found=false'
Jun 23 15:02:57.864: INFO: stderr: ""
Jun 23 15:02:57.864: INFO: stdout: "pod \"csi-rbdplugin-bvszc\" deleted\n"
Jun 23 15:02:57.864: INFO: Waiting up to 10m0s for all daemonsets in namespace 'cephcsi-e2e-aec55366' to start
Jun 23 15:02:57.884: INFO: 0 / 1 pods ready in namespace 'cephcsi-e2e-aec55366' in daemonset 'csi-rbdplugin' (0 seconds elapsed)
Jun 23 15:02:59.888: INFO: 0 / 1 pods ready in namespace 'cephcsi-e2e-aec55366' in daemonset 'csi-rbdplugin' (2 seconds elapsed)
Jun 23 15:03:01.889: INFO: 1 / 1 pods ready in namespace 'cephcsi-e2e-aec55366' in daemonset 'csi-rbdplugin' (4 seconds elapsed)
Jun 23 15:03:01.889: INFO: attach command to run after restart, attachCmd: /usr/bin/rbd-nbd --id cephcsi-rbd-node -m rook-ceph-mon-a.rook-ceph.svc.cluster.local:6789 --keyfile=/tmp/csi/keys/keyfile-test attach --device /dev/nbd0 replicapool/csi-vol-a5bccd9c-d42b-11eb-991e-4664e1c21ce8 --try-netlink --reattach-timeout=180
Jun 23 15:03:01.893: INFO: ExecWithOptions {Command:[/bin/sh -c echo AQD+PdNgc4hbCRAAIquzYZxtb8XiMs9fGYO/BA== > /tmp/csi/keys/keyfile-test] Namespace:cephcsi-e2e-aec55366 PodName:csi-rbdplugin-8kqnl ContainerName:csi-rbdplugin Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Jun 23 15:03:01.893: INFO: >>> kubeConfig: /root/.kube/config
Jun 23 15:03:02.072: INFO: ExecWithOptions {Command:[/bin/sh -c /usr/bin/rbd-nbd --id cephcsi-rbd-node -m rook-ceph-mon-a.rook-ceph.svc.cluster.local:6789 --keyfile=/tmp/csi/keys/keyfile-test attach --device /dev/nbd0 replicapool/csi-vol-a5bccd9c-d42b-11eb-991e-4664e1c21ce8 --try-netlink --reattach-timeout=180
] Namespace:cephcsi-e2e-aec55366 PodName:csi-rbdplugin-8kqnl ContainerName:csi-rbdplugin Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Jun 23 15:03:02.072: INFO: >>> kubeConfig: /root/.kube/config
Jun 23 15:03:02.426: INFO: ExecWithOptions {Command:[/bin/sh -c ps -eo 'cmd' | grep [r]bd-nbd] Namespace:cephcsi-e2e-aec55366 PodName:csi-rbdplugin-8kqnl ContainerName:csi-rbdplugin Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Jun 23 15:03:02.426: INFO: >>> kubeConfig: /root/.kube/config
Jun 23 15:03:02.638: INFO: attach command running after restart, runningAttachCmd: /usr/bin/rbd-nbd --id cephcsi-rbd-node -m rook-ceph-mon-a.rook-ceph.svc.cluster.local:6789 --keyfile=/tmp/csi/keys/keyfile-test attach --device /dev/nbd0 replicapool/csi-vol-a5bccd9c-d42b-11eb-991e-4664e1c21ce8 --try-netlink --reattach-timeout=180
Jun 23 15:03:02.643: INFO: ExecWithOptions {Command:[/bin/sh -c echo 'Hello World' > /var/lib/www/html/test] Namespace:rbd-3274 PodName:csi-rbd-demo-pod ContainerName:web-server Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Jun 23 15:03:02.643: INFO: >>> kubeConfig: /root/.kube/config
Jun 23 15:03:18.124: INFO: stdErr occurred: /bin/sh: 1: cannot create /var/lib/www/html/test: Read-only file system
Jun 23 15:03:18.125: FAIL: failed to write IO, err: command terminated with exit code 2, stdErr: /bin/sh: 1: cannot create /var/lib/www/html/test: Read-only file system
Failed at https://github.com/ceph/ceph-csi/pull/2095. Logs at https://jenkins-ceph-csi.apps.ocp.ci.centos.org/blue/rest/organizations/jenkins/pipelines/mini-e2e_k8s-1.20/runs/1270/nodes/87/steps/90/log/?start=0
@pkalever PTAL
Ack! Thanks for reporting, Madhu. I have also encountered this before, but it looks like it's not hit very frequently.
I will investigate the root cause and get back.
PS: once the volume-healer patches are merged, we probably won't see this!
Sent a PR to rbd-nbd: https://github.com/ceph/ceph/pull/42609
The problem was hit in https://github.com/ceph/ceph-csi/pull/2341#issuecomment-892565234 too.
The problem was hit again in https://github.com/ceph/ceph-csi/pull/2322#issuecomment-893155681.
The problem was hit again in #2322 (comment).
[Copying comment from the other thread, for context:]
/test ci/centos/mini-e2e/k8s-1.21
Aug 4 20:23:22.683: INFO: >>> kubeConfig: /root/.kube/config
Aug 4 20:23:23.708: INFO: stdErr occurred: /bin/sh: 1: cannot create /var/lib/www/html/test: Read-only file system
Aug 4 20:23:23.708: FAIL: failed to write IO, err: command terminated with exit code 2, stdErr: /bin/sh: 1: cannot create /var/lib/www/html/test: Read-only file system
https://jenkins-ceph-csi.apps.ocp.ci.centos.org/blue/rest/organizations/jenkins/pipelines/mini-e2e_k8s-1.21/runs/624/nodes/87/steps/90/log/?start=0
I see we are hitting Read-only even after sync is run successfully:
Aug 4 20:22:52.334: INFO: ExecWithOptions {Command:[/bin/sh -c sync /var/lib/www/html] Namespace:rbd-694 PodName:csi-rbd-demo-pod ContainerName:web-server Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Aug 4 20:22:52.515: INFO: ExecWithOptions {Command:[/bin/sh -c rbd ls --format=json --pool=replicapool] Namespace:rook-ceph PodName:rook-ceph-tools-7467d8bf8-9dm44 ContainerName:rook-ceph-tools Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Aug 4 20:22:53.344: INFO: LabelSelector for csi-rbdplugin daemonsets in namespace cephcsi-e2e-b3b077f5: app=csi-rbdplugin
Aug 4 20:22:53.344: INFO: Running '/usr/bin/kubectl --server=https://192.168.39.165:8443 --kubeconfig=/root/.kube/config --namespace=cephcsi-e2e-b3b077f5 delete po -l app=csi-rbdplugin --ignore-not-found=false'
Aug 4 20:22:55.265: INFO: stderr: ""
Aug 4 20:22:55.265: INFO: stdout: "pod \"csi-rbdplugin-bwwjj\" deleted\n"
Aug 4 20:22:55.265: INFO: Waiting up to 10m0s for all daemonsets in namespace 'cephcsi-e2e-b3b077f5' to start
Aug 4 20:22:55.275: INFO: 0 / 1 pods ready in namespace 'cephcsi-e2e-b3b077f5' in daemonset 'csi-rbdplugin' (0 seconds elapsed)
Aug 4 20:22:57.280: INFO: 0 / 1 pods ready in namespace 'cephcsi-e2e-b3b077f5' in daemonset 'csi-rbdplugin' (2 seconds elapsed)
Aug 4 20:22:59.279: INFO: 1 / 1 pods ready in namespace 'cephcsi-e2e-b3b077f5' in daemonset 'csi-rbdplugin' (4 seconds elapsed)
Aug 4 20:22:59.283: INFO: ExecWithOptions {Command:[/bin/sh -c uname -a] Namespace:cephcsi-e2e-b3b077f5 PodName:csi-rbdplugin-dsnxq ContainerName:csi-rbdplugin Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Aug 4 20:22:59.479: INFO: uname -a: Linux minikube 4.19.182 #1 SMP Fri Jul 2 00:45:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Aug 4 20:22:59.486: INFO: ExecWithOptions {Command:[/bin/sh -c rpm -qa | grep rbd-nbd] Namespace:cephcsi-e2e-b3b077f5 PodName:csi-rbdplugin-dsnxq ContainerName:csi-rbdplugin Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Aug 4 20:23:00.247: INFO: rbd-nbd package version: rbd-nbd-16.2.5-0.el8.x86_64
Aug 4 20:23:00.458: INFO: rbd-nbd process is not running yet: command terminated with exit code 1
Aug 4 20:23:02.463: INFO: ExecWithOptions {Command:[/bin/sh -c ps -eo 'cmd' | grep [r]bd-nbd] Namespace:cephcsi-e2e-b3b077f5 PodName:csi-rbdplugin-dsnxq ContainerName:csi-rbdplugin Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
[...]
Aug 4 20:23:22.463: INFO: ExecWithOptions {Command:[/bin/sh -c ps -eo 'cmd' | grep [r]bd-nbd] Namespace:cephcsi-e2e-b3b077f5 PodName:csi-rbdplugin-dsnxq ContainerName:csi-rbdplugin Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Aug 4 20:23:22.677: INFO: attach command running after restart, runningAttachCmd: rbd-nbd --id cephcsi-rbd-node -m rook-ceph-mon-a.rook-ceph.svc.cluster.local:6789 --keyfile=/tmp/csi/keys/keyfile-667049010 attach replicapool/csi-vol-4fb43356-f559-11eb-8647-566113bf90a7 --device /dev/nbd0 --try-netlink --reattach-timeout=300
Aug 4 20:23:22.682: INFO: ExecWithOptions {Command:[/bin/sh -c echo 'Hello World' > /var/lib/www/html/test] Namespace:rbd-694 PodName:csi-rbd-demo-pod ContainerName:web-server Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Aug 4 20:23:23.708: INFO: stdErr occurred: /bin/sh: 1: cannot create /var/lib/www/html/test: Read-only file system
Aug 4 20:23:23.708: FAIL: failed to write IO, err: command terminated with exit code 2, stdErr: /bin/sh: 1: cannot create /var/lib/www/html/test: Read-only file system
@idryomov @trociny @nixpanic @lxbsz
I had suspected the image/connection closing in rbd-nbd as part of SIGTERM handling, but after my tests I can see that the image, the IO context, and the connection to RADOS are closed properly by the signal-handler/detach/unmap path, in the below order:
image.close();
io_ctx.close();
rados.shutdown();
Would it be wise to disable rbd_cache like tcmu-runner does (see https://github.com/open-iscsi/tcmu-runner/blob/master/rbd.c#L589) and check whether that solves it for us? We can just pass `--rbd-cache=false` to the rbd-nbd CLI as an option, as sketched below.
Any other suggestions or thoughts?
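For illustration, a sketch only of what such an invocation might look like, based on the map command captured in the logs above (the image name is simply the one from this run, and `--rbd-cache=false` is assumed to be passed as a librbd config override on the command line):

```sh
# Sketch of the proposal above: the map command from the logs, with the
# librbd cache disabled via a command-line config override.
/usr/bin/rbd-nbd --id cephcsi-rbd-node \
  -m rook-ceph-mon-a.rook-ceph.svc.cluster.local:6789 \
  --rbd-cache=false \
  map replicapool/csi-vol-a5bccd9c-d42b-11eb-991e-4664e1c21ce8 \
  --try-netlink --reattach-timeout=180
```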
@pkalever Actually I was pessimistic about your "sync" patch because it does not help much in the general case (you can't guarantee an application or fs won't do more IO after sync), I don't see why we actually need to do this, and I don't understand what you meant by "rbd-nbd can sync data" in the comment.
Disabling the rbd cache does not look useful here either. When you run sync and it reaches rbd, it will properly flush the cache before returning from the sync call.
Anyway, we still don't seem to have a clear understanding and are only guessing at what is going on here. The dmesg and rbd-nbd logs could be very useful (especially if rbd-nbd is run with debug-rbd=20).
Looking at the log, I suppose the detach was run after sync, i.e. at 20:22:52:
Aug 4 20:22:52.334: INFO: ExecWithOptions {Command:[/bin/sh -c sync /var/lib/www/html] Namespace:rbd-694 PodName:csi-rbd-demo-pod ContainerName:web-server Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
And then the attach was at about 20:23:22:
Aug 4 20:23:22.677: INFO: attach command running after restart, runningAttachCmd: rbd-nbd --id cephcsi-rbd-node -m rook-ceph-mon-a.rook-ceph.svc.cluster.local:6789 --keyfile=/tmp/csi/keys/keyfile-667049010 attach replicapool/csi-vol-4fb43356-f559-11eb-8647-566113bf90a7 --device /dev/nbd0 --try-netlink --reattach-timeout=300
I.e. it is exactly 30 sec, which is exactly the default rbd-nbd `--io-timeout`. So if there was some pending IO (waiting for reattach), it could hit this timeout and generate the error. It seems there is not much sense in having `--reattach-timeout` higher than `--io-timeout`, and I would recommend using the same value for both, as in the sketch below.
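For example, a sketch of a map command with the two timeouts aligned (the values are illustrative and the image name is taken from the log above):

```sh
# Sketch: align --io-timeout with --reattach-timeout, as recommended above.
rbd-nbd map replicapool/csi-vol-4fb43356-f559-11eb-8647-566113bf90a7 \
  --try-netlink --reattach-timeout 300 --io-timeout 300
```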
Note, as I said it is only a guess about what was going on there.
Hello @trociny
Thanks for looking at this; I much appreciate all the help.
> @pkalever Actually I was pessimistic about your "sync" patch because it does not help much in the general case (you can't guarantee an application or fs won't do more IO after sync), I don't see why we actually need to do this, and I don't understand what you meant by "rbd-nbd can sync data" in the comment.
The `sync` command was an e2e-test-only fix to temporarily avoid CI failures for a known issue. In this specific case, the application pod is running nginx and we are very sure that there won't be any IO from the nginx web server pod after the `sync` command is run.
By "rbd-nbd can sync data" I'm probably referring to the changes like quiesce on SIGTERM in rbd-nbd itself.
> Disabling the rbd cache does not look useful here either. When you run sync and it reaches rbd, it will properly flush the cache before returning from the sync call.
Again, as mentioned, `sync` was sent for this specific test case only. What about the general use cases: will disabling rbd_cache benefit data consistency in case of an ungraceful kill of the rbd-nbd process?
I have looked at the rbd-nbd code and I see that even if we disable rbd_cache, NBD_CMD_FLUSH can still perform the flush on the backend image (this must be intentional). But wait_inflight_io will not go through as part of the quiesce.
> Anyway, we still don't seem to have a clear understanding and are only guessing at what is going on here. The dmesg and rbd-nbd logs could be very useful (especially if rbd-nbd is run with debug-rbd=20).
It's very unfortunate that CI at ceph-csi doesn't capture the ceph logs yet, and this is something that I'm planning to fix and hopefully, we will hit this once we have the logging in place.
> Looking at the log, I suppose the detach was run after sync, i.e. at 20:22:52:
>
> Aug 4 20:22:52.334: INFO: ExecWithOptions {Command:[/bin/sh -c sync /var/lib/www/html] Namespace:rbd-694 PodName:csi-rbd-demo-pod ContainerName:web-server Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
>
> And then the attach was at about 20:23:22:
>
> Aug 4 20:23:22.677: INFO: attach command running after restart, runningAttachCmd: rbd-nbd --id cephcsi-rbd-node -m rook-ceph-mon-a.rook-ceph.svc.cluster.local:6789 --keyfile=/tmp/csi/keys/keyfile-667049010 attach replicapool/csi-vol-4fb43356-f559-11eb-8647-566113bf90a7 --device /dev/nbd0 --try-netlink --reattach-timeout=300
>
> I.e. it is exactly 30 sec, which is exactly the default rbd-nbd `--io-timeout`. So if there was some pending IO (waiting for reattach), it could hit this timeout and generate the error. It seems there is not much sense in having `--reattach-timeout` higher than `--io-timeout`, and I would recommend using the same value for both.
Good catch! Yes, that is a possibility.
Maybe we can improve on this in rbd-nbd: if reattach-timeout is tweaked and io-timeout is not provided on the CLI, we could by default adjust io-timeout >= reattach-timeout, or at least throw a warning. I will send the changes later.
> Note, as I said it is only a guess about what was going on there.
Yes, noted! But it makes a lot of sense to keep io-timeout >= reattach-timeout. There is nothing we can debug further, as the logs are missing now.
Thanks again!
> I.e. it is exactly 30 sec, which is exactly the default rbd-nbd `--io-timeout`.

@trociny What sets it? We don't have a default value for the `--io-timeout` option in rbd-nbd (i.e. not specifying `--io-timeout` is interpreted as no/infinite timeout) and the kernel doesn't seem to be setting it either. The kernel would default it to 30 seconds if rbd-nbd set `NBD_ATTR_TIMEOUT` to 0, but we don't do that. Am I missing something?
Or has the kernel behaviour changed across releases?
> > I.e. it is exactly 30 sec, which is exactly the default rbd-nbd `--io-timeout`.
>
> @trociny What sets it? We don't have a default value for the `--io-timeout` option in rbd-nbd (i.e. not specifying `--io-timeout` is interpreted as no/infinite timeout) and the kernel doesn't seem to be setting it either. The kernel would default it to 30 seconds if rbd-nbd set `NBD_ATTR_TIMEOUT` to 0, but we don't do that. Am I missing something?
Honestly, I don't know. This is what I recall from cases reported in the past: when the timeout was not explicitly set via the option (or when we did not have this option), the kernel used a 30 sec timeout. But my memory is vague here.
And currently we have this in the rbd-nbd man [1]:
> Override device timeout. Linux kernel will default to a 30 second request timeout. Allow the user to optionally specify an alternate timeout.
If this is wrong it would be nice to update the man.
[1] https://docs.ceph.com/en/latest/man/8/rbd-nbd/
> I have looked at the rbd-nbd code and I see that even if we disable rbd_cache, NBD_CMD_FLUSH can still perform the flush on the backend image (this must be intentional).
Yes, because even without the cache you may still have inflight IO, and the flush just guarantees the caller that all write IO issued before the flush is complete at the moment the flush returns.
> > I.e. it is exactly 30 sec, which is exactly the default rbd-nbd `--io-timeout`.
>
> @trociny What sets it? We don't have a default value for the `--io-timeout` option in rbd-nbd (i.e. not specifying `--io-timeout` is interpreted as no/infinite timeout) and the kernel doesn't seem to be setting it either. The kernel would default it to 30 seconds if rbd-nbd set `NBD_ATTR_TIMEOUT` to 0, but we don't do that. Am I missing something?
Hello @idryomov and @trociny
I'm on the latest master.
[root@linux-vm1 build]# uname -a
Linux linux-vm1 5.14.0-rc3+ #1 SMP Fri Jul 30 12:51:10 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
[root@linux-vm1 build]# ./bin/rbd-nbd map rbd-pool/image0 --try-netlink
/dev/nbd0
[root@linux-vm1 build]# cat /sys/block/nbd0/queue/io_timeout
30000
[root@linux-vm1 build]# ./bin/rbd-nbd unmap /dev/nbd0
[root@linux-vm1 build]# ./bin/rbd-nbd map rbd-pool/image0 --try-netlink --io-timeout 60
/dev/nbd0
[root@linux-vm1 build]# cat /sys/block/nbd0/queue/io_timeout
60000
[root@linux-vm1 build]# ./bin/rbd-nbd unmap /dev/nbd0
Hence the default currently seems to be 30 seconds.
The block layer indeed sets it to 30 seconds by default, but I had a vague memory of nbd not following that, likely based on an email exchange with a user whose entire dmesg was filled with messages like:
block nbd2: Possible stuck request 000000002f3b7bec: control (write@9158656,4096B). Runtime 210 seconds
block nbd2: Possible stuck request 000000005b8accc5: control (write@9007104,4096B). Runtime 210 seconds
They didn't touch `--io-timeout` (I asked), but it looks like the kernel was Ubuntu 5.4, so quite old...
When I checked nbd.c yesterday, I got confused by the logic in `nbd_set_cmd_timeout()`, which repeats the block layer setting of 30 seconds but only when `NBD_ATTR_TIMEOUT` is specified. Apparently this is for the purpose of handling a reconfigure with `--io-timeout 0`, and for the general case the block layer default is now followed. Sorry for the noise!
@idryomov
There is a possible bug in this area: once we alter io-timeout (say for nbd0), sysfs will keep it persistent and use the same value for future mapped images associated with that blk device (nbd0). We might have hit it many times unknowingly: https://lkml.org/lkml/2021/8/6/713
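If the bug is as described, a minimal reproduction along the lines of the session posted earlier might look like this (device and image names are illustrative):

```console
[root@linux-vm1 build]# ./bin/rbd-nbd map rbd-pool/image0 --try-netlink --io-timeout 60
/dev/nbd0
[root@linux-vm1 build]# ./bin/rbd-nbd unmap /dev/nbd0
[root@linux-vm1 build]# ./bin/rbd-nbd map rbd-pool/image0 --try-netlink
/dev/nbd0
[root@linux-vm1 build]# cat /sys/block/nbd0/queue/io_timeout
60000
```

i.e. the second map, which did not pass `--io-timeout`, would inherit the 60-second value left behind by the first mapping instead of the 30-second block layer default.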
cc: @trociny @lxbsz @nixpanic
I think most people aren't even aware of `--io-timeout`, so the fact that it happens to be persisted isn't a big deal. Those who need a particular timeout would always specify it, overwriting the old value. But yeah, looks worth fixing to me (although I might have opted for always explicitly setting the timeout on connect rather than clearing it on disconnect).
STEP: perform IO on rbd-nbd volume after nodeplugin restart
Aug 10 07:52:11.087: INFO: waiting for kubectl (delete -f [] args %!s(MISSING)) to finish
Aug 10 07:52:11.087: INFO: Running '/usr/bin/kubectl --server=https://192.168.39.12:8443 --kubeconfig=/root/.kube/config --namespace=cephcsi-e2e-7a13656f delete -f -'
E0810 07:52:11.087500 79057 util.go:245] kernel 4.18.0-305.10.2.el8_4.x86_64 does not support required features
Aug 10 07:52:11.223: INFO: stderr: "warning: deleting cluster-scoped resources, not scoped to the provided namespace\n"
Aug 10 07:52:11.223: INFO: stdout: "storageclass.storage.k8s.io \"csi-rbd-sc\" deleted\n"
Aug 10 07:52:11.230: INFO: ExecWithOptions {Command:[/bin/sh -c ceph fsid] Namespace:rook-ceph PodName:rook-ceph-tools-7467d8bf8-c67p2 ContainerName:rook-ceph-tools Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Aug 10 07:52:11.230: INFO: >>> kubeConfig: /root/.kube/config
Aug 10 07:52:13.356: INFO: Waiting up to &PersistentVolumeClaim{ObjectMeta:{rbd-pvc rbd-8081 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:PersistentVolumeClaimSpec{AccessModes:[ReadWriteOnce],Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{storage: {{1073741824 0} {<nil>} 1Gi BinarySI},},},VolumeName:,Selector:nil,StorageClassName:*csi-rbd-sc,VolumeMode:nil,DataSource:nil,},Status:PersistentVolumeClaimStatus{Phase:,AccessModes:[],Capacity:ResourceList{},Conditions:[]PersistentVolumeClaimCondition{},},} to be in Bound state
Aug 10 07:52:13.357: INFO: waiting for PVC rbd-pvc (0 seconds elapsed)
Aug 10 07:52:15.365: INFO: waiting for PVC rbd-pvc (2 seconds elapsed)
Aug 10 07:52:15.372: INFO: Waiting for PV pvc-500ee85c-8154-4c2c-81bf-6362dbc02eb3 to bind to PVC rbd-pvc
Aug 10 07:52:15.372: INFO: Waiting up to timeout=10m0s for PersistentVolumeClaims [rbd-pvc] to have phase Bound
Aug 10 07:52:15.375: INFO: PersistentVolumeClaim rbd-pvc found and phase=Bound (2.699166ms)
Aug 10 07:52:15.375: INFO: Waiting up to 10m0s for PersistentVolume pvc-500ee85c-8154-4c2c-81bf-6362dbc02eb3 to have phase Bound
Aug 10 07:52:15.377: INFO: PersistentVolume pvc-500ee85c-8154-4c2c-81bf-6362dbc02eb3 found and phase=Bound (2.469261ms)
Aug 10 07:52:15.390: INFO: Waiting up to csi-rbd-demo-pod to be in Running state
Aug 10 07:52:27.399: INFO: ExecWithOptions {Command:[/bin/sh -c sync /var/lib/www/html] Namespace:rbd-8081 PodName:csi-rbd-demo-pod ContainerName:web-server Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Aug 10 07:52:27.399: INFO: >>> kubeConfig: /root/.kube/config
Aug 10 07:52:27.610: INFO: ExecWithOptions {Command:[/bin/sh -c rbd ls --format=json --pool=replicapool] Namespace:rook-ceph PodName:rook-ceph-tools-7467d8bf8-c67p2 ContainerName:rook-ceph-tools Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Aug 10 07:52:27.610: INFO: >>> kubeConfig: /root/.kube/config
Aug 10 07:52:28.520: INFO: LabelSelector for csi-rbdplugin daemonsets in namespace cephcsi-e2e-7a13656f: app=csi-rbdplugin
Aug 10 07:52:28.520: INFO: waiting for kubectl ([delete po -l app=csi-rbdplugin --ignore-not-found=false] args) to finish
Aug 10 07:52:28.520: INFO: Running '/usr/bin/kubectl --server=https://192.168.39.12:8443 --kubeconfig=/root/.kube/config --namespace=cephcsi-e2e-7a13656f delete po -l app=csi-rbdplugin --ignore-not-found=false'
Aug 10 07:52:30.259: INFO: stderr: ""
Aug 10 07:52:30.259: INFO: stdout: "pod \"csi-rbdplugin-26vkl\" deleted\n"
Aug 10 07:52:30.259: INFO: Waiting up to 10m0s for all daemonsets in namespace 'cephcsi-e2e-7a13656f' to start
Aug 10 07:52:30.262: INFO: 0 / 1 pods ready in namespace 'cephcsi-e2e-7a13656f' in daemonset 'csi-rbdplugin' (0 seconds elapsed)
Aug 10 07:52:32.267: INFO: 0 / 1 pods ready in namespace 'cephcsi-e2e-7a13656f' in daemonset 'csi-rbdplugin' (2 seconds elapsed)
Aug 10 07:52:34.266: INFO: 1 / 1 pods ready in namespace 'cephcsi-e2e-7a13656f' in daemonset 'csi-rbdplugin' (4 seconds elapsed)
Aug 10 07:52:34.270: INFO: ExecWithOptions {Command:[/bin/sh -c uname -a] Namespace:cephcsi-e2e-7a13656f PodName:csi-rbdplugin-dk84f ContainerName:csi-rbdplugin Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Aug 10 07:52:34.270: INFO: >>> kubeConfig: /root/.kube/config
Aug 10 07:52:34.454: INFO: uname -a: Linux minikube 4.19.182 #1 SMP Fri Jul 2 00:45:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Aug 10 07:52:34.461: INFO: ExecWithOptions {Command:[/bin/sh -c rpm -qa | grep rbd-nbd] Namespace:cephcsi-e2e-7a13656f PodName:csi-rbdplugin-dk84f ContainerName:csi-rbdplugin Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Aug 10 07:52:34.461: INFO: >>> kubeConfig: /root/.kube/config
Aug 10 07:52:35.183: INFO: rbd-nbd package version: rbd-nbd-16.2.5-0.el8.x86_64
Aug 10 07:52:35.189: INFO: ExecWithOptions {Command:[/bin/sh -c ps -eo 'cmd' | grep [r]bd-nbd] Namespace:cephcsi-e2e-7a13656f PodName:csi-rbdplugin-dk84f ContainerName:csi-rbdplugin Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Aug 10 07:52:35.189: INFO: >>> kubeConfig: /root/.kube/config
Aug 10 07:52:35.395: INFO: rbd-nbd process is not running yet: command terminated with exit code 1
Aug 10 07:52:37.400: INFO: ExecWithOptions {Command:[/bin/sh -c ps -eo 'cmd' | grep [r]bd-nbd] Namespace:cephcsi-e2e-7a13656f PodName:csi-rbdplugin-dk84f ContainerName:csi-rbdplugin Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Aug 10 07:52:37.400: INFO: >>> kubeConfig: /root/.kube/config
Aug 10 07:52:37.620: INFO: rbd-nbd process is not running yet: command terminated with exit code 1
Aug 10 07:52:39.399: INFO: ExecWithOptions {Command:[/bin/sh -c ps -eo 'cmd' | grep [r]bd-nbd] Namespace:cephcsi-e2e-7a13656f PodName:csi-rbdplugin-dk84f ContainerName:csi-rbdplugin Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Aug 10 07:52:39.399: INFO: >>> kubeConfig: /root/.kube/config
Aug 10 07:52:39.632: INFO: rbd-nbd process is not running yet: command terminated with exit code 1
Aug 10 07:52:41.401: INFO: ExecWithOptions {Command:[/bin/sh -c ps -eo 'cmd' | grep [r]bd-nbd] Namespace:cephcsi-e2e-7a13656f PodName:csi-rbdplugin-dk84f ContainerName:csi-rbdplugin Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Aug 10 07:52:41.401: INFO: >>> kubeConfig: /root/.kube/config
Aug 10 07:52:41.608: INFO: rbd-nbd process is not running yet: command terminated with exit code 1
Aug 10 07:52:43.401: INFO: ExecWithOptions {Command:[/bin/sh -c ps -eo 'cmd' | grep [r]bd-nbd] Namespace:cephcsi-e2e-7a13656f PodName:csi-rbdplugin-dk84f ContainerName:csi-rbdplugin Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Aug 10 07:52:43.401: INFO: >>> kubeConfig: /root/.kube/config
Aug 10 07:52:43.603: INFO: rbd-nbd process is not running yet: command terminated with exit code 1
Aug 10 07:52:45.400: INFO: ExecWithOptions {Command:[/bin/sh -c ps -eo 'cmd' | grep [r]bd-nbd] Namespace:cephcsi-e2e-7a13656f PodName:csi-rbdplugin-dk84f ContainerName:csi-rbdplugin Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Aug 10 07:52:45.400: INFO: >>> kubeConfig: /root/.kube/config
Aug 10 07:52:45.630: INFO: rbd-nbd process is not running yet: command terminated with exit code 1
Aug 10 07:52:47.399: INFO: ExecWithOptions {Command:[/bin/sh -c ps -eo 'cmd' | grep [r]bd-nbd] Namespace:cephcsi-e2e-7a13656f PodName:csi-rbdplugin-dk84f ContainerName:csi-rbdplugin Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Aug 10 07:52:47.399: INFO: >>> kubeConfig: /root/.kube/config
Aug 10 07:52:47.605: INFO: rbd-nbd process is not running yet: command terminated with exit code 1
Aug 10 07:52:49.400: INFO: ExecWithOptions {Command:[/bin/sh -c ps -eo 'cmd' | grep [r]bd-nbd] Namespace:cephcsi-e2e-7a13656f PodName:csi-rbdplugin-dk84f ContainerName:csi-rbdplugin Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Aug 10 07:52:49.400: INFO: >>> kubeConfig: /root/.kube/config
Aug 10 07:52:49.604: INFO: rbd-nbd process is not running yet: command terminated with exit code 1
Aug 10 07:52:51.400: INFO: ExecWithOptions {Command:[/bin/sh -c ps -eo 'cmd' | grep [r]bd-nbd] Namespace:cephcsi-e2e-7a13656f PodName:csi-rbdplugin-dk84f ContainerName:csi-rbdplugin Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Aug 10 07:52:51.400: INFO: >>> kubeConfig: /root/.kube/config
Aug 10 07:52:51.605: INFO: rbd-nbd process is not running yet: command terminated with exit code 1
Aug 10 07:52:53.400: INFO: ExecWithOptions {Command:[/bin/sh -c ps -eo 'cmd' | grep [r]bd-nbd] Namespace:cephcsi-e2e-7a13656f PodName:csi-rbdplugin-dk84f ContainerName:csi-rbdplugin Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Aug 10 07:52:53.400: INFO: >>> kubeConfig: /root/.kube/config
Aug 10 07:52:53.598: INFO: rbd-nbd process is not running yet: command terminated with exit code 1
Aug 10 07:52:55.401: INFO: ExecWithOptions {Command:[/bin/sh -c ps -eo 'cmd' | grep [r]bd-nbd] Namespace:cephcsi-e2e-7a13656f PodName:csi-rbdplugin-dk84f ContainerName:csi-rbdplugin Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Aug 10 07:52:55.401: INFO: >>> kubeConfig: /root/.kube/config
Aug 10 07:52:55.618: INFO: rbd-nbd process is not running yet: command terminated with exit code 1
Aug 10 07:52:57.401: INFO: ExecWithOptions {Command:[/bin/sh -c ps -eo 'cmd' | grep [r]bd-nbd] Namespace:cephcsi-e2e-7a13656f PodName:csi-rbdplugin-dk84f ContainerName:csi-rbdplugin Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Aug 10 07:52:57.401: INFO: >>> kubeConfig: /root/.kube/config
Aug 10 07:52:57.617: INFO: attach command running after restart, runningAttachCmd: rbd-nbd --id cephcsi-rbd-node -m rook-ceph-mon-a.rook-ceph.svc.cluster.local:6789 --keyfile=/tmp/csi/keys/keyfile-084204234 attach replicapool/csi-vol-7e003901-f9a7-11eb-ae90-ee515005cb5b --device /dev/nbd0 --try-netlink --reattach-timeout=300
Aug 10 07:52:57.621: INFO: ExecWithOptions {Command:[/bin/sh -c echo 'Hello World' > /var/lib/www/html/test] Namespace:rbd-8081 PodName:csi-rbd-demo-pod ContainerName:web-server Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Aug 10 07:52:57.621: INFO: >>> kubeConfig: /root/.kube/config
Aug 10 07:52:58.973: INFO: stdErr occurred: /bin/sh: 1: cannot create /var/lib/www/html/test: Read-only file system
Aug 10 07:52:58.973: FAIL: failed to write IO, err: command terminated with exit code 2, stdErr: /bin/sh: 1: cannot create /var/lib/www/html/test: Read-only file system
Hit count +1
https://jenkins-ceph-csi.apps.ocp.ci.centos.org/blue/rest/organizations/jenkins/pipelines/mini-e2e_k8s-1.21/runs/709/nodes/95/log/?start=0
Hit count +1
Is there a dmesg for this (or any other) occurrence? I'm still not sure where the timeout is coming from.
> Hit count +1
>
> Is there a dmesg for this (or any other) occurrence? I'm still not sure where the timeout is coming from.
@idryomov, you can find the dmesg logs for one of the runs where this issue occurred here: https://jenkins-ceph-csi.apps.ocp.ci.centos.org/blue/rest/organizations/jenkins/pipelines/mini-e2e_k8s-1.20/runs/2053/nodes/104/log/?start=0
https://jenkins-ceph-csi.apps.ocp.ci.centos.org/blue/organizations/jenkins/mini-e2e_k8s-1.20/detail/mini-e2e_k8s-1.20/2053/pipeline/104
It's part of the "log system status" step logs in the Jenkins pipeline.
Thanks, Rakshith. So it indeed went through the "normal" io-timeout path:
Aug 10 10:31:48 minikube kernel: EXT4-fs (nbd0): mounted filesystem with ordered data mode. Opts: discard
Aug 10 10:31:52 minikube kernel: block nbd0: Receive control failed (result -32)
Aug 10 10:32:03 minikube kernel: block nbd0: Attempted send on invalid socket
Aug 10 10:32:18 minikube kernel: block nbd0: reconnected socket
>>>>>> Aug 10 10:32:22 minikube kernel: block nbd0: Connection timed out <<<<<<
Aug 10 10:32:22 minikube kernel: block nbd0: shutting down sockets
Aug 10 10:32:22 minikube kernel: print_req_error: I/O error, dev nbd0, sector 0
Aug 10 10:32:22 minikube kernel: print_req_error: I/O error, dev nbd0, sector 1048608
Aug 10 10:32:22 minikube kernel: Aborting journal on device nbd0-8.
Aug 10 10:32:22 minikube kernel: EXT4-fs error (device nbd0) in __ext4_new_inode:985: Journal has aborted
Aug 10 10:32:22 minikube kernel: print_req_error: I/O error, dev nbd0, sector 1048576
Aug 10 10:32:22 minikube kernel: block nbd0: NBD_DISCONNECT
Aug 10 10:32:22 minikube kernel: block nbd0: Send disconnect failed -32
Aug 10 10:32:22 minikube kernel: print_req_error: I/O error, dev nbd0, sector 1048576
Aug 10 10:32:22 minikube kernel: Buffer I/O error on dev nbd0, logical block 131072, lost sync page write
Aug 10 10:32:22 minikube kernel: JBD2: Error -5 detected when updating journal superblock for nbd0-8.
Aug 10 10:32:22 minikube kernel: print_req_error: I/O error, dev nbd0, sector 0
Aug 10 10:32:22 minikube kernel: print_req_error: I/O error, dev nbd0, sector 0
Aug 10 10:32:22 minikube kernel: Buffer I/O error on dev nbd0, logical block 0, lost sync page write
Aug 10 10:32:22 minikube kernel: EXT4-fs (nbd0): I/O error while writing superblock
Aug 10 10:32:22 minikube kernel: EXT4-fs error (device nbd0) in ext4_create:2530: Journal has aborted
Aug 10 10:32:22 minikube kernel: print_req_error: I/O error, dev nbd0, sector 0
Aug 10 10:32:22 minikube kernel: print_req_error: I/O error, dev nbd0, sector 0
Aug 10 10:32:22 minikube kernel: Buffer I/O error on dev nbd0, logical block 0, lost sync page write
Aug 10 10:32:22 minikube kernel: EXT4-fs (nbd0): I/O error while writing superblock
But this same code seems to be responsible for resubmitting requests to the new rbd-nbd instance after a restart, so setting `io-timeout` to the same value as `reattach-timeout` (which is really a dead socket timeout) is probably not a good idea...
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.
To address this, we have already set `io-timeout=0` by default, but as @idryomov rightly mentioned at https://github.com/ceph/ceph-csi/pull/2394#issuecomment-903636068:
> Kernel 4.19.182 doesn't handle io-timeout=0 correctly because it lacks torvalds/linux@2da22da. In general, the nbd driver in older kernels is quite broken so you may need to constrain nbd tests to e.g. CentOS 8.4.

I'm hoping that for kernels without torvalds/linux@2da22da we can use the quiesce-on-signal mechanism: https://github.com/ceph/ceph/pull/42609
cc: @nixpanic @Madhu-1
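As an aside, a quick check of what a given kernel does with `io-timeout=0` could look like the following sketch (illustrative names; on kernels carrying torvalds/linux@2da22da the sysfs value should stay 0, i.e. no timeout):

```console
$ rbd-nbd map rbd-pool/image0 --try-netlink --io-timeout 0
/dev/nbd0
$ cat /sys/block/nbd0/queue/io_timeout   # expect 0 (no timeout) on fixed kernels
0
$ rbd-nbd unmap /dev/nbd0
```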
@nixpanic @Madhu-1 @humblec can we please take a call on this?
Hello,
I'm not sure I'm hitting the same case this issue describes, but I have some volumes mounted via nbd that end up read-only on some pods, even though they are mounted as ReadWriteOnce and without the read-only option. I'm running Ceph Octopus and csi-driver 3.7.2.
Does anyone know whether I'm in this case, and whether any workaround exists?
Thanks
@quentinleclerc please check whether you have clients blocklisted on the Ceph cluster (see the sketch below), and also check `dmesg` on the node where you have this problem. The workaround for now is restarting the node.
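For reference, blocklisted clients can be listed from the toolbox; the command was renamed between releases, so which form applies depends on your Ceph version:

```sh
# List blocklisted clients on the cluster.
ceph osd blocklist ls    # Pacific and later
ceph osd blacklist ls    # Octopus and earlier
```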
Thanks @Madhu-1, indeed my node was spamming "libceph: auth method 'x' error -13" and other I/O errors. A restart did solve the problem (for now).
> spamming "libceph: auth method 'x' error -13" and other I/O errors
This means that you were (perhaps at some point in the past?) using the krbd mounter, not the rbd-nbd mounter.
Yes, indeed. At first I had the storage class running with the default options, but some of my pods had an incompatibility with krbd, so I activated the tryOtherMounters option to fall back to nbd:
parameters:
  tryOtherMounters: "true"