
Driver registration appears broken

Open kbreit opened this issue 6 months ago • 7 comments

What happened: I am deploying Prometheus with the NFS driver providing persistent storage. The CSI driver is installed via Helm and all controller and node pods are up and READY. When I run my Prometheus pod it doesn't start successfully with the following error:

  Warning  FailedMount  52s (x702 over 23h)  kubelet  MountVolume.MountDevice failed for volume "pvc-d37818b3-5bb4-49ea-99c7-984579fe6871" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name nfs.csi.k8s.io not found in the list of registered CSI drivers

Here are logs for the 3 containers in the node pod for that node.

❯ kubectl -n kube-system logs csi-nfs-node-nwpqj
Defaulted container "liveness-probe" out of: liveness-probe, node-driver-registrar, nfs
I0528 03:23:59.037639       1 main.go:135] "Calling CSI driver to discover driver name"
I0528 03:23:59.039000       1 main.go:143] "CSI driver name" driver="nfs.csi.k8s.io"
I0528 03:23:59.039049       1 main.go:172] "ServeMux listening" address="localhost:29653"

k8s-infra on  main [!?]
❯ kubectl -n kube-system logs csi-nfs-node-nwpqj -c node-driver-registrar
I0528 03:22:19.505312       1 main.go:150] "Version" version="v2.13.0"
I0528 03:22:19.505424       1 main.go:151] "Running node-driver-registrar" mode=""
I0528 03:22:19.505435       1 main.go:172] "Attempting to open a gRPC connection" csiAddress="/csi/csi.sock"
I0528 03:22:19.506264       1 main.go:180] "Calling CSI driver to discover driver name"
I0528 03:22:19.507908       1 main.go:189] "CSI driver name" csiDriverName="nfs.csi.k8s.io"
I0528 03:22:19.508006       1 node_register.go:56] "Starting Registration Server" socketPath="/registration/nfs.csi.k8s.io-reg.sock"
I0528 03:22:19.508201       1 node_register.go:66] "Registration Server started" socketPath="/registration/nfs.csi.k8s.io-reg.sock"
I0528 03:22:19.508413       1 node_register.go:96] "Skipping HTTP server"
I0528 03:22:20.728591       1 main.go:96] "Received GetInfo call" request="&InfoRequest{}"
I0528 03:22:21.392540       1 main.go:108] "Received NotifyRegistrationStatus call" status="&RegistrationStatus{PluginRegistered:true,Error:,}"

k8s-infra on  main [!?]
❯ kubectl -n kube-system logs csi-nfs-node-nwpqj -c nfs
I0528 03:22:19.602554       1 nfs.go:90] Driver: nfs.csi.k8s.io version: v4.11.0
I0528 03:22:19.602723       1 nfs.go:147]
DRIVER INFORMATION:
-------------------
Build Date: "2025-03-18T13:07:23Z"
Compiler: gc
Driver Name: nfs.csi.k8s.io
Driver Version: v4.11.0
Git Commit: ""
Go Version: go1.23.6
Platform: linux/amd64

Streaming logs below:
I0528 03:22:19.605856       1 mount_linux.go:334] Detected umount with safe 'not mounted' behavior
I0528 03:22:19.606155       1 server.go:117] Listening for connections on address: &net.UnixAddr{Name:"//csi/csi.sock", Net:"unix"}
I0528 03:22:20.599876       1 utils.go:111] GRPC call: /csi.v1.Identity/GetPluginInfo
I0528 03:22:20.599971       1 utils.go:112] GRPC request: {}
I0528 03:22:20.602465       1 utils.go:118] GRPC response: {"name":"nfs.csi.k8s.io","vendor_version":"v4.11.0"}
I0528 03:22:20.729465       1 utils.go:111] GRPC call: /csi.v1.Node/NodeGetInfo
I0528 03:22:20.729481       1 utils.go:112] GRPC request: {}
I0528 03:22:20.729512       1 utils.go:118] GRPC response: {"node_id":"kbedge001"}
I0528 03:22:38.028051       1 utils.go:111] GRPC call: /csi.v1.Identity/GetPluginInfo
I0528 03:22:38.028308       1 utils.go:112] GRPC request: {}
I0528 03:22:38.028480       1 utils.go:118] GRPC response: {"name":"nfs.csi.k8s.io","vendor_version":"v4.11.0"}
I0528 03:23:08.018751       1 utils.go:111] GRPC call: /csi.v1.Identity/GetPluginInfo
I0528 03:23:08.018762       1 utils.go:112] GRPC request: {}
I0528 03:23:08.018816       1 utils.go:118] GRPC response: {"name":"nfs.csi.k8s.io","vendor_version":"v4.11.0"}
I0528 03:23:59.038230       1 utils.go:111] GRPC call: /csi.v1.Identity/GetPluginInfo
I0528 03:23:59.038265       1 utils.go:112] GRPC request: {}
I0528 03:23:59.038285       1 utils.go:118] GRPC response: {"name":"nfs.csi.k8s.io","vendor_version":"v4.11.0"}

While I don't know exactly what the problem is, I did notice there is no socket for the NFS CSI driver in the plugins_registry directory; only sockets for other CSI drivers I have tried while troubleshooting this problem are present.

root@kbedge001:/var/lib/kubelet/plugins_registry# ls
container-image.csi.k8s.io-reg.sock  org.democratic-csi.local-hostpath-reg.sock

I confirmed the directory looks the same inside the container by inspecting the container filesystem under /proc/<pid>/root/registration. This may not be the root cause, but it is something that stuck out to me. Finally, the CSIDriver object is visible:

❯ kubectl get csidriver nfs.csi.k8s.io
NAME             ATTACHREQUIRED   PODINFOONMOUNT   STORAGECAPACITY   TOKENREQUESTS   REQUIRESREPUBLISH   MODES        AGE
nfs.csi.k8s.io   false            false            false             <unset>         false               Persistent   2d8h
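
If it helps, the per-node view of registered drivers can also be checked through the CSINode object (node name taken from the NodeGetInfo response above); if nfs.csi.k8s.io is missing from that list for this node, it matches the kubelet error:

❯ kubectl get csinode kbedge001 -o jsonpath='{range .spec.drivers[*]}{.name}{"\n"}{end}'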

What you expected to happen: I expected the driver registration to happen transparently and my pod to come up.

How to reproduce it:

  1. Install the NFS CSI driver on k3s via Helm (other install methods can probably reproduce it as well)
  2. Create a PVC
  3. Create a pod bound to the new PVC (a minimal sketch of steps 2 and 3 follows)
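
Something like the following should work for steps 2 and 3 (a rough sketch; the nfs-csi StorageClass name is an assumption and needs to match whatever StorageClass is actually defined):

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-nfs-pvc
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: nfs-csi        # assumption: adjust to the real StorageClass
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: test-nfs-pod
spec:
  containers:
    - name: busybox
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: test-nfs-pvc
EOF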

Anything else we need to know?:

The values.yaml file is blank, so the chart should be using its defaults.

Environment:

  • CSI Driver version: v4.11.0
  • Kubernetes version (use kubectl version): 1.31.5+k3s1
  • OS (e.g. from /etc/os-release): Ubuntu 24.04.1 LTS
  • Kernel (e.g. uname -a): Linux kbedge001 6.8.0-55-generic #57-Ubuntu SMP PREEMPT_DYNAMIC Wed Feb 12 23:42:21 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: Helm
  • Others:

kbreit avatar May 30 '25 11:05 kbreit

Is it related to the kubeletDir issue? https://github.com/kubernetes-csi/csi-driver-nfs/tree/master/charts#tips
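
i.e. if the kubelet root directory on the node is not /var/lib/kubelet, the chart needs to be installed with kubeletDir pointing at the real path, roughly like this (the repo alias and path below are only examples):

helm upgrade --install csi-driver-nfs csi-driver-nfs/csi-driver-nfs \
  --namespace kube-system \
  --set kubeletDir="/var/lib/kubelet"    # example only; point this at the kubelet root dir your distro uses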

andyzhangx avatar May 30 '25 12:05 andyzhangx

@andyzhangx I've thought of that, but I am not sure, since /var/lib/kubelet and /var/lib/rancher/k3s both exist and have data in them. For example, here is the file and directory listing in /var/lib/kubelet.

root@kbedge001:/var/lib/kubelet# ls -l
total 32
drwx------  2 root root 4096 Feb 12 14:30 checkpoints
-rw-------  1 root root   62 Feb 12 14:30 cpu_manager_state
drwxr-xr-x  2 root root 4096 May 22 18:33 device-plugins
-rw-------  1 root root   61 Feb 12 14:30 memory_manager_state
drwxr-x---  5 root root 4096 May 30 03:02 plugins
drwxr-x---  2 root root 4096 May 30 03:02 plugins_registry
drwxr-x---  2 root root 4096 May 22 18:33 pod-resources
drwxr-x--- 28 root root 4096 May 30 03:02 pods

and the /var/lib/rancher/k3s/agent/ directory has a lot of kubeconfigs and certificate/key pairs.

root@kbedge001:/var/lib/rancher/k3s/agent# ls
client-ca.crt              client-kubelet.crt     client-kube-proxy.key  k3scontroller.kubeconfig  pod-manifests        serving-kubelet.key
client-k3s-controller.crt  client-kubelet.key     containerd             kubelet.kubeconfig        server-ca.crt
client-k3s-controller.key  client-kube-proxy.crt  etc                    kubeproxy.kubeconfig      serving-kubelet.crt

Regardless, do you think it's worth my overriding the kubelet directory to the rancher/k3s one?
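
For what it's worth, a quick way to see where registration sockets actually end up on this node would be something like:

root@kbedge001:~# find /var/lib/kubelet /var/lib/rancher/k3s -name '*-reg.sock' 2>/dev/null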

kbreit avatar May 30 '25 14:05 kbreit

I reverted to older versions of the images via Helm and it may have fixed it. Here are my versions. Are you aware of any known regressions which may have caused this?

csi-driver-nfs:
  image:
    livenessProbe:
      tag: v2.15.0
    nfs:
      tag: v4.10.0
    nodeDriverRegistrar:
      tag: v2.12.0
    csiProvisioner:
      tag: v5.2.0
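
For reference, this was applied with a plain Helm upgrade against the existing release (release and repo alias below are just what I'd assume; adjust to your setup):

helm upgrade --install csi-driver-nfs csi-driver-nfs/csi-driver-nfs \
  --namespace kube-system \
  -f values.yaml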

kbreit avatar May 31 '25 01:05 kbreit

@kbreit it mainly depends on the livenessProbe, nfs, and nodeDriverRegistrar versions. Can you check which image change fixed the issue?
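
e.g. the image tags actually running in a node pod can be listed with something like this (assuming the chart's default app=csi-nfs-node label):

kubectl -n kube-system get pod -l app=csi-nfs-node \
  -o jsonpath='{range .items[0].spec.containers[*]}{.name}{"\t"}{.image}{"\n"}{end}'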

andyzhangx avatar May 31 '25 02:05 andyzhangx

Hello, I ran into the identical issue.

What's most surprising is that it is happening on only 1 of 2 nodes.

I'm running a small rke2 cluster (v1.32.5+rke2r1) containing 3 master (control plane) nodes. One of the nodes has a CriticalAddons taint applied, since it is an old Pentium PC used for storage only.

The other 2 PCs are fully operational nodes, one of which is not able to run any workload that requires NFS storage, due to the same issue:

  Warning  FailedMount  61s (x6 over 11m)  kubelet  MountVolume.MountDevice failed for volume "xxx" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name nfs.csi.k8s.io not found in the list of registered CSI drivers

I've tried everything I could find here: https://github.com/kubernetes-csi/csi-driver-nfs/tree/master/charts#tips and I also tried the versions provided by @kbreit. Unfortunately, that didn't fix anything. I see no errors in the controller pod, nor in the node pod. I've even tried increasing logLevel to 10 (I don't know what value would be equivalent to DEBUG mode).

The only difference I see between the working and non-working node is:

node-driver-registrar I0614 20:48:38.199258 1 main.go:96] "Received GetInfo call" request="&InfoRequest{}"
node-driver-registrar I0614 20:48:38.245209 1 main.go:108] "Received NotifyRegistrationStatus call" status="&RegistrationStatus{PluginRegistered:true,Error:,}"

Those two log lines appear only in the node pod running on the properly working worker; apart from that, there is no difference in initialization.

Mrkazik99 avatar Jun 14 '25 22:06 Mrkazik99

I forgot to add that I've tested the NFS connection between worker nodes by running showmount -e <nfs-server>. I've even successfully mounted a test NFS share on both worker nodes.
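
i.e. roughly this on each worker (export path and mount point are placeholders):

showmount -e <nfs-server>
mount -t nfs <nfs-server>:/<export> /mnt/test
umount /mnt/test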

Mrkazik99 avatar Jun 14 '25 22:06 Mrkazik99

Hi all, I was seeing an image pull error when using the values from the latest version: https://github.com/kubernetes-csi/csi-driver-nfs/blob/3dceb4a88526b16c7e3d6f7bf613758e7303c673/charts/latest/csi-driver-nfs/values.yaml

kubectl version
Client Version: v1.33.2
Kustomize Version: v5.6.0
Server Version: v1.32.3

kubectl --namespace=kube-system get pods
NAME                                       READY   STATUS             RESTARTS         AGE
calico-kube-controllers-5947598c79-jj6dd   1/1     Running            7 (26h ago)      92d
calico-node-9t5zf                          1/1     Running            4 (40h ago)      71d
calico-node-hq7wn                          1/1     Running            2 (3d16h ago)    64d
calico-node-smfng                          1/1     Running            6 (26h ago)      64d
csi-nfs-controller-55f6dc8854-v2nwx        1/5     CrashLoopBackOff   18 (3m41s ago)   12m
csi-nfs-node-46nht                         3/3     Running            16 (26h ago)     58d
csi-nfs-node-b8wzv                         1/3     ImagePullBackOff   6 (3m44s ago)    12m
csi-nfs-node-qhvgh                         3/3     Running            12 (40h ago)     58d
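
The failing image can be pinned down by describing the affected pod and checking its events, e.g.:

kubectl --namespace=kube-system describe pod csi-nfs-node-b8wzv | grep -iE 'image|failed'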

When I reverted to my previous values, all of the CSI pods came up:

image:
    baseRepo: registry.k8s.io
    nfs:
        repository: registry.k8s.io/sig-storage/nfsplugin
        tag: v4.11.0
        pullPolicy: IfNotPresent
    csiProvisioner:
        repository: registry.k8s.io/sig-storage/csi-provisioner
        tag: v5.2.0
        pullPolicy: IfNotPresent
    csiResizer:
        repository: registry.k8s.io/sig-storage/csi-resizer
        tag: v1.13.1
        pullPolicy: IfNotPresent
    csiSnapshotter:
        repository: registry.k8s.io/sig-storage/csi-snapshotter
        tag: v8.2.0
        pullPolicy: IfNotPresent
    livenessProbe:
        repository: registry.k8s.io/sig-storage/livenessprobe
        tag: v2.15.0
        pullPolicy: IfNotPresent
    nodeDriverRegistrar:
        repository: registry.k8s.io/sig-storage/csi-node-driver-registrar
        tag: v2.13.0
        pullPolicy: IfNotPresent
    externalSnapshotter:
        repository: registry.k8s.io/sig-storage/snapshot-controller
        tag: v8.2.0
        pullPolicy: IfNotPresent

This may not be related to this issue; I just want to share it in case someone is looking for working values. Thanks.

trishtechadmin avatar Jun 26 '25 12:06 trishtechadmin

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Sep 24 '25 12:09 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Oct 24 '25 13:10 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Nov 23 '25 14:11 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Nov 23 '25 14:11 k8s-ci-robot