
Node driver registrar says "Lost connection" and does nothing

jim3ma opened this issue 3 years ago

Log:

I0323 21:20:55.915743       1 main.go:110] Version: unknown
I0323 21:20:55.915816       1 main.go:120] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0323 21:20:55.915833       1 connection.go:151] Connecting to unix:///csi/csi.sock
I0323 21:20:55.916332       1 main.go:127] Calling CSI driver to discover driver name
I0323 21:20:55.916348       1 connection.go:180] GRPC call: /csi.v1.Identity/GetPluginInfo
I0323 21:20:55.916354       1 connection.go:181] GRPC request: {}
I0323 21:20:55.917907       1 connection.go:183] GRPC response: {"name":"x","vendor_version":"0.3.0"}
I0323 21:20:55.918413       1 connection.go:184] GRPC error: <nil>
I0323 21:20:55.918421       1 main.go:137] CSI driver name: "x"
I0323 21:20:55.918477       1 node_register.go:58] Starting Registration Server at: /registration/x-reg.sock
I0323 21:20:55.918603       1 node_register.go:67] Registration Server started at: /registration/x-reg.sock
I0323 21:20:57.567609       1 main.go:77] Received GetInfo call: &InfoRequest{}
I0323 21:20:57.671681       1 main.go:87] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
E0323 21:21:15.690492       1 connection.go:129] Lost connection to unix:///csi/csi.sock.

After restarting the container, it works.

jim3ma avatar Mar 26 '21 07:03 jim3ma

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot avatar Jun 24 '21 07:06 fejta-bot

/remove-lifecycle stale

aramase avatar Jul 12 '21 17:07 aramase

To restart the container automatically, #152 would help you.

toshipp avatar Oct 07 '21 07:10 toshipp

I am experiencing the same behavior. Given all 3 containers (node-driver-registrar, secrets-store and liveness-probe) are running, if secrets-store gets restarted, node-driver-registrar loses connection to unix:///csi/csi.sock.

Node-driver-registrar logs:

I1012 11:51:22.299487 1 main.go:164] Version: v2.3.0
I1012 11:51:22.299515 1 main.go:165] Running node-driver-registrar in mode=registration
I1012 11:51:22.299913 1 main.go:189] Attempting to open a gRPC connection with: "/csi/csi.sock"
I1012 11:51:22.299936 1 connection.go:154] Connecting to unix:///csi/csi.sock
I1012 11:51:22.300436 1 main.go:196] Calling CSI driver to discover driver name
I1012 11:51:22.300451 1 connection.go:183] GRPC call: /csi.v1.Identity/GetPluginInfo
I1012 11:51:22.300456 1 connection.go:184] GRPC request: {}
I1012 11:51:22.303030 1 connection.go:186] GRPC response: {"name":"secrets-store.csi.k8s.io","vendor_version":"v0.3.0"}
I1012 11:51:22.303071 1 connection.go:187] GRPC error: <nil>
I1012 11:51:22.303078 1 main.go:206] CSI driver name: "secrets-store.csi.k8s.io"
I1012 11:51:22.303115 1 node_register.go:52] Starting Registration Server at: /registration/secrets-store.csi.k8s.io-reg.sock
I1012 11:51:22.303301 1 node_register.go:61] Registration Server started at: /registration/secrets-store.csi.k8s.io-reg.sock
I1012 11:51:22.303447 1 node_register.go:91] Skipping healthz server because HTTP endpoint is set to: ""
I1012 11:51:23.540046 1 main.go:100] Received GetInfo call: &InfoRequest{}
I1012 11:51:23.540214 1 main.go:107] "Kubelet registration probe created" path="/var/lib/kubelet/plugins/csi-secrets-store/registration"
I1012 11:51:23.577520 1 main.go:118] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
E1012 11:51:25.218944 1 connection.go:132] Lost connection to unix:///csi/csi.sock.

Secret store logs:

I1012 11:53:06.149183 1 exporter.go:33] metrics backend: prometheus
I1012 11:53:06.205017 1 main.go:169] "provider health check enabled" interval="1m0s"
I1012 11:53:06.205636 1 secrets-store.go:40] "Initializing Secrets Store CSI Driver" driver="secrets-store.csi.k8s.io" version="v0.3.0" buildTime="2021-08-30-18:38"
I1012 11:53:06.205852 1 server.go:117] "Listening for connections" address="//csi/csi.sock"
I1012 11:53:06.205923 1 main.go:174] starting manager
I1012 11:53:06.206014 1 reconciler.go:131] starting rotation reconciler with poll interval: 10s

Liveness probe logs:

I1012 11:51:22.555987 1 main.go:149] calling CSI driver to discover driver name
I1012 11:51:22.558382 1 main.go:155] CSI driver name: "secrets-store.csi.k8s.io"
I1012 11:51:22.558401 1 main.go:183] ServeMux listening at "0.0.0.0:9808"
E1012 11:52:05.885988 1 main.go:64] failed to establish connection to CSI driver: context deadline exceeded
W1012 11:52:12.886013 1 connection.go:173] Still connecting to unix:///csi/csi.sock
E1012 11:52:20.885852 1 main.go:64] failed to establish connection to CSI driver: context deadline exceeded
W1012 11:52:22.885321 1 connection.go:173] Still connecting to unix:///csi/csi.sock
W1012 11:52:27.886239 1 connection.go:173] Still connecting to unix:///csi/csi.sock
W1012 11:52:32.885371 1 connection.go:173] Still connecting to unix:///csi/csi.sock
E1012 11:52:35.885665 1 main.go:64] failed to establish connection to CSI driver: context deadline exceeded
W1012 11:52:37.886239 1 connection.go:173] Still connecting to unix:///csi/csi.sock
W1012 11:52:42.885652 1 connection.go:173] Still connecting to unix:///csi/csi.sock
W1012 11:52:42.885679 1 connection.go:173] Still connecting to unix:///csi/csi.sock
W1012 11:52:47.885302 1 connection.go:173] Still connecting to unix:///csi/csi.sock
E1012 11:52:50.885540 1 main.go:64] failed to establish connection to CSI driver: context deadline exceeded
W1012 11:52:52.885269 1 connection.go:173] Still connecting to unix:///csi/csi.sock
W1012 11:52:52.886357 1 connection.go:173] Still connecting to unix:///csi/csi.sock
W1012 11:52:57.885542 1 connection.go:173] Still connecting to unix:///csi/csi.sock
W1012 11:52:57.885565 1 connection.go:173] Still connecting to unix:///csi/csi.sock
W1012 11:53:02.885251 1 connection.go:173] Still connecting to unix:///csi/csi.sock
W1012 11:53:02.885407 1 connection.go:173] Still connecting to unix:///csi/csi.sock
E1012 11:53:05.885994 1 main.go:64] failed to establish connection to CSI driver: context deadline exceeded

To restart the container automatically, #152 would help you.

That does not help; I already have this:

- name: node-driver-registrar
  image: "..."
  args:
    - --v=5
    - --csi-address=/csi/csi.sock
    - --kubelet-registration-path=/var/lib/kubelet/plugins/csi-secrets-store/csi.sock
  livenessProbe:
    exec:
      command:
        - /csi-node-driver-registrar
        - --kubelet-registration-path=/var/lib/kubelet/plugins/csi-secrets-store/csi.sock
        - --mode=kubelet-registration-probe
    initialDelaySeconds: 3

yatanasov-hs avatar Oct 25 '21 07:10 yatanasov-hs

To restart the container automatically, #152 would help you.

This helps during the startup of node-driver-registrar if the kubelet plugin registration fails, but these logs show that the kubelet plugin registration succeeded, so the workaround in #152 can't help here:

I0323 21:20:57.567609       1 main.go:77] Received GetInfo call: &InfoRequest{}
I0323 21:20:57.671681       1 main.go:87] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}

@yatanasov-hs thanks for the logs. Looking at the timestamps, I see that the liveness probe attempted to connect after node-driver-registrar logged the lost connection:

# node-driver-registrar
E1012 11:51:25.218944 1 connection.go:132] Lost connection to unix:///csi/csi.sock

# livenessprobe
E1012 11:52:05.885988 1 main.go:64] failed to establish connection to CSI driver: context deadline exceeded
W1012 11:52:12.886013 1 connection.go:173] Still connecting to unix:///csi/csi.sock

The error Lost connection to {{address}} comes from https://github.com/kubernetes-csi/csi-lib-utils/blob/a2ccb594bb74b61a0655d3431c6365a6a567c7af/connection/connection.go?#L132, which by default will attempt to reconnect. I think I can add a log after a successful dial to rule out the dial timeout.

Both node-driver-registrar and livenessprobe should be able to reestablish the connection to csi.sock if they lose it, so I'm still not sure why they couldn't.
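
For illustration, here is a minimal Go sketch of that reconnect behavior. This is not the csi-lib-utils code; the socket path and message wording are just borrowed from the logs above, and the watcher is a hypothetical stand-in for whatever connection monitoring the library does:

package main

import (
	"context"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	const target = "unix:///csi/csi.sock" // CSI sidecars talk to the driver over a unix socket

	conn, err := grpc.Dial(target, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("failed to dial %s: %v", target, err)
	}
	defer conn.Close()
	conn.Connect() // kick the lazy connection so the state machine starts

	// Log every transition into a broken state; gRPC itself keeps retrying in
	// the background, which is the "by default will attempt to reconnect"
	// behavior mentioned above.
	for state := conn.GetState(); ; state = conn.GetState() {
		if state == connectivity.TransientFailure || state == connectivity.Shutdown {
			log.Printf("Lost connection to %s (state=%s)", target, state)
		}
		if !conn.WaitForStateChange(context.Background(), state) {
			return // only happens if the context is cancelled
		}
	}
}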

mauriciopoppe avatar Oct 25 '21 18:10 mauriciopoppe

I'm having the same problem after upgrading OKD from 4.7 to 4.8.

chelobarreto avatar Oct 26 '21 11:10 chelobarreto

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 24 '22 12:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Feb 23 '22 13:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot avatar Mar 25 '22 13:03 k8s-triage-robot

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 25 '22 13:03 k8s-ci-robot

Hi, I happened to see the same behaviour on k8s 1.24 with quay.io/k8scsi/csi-node-driver-registrar:v1.2.0, any idea what might be causing this?

fradeve avatar May 31 '23 14:05 fradeve

/reopen /remove-lifecycle rotten /lifecycle frozen /kind bug

mauriciopoppe avatar Jun 01 '23 04:06 mauriciopoppe

@mauriciopoppe: Reopened this issue.

In response to this:

/reopen /remove-lifecycle rotten /lifecycle frozen /kind bug

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jun 01 '23 04:06 k8s-ci-robot

I got the same issue on an AKS cluster on version 1.27.3. The csi-azurefile-node pods started failing on one of the node pools (1 of 3). I'm still looking into what caused it, but my best guess at the moment is that something failed to start up properly after a node upgrade during the automatic maintenance window.

[screenshot attached to the original comment]

BertelBB avatar Oct 26 '23 13:10 BertelBB

I am also observing this on our AWS EKS 1.27.4. Exactly the same output as the comment above.

Rohlik avatar Nov 15 '23 09:11 Rohlik

@gnufied Does this PR https://github.com/kubernetes-csi/node-driver-registrar/pull/322 fix the current issue (#139)?

basaveswar-kureti avatar Dec 13 '23 13:12 basaveswar-kureti

It should help with it, yes.

gnufied avatar Dec 13 '23 17:12 gnufied

Thanks @gnufied for the confirmation. We tested this internally on a bunch of our k8s clusters using csi-node-driver-registrar v2.9.2 (so anything > v2.9.0) and have not seen this issue.

basaveswar-kureti avatar Dec 13 '23 18:12 basaveswar-kureti

Thanks @gnufied, I wanted to expand on the effect of #322 (released in https://github.com/kubernetes-csi/node-driver-registrar/releases/tag/v2.9.0) on this issue.

Here is a possible scenario where we might see errors like the ones in https://github.com/kubernetes-csi/node-driver-registrar/issues/139#issuecomment-950590661 or https://github.com/kubernetes-csi/node-driver-registrar/issues/139#issuecomment-1781103638:

  • E1012 11:52:05.885988 1 main.go:64] failed to establish connection to CSI driver: context deadline exceeded
  • W1012 11:52:12.886013 1 connection.go:173] Still connecting to unix:///csi/csi.sock
  • E1012 11:51:25.218944 1 connection.go:132] Lost connection to unix:///csi/csi.sock.

It can happen if the csi.sock owned by the CSI Driver node component suddenly dies, which could happen because the CSI Driver container restarted. Let's assume that the CSI Driver node component was restarted: the usual behavior for node-driver-registrar is to log Lost connection to ... and attempt to reconnect; however, it looks like after the restart the CSI Driver node component creates the socket again, and node-driver-registrar can see the socket file but can't connect to it.

#322 adds a more thorough check on the /healthz endpoint: in addition to checking the existence of the socket file, node-driver-registrar makes a CSI call (NodeGetInfo) against the socket, and if that call fails consistently the container will eventually be restarted. Note that for this workaround the deployment manifests must enable a livenessProbe probing the /healthz endpoint; for more info about the /healthz setup please check this example in the README.
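
As a rough illustration of that idea (a hedged sketch, not the actual #322 implementation; the port 9809 and the socket path are arbitrary example values), a /healthz handler that makes a real NodeGetInfo call instead of only checking that the socket file exists could look like this:

package main

import (
	"context"
	"net/http"
	"time"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

const csiAddress = "unix:///csi/csi.sock" // example path

func healthz(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
	defer cancel()

	// Dial with WithBlock so a dead socket fails here instead of lazily later.
	conn, err := grpc.DialContext(ctx, csiAddress,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock())
	if err != nil {
		http.Error(w, "cannot connect to CSI driver: "+err.Error(), http.StatusInternalServerError)
		return
	}
	defer conn.Close()

	// A real CSI call, not just a check that the socket file exists.
	if _, err := csi.NewNodeClient(conn).NodeGetInfo(ctx, &csi.NodeGetInfoRequest{}); err != nil {
		http.Error(w, "NodeGetInfo failed: "+err.Error(), http.StatusInternalServerError)
		return
	}
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/healthz", healthz)
	// The port is an arbitrary example; a livenessProbe in the DaemonSet
	// manifest would probe this endpoint so that repeated failures
	// eventually restart the container.
	http.ListenAndServe(":9809", nil)
}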

mauriciopoppe avatar Dec 13 '23 18:12 mauriciopoppe

@mauriciopoppe sorry for the late response, but the way I have seen this issue happen is:

  1. The CSI driver pod and registrar are running as usual.
  2. Something causes another registrar container to be created (in my case this happened because someone was trying to debug the driver pod and created a parallel container).
  3. When this happens, the new registrar creates a new registration socket file (which replaces the old one created in step 1), and since the driver is still running, kubelet sees the new registration socket and everything works fine.
  4. Now when this debug container is terminated, it just leaves the registration socket file behind. Nobody is listening on it, and since the registrar doesn't have a retry mechanism, kubelet sees the driver as removed.

So https://github.com/kubernetes-csi/node-driver-registrar/pull/322 mainly fixes the issue with stale registration sockets, assuming the health check is enabled on the registration socket.
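
To illustrate why a stale registration socket fools a plain existence check (a hypothetical helper, not part of node-driver-registrar; the path is just the one from the logs above): a stat on the file still succeeds after the debug container exits, while an actual dial fails because nothing is listening anymore:

package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

// socketStatus distinguishes "the file exists" from "someone is listening".
func socketStatus(path string) string {
	if _, err := os.Stat(path); err != nil {
		return "socket file missing: " + err.Error()
	}
	// Only an actual dial tells us whether a server still accepts connections.
	conn, err := net.DialTimeout("unix", path, 2*time.Second)
	if err != nil {
		return "stale socket: file exists but nothing is listening (" + err.Error() + ")"
	}
	conn.Close()
	return "socket is healthy"
}

func main() {
	// Example path; the real registration socket is named after the driver.
	fmt.Println(socketStatus("/registration/secrets-store.csi.k8s.io-reg.sock"))
}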

Now, I know that this issue has bugs linked where the csi-driver socket is stale, but in the current iteration of node-driver-registrar we only connect to the csi-driver socket once, on startup. We are in fact closing the connection after retrieving the driver name, and if for some reason node-driver-registrar can't connect to the csi-driver socket, it will simply exit on startup and restart itself. But as such, #322 does not fix stale csi-driver socket issues.
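
For reference, a simplified sketch of that one-shot startup interaction (assumed behavior, not the actual node-driver-registrar source): connect to the driver socket, ask for the driver name via GetPluginInfo, and close the connection, after which a later failure of the driver socket goes unnoticed:

package main

import (
	"context"
	"log"
	"time"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func discoverDriverName(csiAddress string) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, csiAddress,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock())
	if err != nil {
		return "", err // cannot connect at startup: exit and let the container restart
	}
	// The connection is closed as soon as the name is retrieved, which is why
	// a stale driver socket is not detected from here afterwards.
	defer conn.Close()

	resp, err := csi.NewIdentityClient(conn).GetPluginInfo(ctx, &csi.GetPluginInfoRequest{})
	if err != nil {
		return "", err
	}
	return resp.GetName(), nil
}

func main() {
	name, err := discoverDriverName("unix:///csi/csi.sock")
	if err != nil {
		log.Fatalf("failed to discover driver name: %v", err)
	}
	log.Printf("CSI driver name: %q", name)
}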

gnufied avatar Jan 10 '24 19:01 gnufied