
Node driver registrar says "Lost connection" and does nothing

jim3ma opened this issue 3 years ago

Log:

I0323 21:20:55.915743       1 main.go:110] Version: unknown
I0323 21:20:55.915816       1 main.go:120] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0323 21:20:55.915833       1 connection.go:151] Connecting to unix:///csi/csi.sock
I0323 21:20:55.916332       1 main.go:127] Calling CSI driver to discover driver name
I0323 21:20:55.916348       1 connection.go:180] GRPC call: /csi.v1.Identity/GetPluginInfo
I0323 21:20:55.916354       1 connection.go:181] GRPC request: {}
I0323 21:20:55.917907       1 connection.go:183] GRPC response: {"name":"x","vendor_version":"0.3.0"}
I0323 21:20:55.918413       1 connection.go:184] GRPC error: <nil>
I0323 21:20:55.918421       1 main.go:137] CSI driver name: "x"
I0323 21:20:55.918477       1 node_register.go:58] Starting Registration Server at: /registration/x-reg.sock
I0323 21:20:55.918603       1 node_register.go:67] Registration Server started at: /registration/x-reg.sock
I0323 21:20:57.567609       1 main.go:77] Received GetInfo call: &InfoRequest{}
I0323 21:20:57.671681       1 main.go:87] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
E0323 21:21:15.690492       1 connection.go:129] Lost connection to unix:///csi/csi.sock.

After restarting the container, it works.

jim3ma avatar Mar 26 '21 07:03 jim3ma

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot avatar Jun 24 '21 07:06 fejta-bot

/remove-lifecycle stale

aramase avatar Jul 12 '21 17:07 aramase

To restart the container automatically, #152 would help you.

toshipp avatar Oct 07 '21 07:10 toshipp

I am experiencing the same behavior. Given all 3 containers (node-driver-registrar, secrets-store and liveness-probe) are running, if secrets-store gets restarted, node-driver-registrar loses connection to unix:///csi/csi.sock.

Node-driver-registrar logs:

I1012 11:51:22.299487 1 main.go:164] Version: v2.3.0
I1012 11:51:22.299515 1 main.go:165] Running node-driver-registrar in mode=registration
I1012 11:51:22.299913 1 main.go:189] Attempting to open a gRPC connection with: "/csi/csi.sock"
I1012 11:51:22.299936 1 connection.go:154] Connecting to unix:///csi/csi.sock
I1012 11:51:22.300436 1 main.go:196] Calling CSI driver to discover driver name
I1012 11:51:22.300451 1 connection.go:183] GRPC call: /csi.v1.Identity/GetPluginInfo
I1012 11:51:22.300456 1 connection.go:184] GRPC request: {}
I1012 11:51:22.303030 1 connection.go:186] GRPC response: {"name":"secrets-store.csi.k8s.io","vendor_version":"v0.3.0"}
I1012 11:51:22.303071 1 connection.go:187] GRPC error: <nil>
I1012 11:51:22.303078 1 main.go:206] CSI driver name: "secrets-store.csi.k8s.io"
I1012 11:51:22.303115 1 node_register.go:52] Starting Registration Server at: /registration/secrets-store.csi.k8s.io-reg.sock
I1012 11:51:22.303301 1 node_register.go:61] Registration Server started at: /registration/secrets-store.csi.k8s.io-reg.sock
I1012 11:51:22.303447 1 node_register.go:91] Skipping healthz server because HTTP endpoint is set to: ""
I1012 11:51:23.540046 1 main.go:100] Received GetInfo call: &InfoRequest{}
I1012 11:51:23.540214 1 main.go:107] "Kubelet registration probe created" path="/var/lib/kubelet/plugins/csi-secrets-store/registration"
I1012 11:51:23.577520 1 main.go:118] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
E1012 11:51:25.218944 1 connection.go:132] Lost connection to unix:///csi/csi.sock.

Secret store logs:

I1012 11:53:06.149183 1 exporter.go:33] metrics backend: prometheus
I1012 11:53:06.205017 1 main.go:169] "provider health check enabled" interval="1m0s"
I1012 11:53:06.205636 1 secrets-store.go:40] "Initializing Secrets Store CSI Driver" driver="secrets-store.csi.k8s.io" version="v0.3.0" buildTime="2021-08-30-18:38"
I1012 11:53:06.205852 1 server.go:117] "Listening for connections" address="//csi/csi.sock"
I1012 11:53:06.205923 1 main.go:174] starting manager
I1012 11:53:06.206014 1 reconciler.go:131] starting rotation reconciler with poll interval: 10s

Liveness probe logs:

I1012 11:51:22.555987 1 main.go:149] calling CSI driver to discover driver name
I1012 11:51:22.558382 1 main.go:155] CSI driver name: "secrets-store.csi.k8s.io"
I1012 11:51:22.558401 1 main.go:183] ServeMux listening at "0.0.0.0:9808"
E1012 11:52:05.885988 1 main.go:64] failed to establish connection to CSI driver: context deadline exceeded
W1012 11:52:12.886013 1 connection.go:173] Still connecting to unix:///csi/csi.sock
E1012 11:52:20.885852 1 main.go:64] failed to establish connection to CSI driver: context deadline exceeded
W1012 11:52:22.885321 1 connection.go:173] Still connecting to unix:///csi/csi.sock
W1012 11:52:27.886239 1 connection.go:173] Still connecting to unix:///csi/csi.sock
W1012 11:52:32.885371 1 connection.go:173] Still connecting to unix:///csi/csi.sock
E1012 11:52:35.885665 1 main.go:64] failed to establish connection to CSI driver: context deadline exceeded
W1012 11:52:37.886239 1 connection.go:173] Still connecting to unix:///csi/csi.sock
W1012 11:52:42.885652 1 connection.go:173] Still connecting to unix:///csi/csi.sock
W1012 11:52:42.885679 1 connection.go:173] Still connecting to unix:///csi/csi.sock
W1012 11:52:47.885302 1 connection.go:173] Still connecting to unix:///csi/csi.sock
E1012 11:52:50.885540 1 main.go:64] failed to establish connection to CSI driver: context deadline exceeded
W1012 11:52:52.885269 1 connection.go:173] Still connecting to unix:///csi/csi.sock
W1012 11:52:52.886357 1 connection.go:173] Still connecting to unix:///csi/csi.sock
W1012 11:52:57.885542 1 connection.go:173] Still connecting to unix:///csi/csi.sock
W1012 11:52:57.885565 1 connection.go:173] Still connecting to unix:///csi/csi.sock
W1012 11:53:02.885251 1 connection.go:173] Still connecting to unix:///csi/csi.sock
W1012 11:53:02.885407 1 connection.go:173] Still connecting to unix:///csi/csi.sock
E1012 11:53:05.885994 1 main.go:64] failed to establish connection to CSI driver: context deadline exceeded

To restart the container automatically, #152 would help you.

That does not help; I already have this:

- name: node-driver-registrar
  image: "..."
  args:
    - --v=5
    - --csi-address=/csi/csi.sock
    - --kubelet-registration-path=/var/lib/kubelet/plugins/csi-secrets-store/csi.sock
  livenessProbe:
    exec:
      command:
        - /csi-node-driver-registrar
        - --kubelet-registration-path=/var/lib/kubelet/plugins/csi-secrets-store/csi.sock
        - --mode=kubelet-registration-probe
    initialDelaySeconds: 3

yatanasov-hs avatar Oct 25 '21 07:10 yatanasov-hs

To restart the container automatically, #152 would help you.

This helps during the startup of node-driver-registrar if the kubelet plugin registration fails, but these logs show that the kubelet plugin registration succeeded, so the workaround in #152 can't help here:

I0323 21:20:57.567609       1 main.go:77] Received GetInfo call: &InfoRequest{}
I0323 21:20:57.671681       1 main.go:87] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}

@yatanasov-hs thanks for the logs. Looking at the timestamps, I see that the liveness probe attempted to connect after node-driver-registrar logged the lost connection:

# node-driver-registrar
E1012 11:51:25.218944 1 connection.go:132] Lost connection to unix:///csi/csi.sock

# livenessprobe
E1012 11:52:05.885988 1 main.go:64] failed to establish connection to CSI driver: context deadline exceeded
W1012 11:52:12.886013 1 connection.go:173] Still connecting to unix:///csi/csi.sock

The error Lost connection to {{address}} comes from https://github.com/kubernetes-csi/csi-lib-utils/blob/a2ccb594bb74b61a0655d3431c6365a6a567c7af/connection/connection.go?#L132, which by default will attempt to reconnect. I think I can add a log after a successful dial to rule out the dial timeout.

Both node-driver-registrar and livenessprobe should be able to reestablish the connection to csi.sock if they lose it, so I'm still not sure why they couldn't.
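
For illustration, here is a minimal Go sketch of that reconnect behavior. This is not the csi-lib-utils code; the socket path and message wording are just borrowed from the logs above, and the watcher is a hypothetical stand-in for whatever connection monitoring the library does:

package main

import (
	"context"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	const target = "unix:///csi/csi.sock" // CSI sidecars talk to the driver over a unix socket

	conn, err := grpc.Dial(target, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("failed to dial %s: %v", target, err)
	}
	defer conn.Close()
	conn.Connect() // kick the lazy connection so the state machine starts

	// Log every transition into a broken state; gRPC itself keeps retrying in
	// the background, which is the "by default will attempt to reconnect"
	// behavior mentioned above.
	for state := conn.GetState(); ; state = conn.GetState() {
		if state == connectivity.TransientFailure || state == connectivity.Shutdown {
			log.Printf("Lost connection to %s (state=%s)", target, state)
		}
		if !conn.WaitForStateChange(context.Background(), state) {
			return // only happens if the context is cancelled
		}
	}
}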

mauriciopoppe avatar Oct 25 '21 18:10 mauriciopoppe

I'm having the same problem after upgrading OKD from 4.7 to 4.8.

chelobarreto avatar Oct 26 '21 11:10 chelobarreto

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 24 '22 12:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Feb 23 '22 13:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot avatar Mar 25 '22 13:03 k8s-triage-robot

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 25 '22 13:03 k8s-ci-robot

Hi, I happened to see the same behaviour on k8s 1.24 with quay.io/k8scsi/csi-node-driver-registrar:v1.2.0, any idea what might be causing this?

fradeve avatar May 31 '23 14:05 fradeve

/reopen /remove-lifecycle rotten /lifecycle frozen /kind bug

mauriciopoppe avatar Jun 01 '23 04:06 mauriciopoppe

@mauriciopoppe: Reopened this issue.

In response to this:

/reopen /remove-lifecycle rotten /lifecycle frozen /kind bug

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jun 01 '23 04:06 k8s-ci-robot

I got the same issue on an AKS cluster on version 1.27.3. The csi-azurefile-node pods started failing on one of the node pools (1 of 3). I'm still looking into what caused it, but my best guess at the moment is that something failed to start up properly after a node upgrade during the automatic maintenance window.

[screenshot attached to the original comment]

BertelBB avatar Oct 26 '23 13:10 BertelBB

I am also observing this on our AWS EKS 1.27.4. Exactly the same output as the comment above.

Rohlik avatar Nov 15 '23 09:11 Rohlik

@gnufied Does this PR https://github.com/kubernetes-csi/node-driver-registrar/pull/322 fix the current issue (#139)?

basaveswar-kureti avatar Dec 13 '23 13:12 basaveswar-kureti

It should help with it, yes.

gnufied avatar Dec 13 '23 17:12 gnufied

Thanks @gnufied for the confirmation. We tested this internally on a bunch of our k8s clusters using csi-node-driver-registrar v2.9.2 (so anything > v2.9.0) and have not seen this issue.

basaveswar-kureti avatar Dec 13 '23 18:12 basaveswar-kureti

Thanks @gnufied, I wanted to expand on the effect of #322 (released in https://github.com/kubernetes-csi/node-driver-registrar/releases/tag/v2.9.0) on this issue.

Here is a possible scenario where we might see errors like the ones in https://github.com/kubernetes-csi/node-driver-registrar/issues/139#issuecomment-950590661 or https://github.com/kubernetes-csi/node-driver-registrar/issues/139#issuecomment-1781103638:

  • E1012 11:52:05.885988 1 main.go:64] failed to establish connection to CSI driver: context deadline exceeded
  • W1012 11:52:12.886013 1 connection.go:173] Still connecting to unix:///csi/csi.sock
  • E1012 11:51:25.218944 1 connection.go:132] Lost connection to unix:///csi/csi.sock.

It can happen if the csi.sock owned by the CSI Driver node component suddenly dies, which could happen because the CSI Driver container restarted. Let's assume that the CSI Driver node component was restarted: the usual behavior for node-driver-registrar is to log Lost connection to ... and attempt to reconnect; however, it looks like after the restart the CSI Driver node component creates the socket again, and node-driver-registrar can see the socket file but can't connect to it.

#322 adds a more thorough check on the /healthz endpoint: in addition to checking the existence of the socket file, node-driver-registrar makes a CSI call (NodeGetInfo) against the socket, and if that call fails consistently the container will eventually be restarted. Note that for this workaround the deployment manifests must enable a livenessProbe probing the /healthz endpoint; for more info about the /healthz setup please check this example in the README.
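
As a rough illustration of that idea (a hedged sketch, not the actual #322 implementation; the port 9809 and the socket path are arbitrary example values), a /healthz handler that makes a real NodeGetInfo call instead of only checking that the socket file exists could look like this:

package main

import (
	"context"
	"net/http"
	"time"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

const csiAddress = "unix:///csi/csi.sock" // example path

func healthz(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
	defer cancel()

	// Dial with WithBlock so a dead socket fails here instead of lazily later.
	conn, err := grpc.DialContext(ctx, csiAddress,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock())
	if err != nil {
		http.Error(w, "cannot connect to CSI driver: "+err.Error(), http.StatusInternalServerError)
		return
	}
	defer conn.Close()

	// A real CSI call, not just a check that the socket file exists.
	if _, err := csi.NewNodeClient(conn).NodeGetInfo(ctx, &csi.NodeGetInfoRequest{}); err != nil {
		http.Error(w, "NodeGetInfo failed: "+err.Error(), http.StatusInternalServerError)
		return
	}
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/healthz", healthz)
	// The port is an arbitrary example; a livenessProbe in the DaemonSet
	// manifest would probe this endpoint so that repeated failures
	// eventually restart the container.
	http.ListenAndServe(":9809", nil)
}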

mauriciopoppe avatar Dec 13 '23 18:12 mauriciopoppe

@mauriciopoppe sorry for the late response, but the way I have seen this issue happen is:

  1. The CSI driver pod and registrar are running as usual.
  2. Something causes another registrar container to be created (in my case this happened because someone was trying to debug the driver pod and created a parallel container).
  3. When this happens, the new registrar creates a new registration socket file (which replaces the old one created in step 1), and since the driver is still running, kubelet sees the new registration socket and everything works fine.
  4. Now when this debug container is terminated, it just leaves the registration socket file behind. Nobody is listening on it, and since the registrar doesn't have a retry mechanism, kubelet sees the driver as removed.

So https://github.com/kubernetes-csi/node-driver-registrar/pull/322 mainly fixes the issue with stale registration sockets, assuming the health check is enabled on the registration socket.
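
To illustrate why a stale registration socket fools a plain existence check (a hypothetical helper, not part of node-driver-registrar; the path is just the one from the logs above): a stat on the file still succeeds after the debug container exits, while an actual dial fails because nothing is listening anymore:

package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

// socketStatus distinguishes "the file exists" from "someone is listening".
func socketStatus(path string) string {
	if _, err := os.Stat(path); err != nil {
		return "socket file missing: " + err.Error()
	}
	// Only an actual dial tells us whether a server still accepts connections.
	conn, err := net.DialTimeout("unix", path, 2*time.Second)
	if err != nil {
		return "stale socket: file exists but nothing is listening (" + err.Error() + ")"
	}
	conn.Close()
	return "socket is healthy"
}

func main() {
	// Example path; the real registration socket is named after the driver.
	fmt.Println(socketStatus("/registration/secrets-store.csi.k8s.io-reg.sock"))
}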

Now, I know that this issue has bugs linked where the csi-driver socket is stale, but in the current iteration of node-driver-registrar we only connect to the csi-driver socket once, on startup. We are in fact closing the connection after retrieving the driver name, and if for some reason node-driver-registrar can't connect to the csi-driver socket, it will simply exit on startup and restart itself. But as such, #322 does not fix stale csi-driver socket issues.
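
For reference, a simplified sketch of that one-shot startup interaction (assumed behavior, not the actual node-driver-registrar source): connect to the driver socket, ask for the driver name via GetPluginInfo, and close the connection, after which a later failure of the driver socket goes unnoticed:

package main

import (
	"context"
	"log"
	"time"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func discoverDriverName(csiAddress string) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, csiAddress,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock())
	if err != nil {
		return "", err // cannot connect at startup: exit and let the container restart
	}
	// The connection is closed as soon as the name is retrieved, which is why
	// a stale driver socket is not detected from here afterwards.
	defer conn.Close()

	resp, err := csi.NewIdentityClient(conn).GetPluginInfo(ctx, &csi.GetPluginInfoRequest{})
	if err != nil {
		return "", err
	}
	return resp.GetName(), nil
}

func main() {
	name, err := discoverDriverName("unix:///csi/csi.sock")
	if err != nil {
		log.Fatalf("failed to discover driver name: %v", err)
	}
	log.Printf("CSI driver name: %q", name)
}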

gnufied avatar Jan 10 '24 19:01 gnufied