gcp-compute-persistent-disk-csi-driver

Container csi-driver-registrar restarts when gce-pd-driver starts slower

Open · awx-fuyuanchu opened this issue 1 year ago · 2 comments

We've run into an issue on GKE when a new node is scaled up by the cluster-autoscaler.

The csi-driver-registrar container in the pdcsi-node pod can restart if gce-pd-driver starts more slowly: the registrar fails to call the CSI driver to get the driver info because the gce-pd-driver container is not ready yet.

This is the csi-driver-registrar log from just before it restarts:

Defaulted container "csi-driver-registrar" out of: csi-driver-registrar, gce-pd-driver
I0219 10:31:33.712910       1 main.go:135] Version: v2.9.0-gke.3-0-g098349f5
I0219 10:31:33.712983       1 main.go:136] Running node-driver-registrar in mode=
I0219 10:31:33.712988       1 main.go:157] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0219 10:31:33.713005       1 connection.go:213] Connecting to unix:///csi/csi.sock
W0219 10:31:43.714057       1 connection.go:232] Still connecting to unix:///csi/csi.sock
W0219 10:31:53.713714       1 connection.go:232] Still connecting to unix:///csi/csi.sock
W0219 10:32:03.714076       1 connection.go:232] Still connecting to unix:///csi/csi.sock
E0219 10:32:03.714104       1 main.go:160] error connecting to CSI driver: context deadline exceeded
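
For context, the failure above is the registrar dialing the driver's Unix socket with a bounded deadline and exiting once that deadline expires, which is what makes the kubelet restart the container. Below is a minimal Go sketch of that pattern; it is an illustration only, not the registrar's actual code, and the socket path and 30-second timeout are assumptions taken from the timestamps in the log.

package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Assumed values, taken from the log above: the registrar dials
	// unix:///csi/csi.sock and gives up roughly 30 seconds later.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// If the driver has not started listening on /csi/csi.sock yet, this
	// blocks until the deadline expires; the non-zero exit then triggers a
	// container restart.
	conn, err := grpc.DialContext(ctx, "unix:///csi/csi.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock(),
	)
	if err != nil {
		log.Fatalf("error connecting to CSI driver: %v", err)
	}
	defer conn.Close()
	log.Println("connected to CSI driver socket")
}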

This is the log of the gce-pd-driver container:

I0219 10:32:06.282916       1 main.go:90] Sys info: NumCPU: 16 MAXPROC: 1
I0219 10:32:06.283058       1 main.go:95] Driver vendor version v1.8.16-gke.0
I0219 10:32:06.283255       1 mount_linux.go:208] Detected OS without systemd
I0219 10:32:06.285310       1 gce-pd-driver.go:96] Enabling volume access mode: SINGLE_NODE_WRITER
I0219 10:32:06.285322       1 gce-pd-driver.go:96] Enabling volume access mode: MULTI_NODE_READER_ONLY
I0219 10:32:06.285325       1 gce-pd-driver.go:96] Enabling volume access mode: MULTI_NODE_MULTI_WRITER
I0219 10:32:06.285328       1 gce-pd-driver.go:106] Enabling controller service capability: CREATE_DELETE_VOLUME
I0219 10:32:06.285332       1 gce-pd-driver.go:106] Enabling controller service capability: PUBLISH_UNPUBLISH_VOLUME
I0219 10:32:06.285334       1 gce-pd-driver.go:106] Enabling controller service capability: CREATE_DELETE_SNAPSHOT
I0219 10:32:06.285337       1 gce-pd-driver.go:106] Enabling controller service capability: LIST_SNAPSHOTS
I0219 10:32:06.285339       1 gce-pd-driver.go:106] Enabling controller service capability: PUBLISH_READONLY
I0219 10:32:06.285342       1 gce-pd-driver.go:106] Enabling controller service capability: EXPAND_VOLUME
I0219 10:32:06.285345       1 gce-pd-driver.go:106] Enabling controller service capability: LIST_VOLUMES
I0219 10:32:06.285352       1 gce-pd-driver.go:106] Enabling controller service capability: LIST_VOLUMES_PUBLISHED_NODES
I0219 10:32:06.285357       1 gce-pd-driver.go:106] Enabling controller service capability: CLONE_VOLUME
I0219 10:32:06.285360       1 gce-pd-driver.go:116] Enabling node service capability: STAGE_UNSTAGE_VOLUME
I0219 10:32:06.285363       1 gce-pd-driver.go:116] Enabling node service capability: EXPAND_VOLUME
I0219 10:32:06.285366       1 gce-pd-driver.go:116] Enabling node service capability: GET_VOLUME_STATS
I0219 10:32:06.285374       1 gce-pd-driver.go:167] Driver: pd.csi.storage.gke.io
I0219 10:32:06.285409       1 server.go:106] Start listening with scheme unix, addr /csi/csi.sock
I0219 10:32:06.285640       1 server.go:125] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0219 10:32:07.354375       1 utils.go:66] /csi.v1.Identity/GetPluginInfo called with request:
I0219 10:32:07.354472       1 utils.go:75] /csi.v1.Identity/GetPluginInfo returned with response: name:"pd.csi.storage.gke.io" vendor_version:"v1.8.16-gke.0"
I0219 10:32:07.997047       1 utils.go:66] /csi.v1.Node/NodeGetInfo called with request:
I0219 10:32:07.997153       1 utils.go:75] /csi.v1.Node/NodeGetInfo returned with response: node_id:"projects/platform-prod-f112f5ae/zones/asia-southeast1-b/instances/gke-platform-prod-sg-prod-node-pool-ca34813d-jmf9" max_volumes_per_node:127 accessible_topology:<segments:<key:"topology.gke.io/zone" value:"asia-southeast1-b" > >
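
The socket file only comes into existence at the "Start listening" line above (10:32:06), three seconds after the registrar's deadline had already expired at 10:32:03. A minimal Go sketch of that server side follows, again as an illustration under the same assumed socket path rather than the actual gce-pd-driver code.

package main

import (
	"log"
	"net"
	"os"

	"google.golang.org/grpc"
)

func main() {
	// Assumed socket path, taken from the log above.
	const sockPath = "/csi/csi.sock"
	_ = os.Remove(sockPath) // remove a stale socket left over from a previous run

	// Only once net.Listen succeeds does the socket file exist for the
	// registrar to connect to; anything the registrar tries before this
	// point blocks or fails.
	lis, err := net.Listen("unix", sockPath)
	if err != nil {
		log.Fatalf("failed to listen on %s: %v", sockPath, err)
	}

	srv := grpc.NewServer()
	// The real driver registers the CSI Identity/Controller/Node services
	// here before serving; that part is omitted in this sketch.
	log.Printf("Listening for connections on address: %v", lis.Addr())
	if err := srv.Serve(lis); err != nil {
		log.Fatalf("gRPC server error: %v", err)
	}
}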

This is the pod status; note that the registrar's lastState shows it exited with code 1 at 10:32:03, three seconds before gce-pd-driver began listening on the socket at 10:32:06:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-02-19T10:31:22Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-02-19T10:32:08Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-02-19T10:32:08Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-02-19T10:31:22Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://be82a448e9c769a6b6528b0b0aa74aecc29cc26991d8a901ec60c5522f914417
    image: sha256:93a8638b2f16576ce0312a90508183f8176fb4c047cfa3c4fadf82dd909fd136
    imageID: gke.gcr.io/csi-node-driver-registrar@sha256:cce2f1e7860a0996711f55c4bab363f0a1a7f3ef7083017471c08e65d9c9c6e1
    lastState:
      terminated:
        containerID: containerd://0d7b5b05876edc763372039cfb5a3932d70a399ee037044667efa81943abbf0d
        exitCode: 1
        finishedAt: "2024-02-19T10:32:03Z"
        reason: Error
        startedAt: "2024-02-19T10:31:33Z"
    name: csi-driver-registrar
    ready: true
    restartCount: 1
    started: true
    state:
      running:
        startedAt: "2024-02-19T10:32:07Z"
  - containerID: containerd://f7bbaf294d1ee2bbdbbb5cc774389be7be947f13528369ce42f263f9d77dc347
    image: sha256:238d43027da3411f1c9177be11d4e7b73e6f2dcaed31cbee69172386314e3899
    imageID: gke.gcr.io/gcp-compute-persistent-disk-csi-driver@sha256:fde2f273b43b9a5e075276a1b33a4199468dd7e4475e107e9a06a2f049e7a2cd
    lastState: {}
    name: gce-pd-driver
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2024-02-19T10:32:06Z"
  hostIP: 10.72.6.130
  phase: Running
  podIP: 10.72.6.130
  podIPs:
  - ip: 10.72.6.130
  qosClass: Burstable
  startTime: "2024-02-19T10:31:22Z"

awx-fuyuanchu · Feb 23 '24 16:02

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · May 23 '24 17:05

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · Jun 22 '24 17:06

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot · Jul 22 '24 17:07

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · Jul 22 '24 17:07

/reopen

We are experiencing the same issue. When a new node is added, the pdcsi pod is restarted.

Nickgoslinga · Oct 31 '24 16:10

/reopen

Still happening on GKE version 1.30.8

MrCoffey · Jan 22 '25 02:01

@MrCoffey: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Still happening on GKE version 1.30.8

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · Jan 22 '25 02:01

Even if the registrar restarts, it will eventually reconcile, so this doesn't seem like a priority to solve.

mattcary · Jan 22 '25 19:01