gcp-compute-persistent-disk-csi-driver
Container csi-driver-registrar restarts when gce-pd-driver starts slowly
We've encountered an issue on GKE when a new node is scaled up by the cluster-autoscaler.
The csi-driver-registrar container in the pdcsi-node pod can be restarted if gce-pd-driver starts slowly: the registrar fails to call the CSI driver to get the driver info because the gce-pd-driver container is not ready yet.
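For context, the registrar's startup is essentially a blocking gRPC dial against the driver's Unix socket with a deadline. Here is a minimal Go sketch of that failure mode; the socket path comes from the log below, the 30-second deadline is inferred from the timestamps, and the rest is illustrative rather than the registrar's actual code:

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// The registrar dials the driver's Unix socket and blocks until the
	// driver is listening. If gce-pd-driver has not created /csi/csi.sock
	// yet, the dial never completes.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, "unix:///csi/csi.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock(), // block until the connection is established
	)
	if err != nil {
		// This is the failure in the log: the deadline elapses before the
		// driver starts listening, the container exits non-zero, and
		// kubelet restarts it.
		log.Fatalf("error connecting to CSI driver: %v", err)
	}
	defer conn.Close()
	log.Println("connected to CSI driver")
}
```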
This is the log of csi-driver-registrar before it restarts:
Defaulted container "csi-driver-registrar" out of: csi-driver-registrar, gce-pd-driver
I0219 10:31:33.712910 1 main.go:135] Version: v2.9.0-gke.3-0-g098349f5
I0219 10:31:33.712983 1 main.go:136] Running node-driver-registrar in mode=
I0219 10:31:33.712988 1 main.go:157] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0219 10:31:33.713005 1 connection.go:213] Connecting to unix:///csi/csi.sock
W0219 10:31:43.714057 1 connection.go:232] Still connecting to unix:///csi/csi.sock
W0219 10:31:53.713714 1 connection.go:232] Still connecting to unix:///csi/csi.sock
W0219 10:32:03.714076 1 connection.go:232] Still connecting to unix:///csi/csi.sock
E0219 10:32:03.714104 1 main.go:160] error connecting to CSI driver: context deadline exceeded
This is the log of the gce-pd-driver container:
I0219 10:32:06.282916 1 main.go:90] Sys info: NumCPU: 16 MAXPROC: 1
I0219 10:32:06.283058 1 main.go:95] Driver vendor version v1.8.16-gke.0
I0219 10:32:06.283255 1 mount_linux.go:208] Detected OS without systemd
I0219 10:32:06.285310 1 gce-pd-driver.go:96] Enabling volume access mode: SINGLE_NODE_WRITER
I0219 10:32:06.285322 1 gce-pd-driver.go:96] Enabling volume access mode: MULTI_NODE_READER_ONLY
I0219 10:32:06.285325 1 gce-pd-driver.go:96] Enabling volume access mode: MULTI_NODE_MULTI_WRITER
I0219 10:32:06.285328 1 gce-pd-driver.go:106] Enabling controller service capability: CREATE_DELETE_VOLUME
I0219 10:32:06.285332 1 gce-pd-driver.go:106] Enabling controller service capability: PUBLISH_UNPUBLISH_VOLUME
I0219 10:32:06.285334 1 gce-pd-driver.go:106] Enabling controller service capability: CREATE_DELETE_SNAPSHOT
I0219 10:32:06.285337 1 gce-pd-driver.go:106] Enabling controller service capability: LIST_SNAPSHOTS
I0219 10:32:06.285339 1 gce-pd-driver.go:106] Enabling controller service capability: PUBLISH_READONLY
I0219 10:32:06.285342 1 gce-pd-driver.go:106] Enabling controller service capability: EXPAND_VOLUME
I0219 10:32:06.285345 1 gce-pd-driver.go:106] Enabling controller service capability: LIST_VOLUMES
I0219 10:32:06.285352 1 gce-pd-driver.go:106] Enabling controller service capability: LIST_VOLUMES_PUBLISHED_NODES
I0219 10:32:06.285357 1 gce-pd-driver.go:106] Enabling controller service capability: CLONE_VOLUME
I0219 10:32:06.285360 1 gce-pd-driver.go:116] Enabling node service capability: STAGE_UNSTAGE_VOLUME
I0219 10:32:06.285363 1 gce-pd-driver.go:116] Enabling node service capability: EXPAND_VOLUME
I0219 10:32:06.285366 1 gce-pd-driver.go:116] Enabling node service capability: GET_VOLUME_STATS
I0219 10:32:06.285374 1 gce-pd-driver.go:167] Driver: pd.csi.storage.gke.io
I0219 10:32:06.285409 1 server.go:106] Start listening with scheme unix, addr /csi/csi.sock
I0219 10:32:06.285640 1 server.go:125] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0219 10:32:07.354375 1 utils.go:66] /csi.v1.Identity/GetPluginInfo called with request:
I0219 10:32:07.354472 1 utils.go:75] /csi.v1.Identity/GetPluginInfo returned with response: name:"pd.csi.storage.gke.io" vendor_version:"v1.8.16-gke.0"
I0219 10:32:07.997047 1 utils.go:66] /csi.v1.Node/NodeGetInfo called with request:
I0219 10:32:07.997153 1 utils.go:75] /csi.v1.Node/NodeGetInfo returned with response: node_id:"projects/platform-prod-f112f5ae/zones/asia-southeast1-b/instances/gke-platform-prod-sg-prod-node-pool-ca34813d-jmf9" max_volumes_per_node:127 accessible_topology:<segments:<key:"topology.gke.io/zone" value:"asia-southeast1-b" > >
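Once gce-pd-driver is listening, the registrar's GetPluginInfo call succeeds, as the last four log lines show. A minimal sketch of the Identity-service side of that exchange, assuming the container-storage-interface Go bindings (the handler bodies are illustrative, not the driver's actual code):

```go
package main

import (
	"context"
	"log"
	"net"
	"os"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
)

type identityServer struct{}

// GetPluginInfo is the first call node-driver-registrar makes once the
// socket is up (see /csi.v1.Identity/GetPluginInfo in the log above).
func (s *identityServer) GetPluginInfo(ctx context.Context, req *csi.GetPluginInfoRequest) (*csi.GetPluginInfoResponse, error) {
	return &csi.GetPluginInfoResponse{
		Name:          "pd.csi.storage.gke.io",
		VendorVersion: "v1.8.16-gke.0",
	}, nil
}

func (s *identityServer) GetPluginCapabilities(ctx context.Context, req *csi.GetPluginCapabilitiesRequest) (*csi.GetPluginCapabilitiesResponse, error) {
	return &csi.GetPluginCapabilitiesResponse{}, nil
}

func (s *identityServer) Probe(ctx context.Context, req *csi.ProbeRequest) (*csi.ProbeResponse, error) {
	return &csi.ProbeResponse{}, nil
}

func main() {
	// The registrar can only connect after this Listen succeeds; any
	// delay before this point is the window in which it times out.
	const addr = "/csi/csi.sock"
	_ = os.Remove(addr) // clear a stale socket from a previous run
	lis, err := net.Listen("unix", addr)
	if err != nil {
		log.Fatalf("failed to listen: %v", err)
	}
	srv := grpc.NewServer()
	csi.RegisterIdentityServer(srv, &identityServer{})
	log.Printf("Listening for connections on address: %v", lis.Addr())
	if err := srv.Serve(lis); err != nil {
		log.Fatalf("serve: %v", err)
	}
}
```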
This is the pod status:
```yaml
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-02-19T10:31:22Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-02-19T10:32:08Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-02-19T10:32:08Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-02-19T10:31:22Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://be82a448e9c769a6b6528b0b0aa74aecc29cc26991d8a901ec60c5522f914417
    image: sha256:93a8638b2f16576ce0312a90508183f8176fb4c047cfa3c4fadf82dd909fd136
    imageID: gke.gcr.io/csi-node-driver-registrar@sha256:cce2f1e7860a0996711f55c4bab363f0a1a7f3ef7083017471c08e65d9c9c6e1
    lastState:
      terminated:
        containerID: containerd://0d7b5b05876edc763372039cfb5a3932d70a399ee037044667efa81943abbf0d
        exitCode: 1
        finishedAt: "2024-02-19T10:32:03Z"
        reason: Error
        startedAt: "2024-02-19T10:31:33Z"
    name: csi-driver-registrar
    ready: true
    restartCount: 1
    started: true
    state:
      running:
        startedAt: "2024-02-19T10:32:07Z"
  - containerID: containerd://f7bbaf294d1ee2bbdbbb5cc774389be7be947f13528369ce42f263f9d77dc347
    image: sha256:238d43027da3411f1c9177be11d4e7b73e6f2dcaed31cbee69172386314e3899
    imageID: gke.gcr.io/gcp-compute-persistent-disk-csi-driver@sha256:fde2f273b43b9a5e075276a1b33a4199468dd7e4475e107e9a06a2f049e7a2cd
    lastState: {}
    name: gce-pd-driver
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2024-02-19T10:32:06Z"
  hostIP: 10.72.6.130
  phase: Running
  podIP: 10.72.6.130
  podIPs:
  - ip: 10.72.6.130
  qosClass: Burstable
  startTime: "2024-02-19T10:31:22Z"
```
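Note the timeline in the status: the registrar's first run exited with code 1 at 10:32:03, three seconds before the driver began listening at 10:32:06, and its replacement started at 10:32:07 and connected immediately. To spot affected nodes, something like the following client-go sketch could list registrar restarts; the kube-system namespace and the label selector are assumptions and may differ by GKE version:

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (illustrative setup).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// ASSUMPTION: pdcsi-node pods live in kube-system with this label;
	// adjust the selector to whatever your GKE version uses.
	pods, err := cs.CoreV1().Pods("kube-system").List(context.Background(),
		metav1.ListOptions{LabelSelector: "k8s-app=gcp-compute-persistent-disk-csi-driver"})
	if err != nil {
		log.Fatal(err)
	}
	for _, p := range pods.Items {
		for _, s := range p.Status.ContainerStatuses {
			if s.Name == "csi-driver-registrar" && s.RestartCount > 0 {
				fmt.Printf("%s/%s: registrar restarted %d time(s)\n",
					p.Spec.NodeName, p.Name, s.RestartCount)
			}
		}
	}
}
```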
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/reopen
We are experiencing the same issue: when a new node is added, the registrar container in the pdcsi-node pod restarts.
/reopen
Still happening on GKE version 1.30.8
@MrCoffey: You can't reopen an issue/PR unless you authored it or you are a collaborator.
In response to this:
/reopen
Still happening on GKE version 1.30.8
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Even if the registrar restarts, it will eventually reconcile, so this doesn't seem like a priority to solve.
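For what it's worth, the reconciliation happens because kubelet restarts the failed container with exponential backoff (10s, doubling up to a 5m cap), and each new registrar run retries the dial until the driver's socket exists. A rough plain-Go sketch of that behavior (not kubelet's actual code):

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// dialOnce mimics one registrar run: a blocking dial with a deadline.
func dialOnce(timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	conn, err := grpc.DialContext(ctx, "unix:///csi/csi.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock())
	if err != nil {
		return err
	}
	return conn.Close()
}

func main() {
	// Each failed attempt stands in for one container restart; kubelet
	// waits with exponential backoff between restarts.
	backoff := 10 * time.Second
	for {
		if err := dialOnce(30 * time.Second); err != nil {
			log.Printf("dial failed: %v; retrying in %v", err, backoff)
			time.Sleep(backoff)
			if backoff *= 2; backoff > 5*time.Minute {
				backoff = 5 * time.Minute
			}
			continue
		}
		log.Println("connected; registration can proceed")
		return
	}
}
```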