gcp-compute-persistent-disk-csi-driver
gcp-compute-persistent-disk-csi-driver copied to clipboard
Container csi-driver-registrar restarts when gce-pd-driver starts slower
We've met an issue on GKE when a new node is scaled up by the cluster-autoscaler.
The container csi-driver-registrar
in the pod pdcsi-node
could be restarted if gce-pd-driver
starts slower. The csi-driver-registrar fails to call the csi driver to get the driver info since the container gce-pd-driver
is not ready.
This is the log before csi-driver-registrar
restarts
Defaulted container "csi-driver-registrar" out of: csi-driver-registrar, gce-pd-driver
I0219 10:31:33.712910 1 main.go:135] Version: v2.9.0-gke.3-0-g098349f5
I0219 10:31:33.712983 1 main.go:136] Running node-driver-registrar in mode=
I0219 10:31:33.712988 1 main.go:157] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0219 10:31:33.713005 1 connection.go:213] Connecting to unix:///csi/csi.sock
W0219 10:31:43.714057 1 connection.go:232] Still connecting to unix:///csi/csi.sock
W0219 10:31:53.713714 1 connection.go:232] Still connecting to unix:///csi/csi.sock
W0219 10:32:03.714076 1 connection.go:232] Still connecting to unix:///csi/csi.sock
E0219 10:32:03.714104 1 main.go:160] error connecting to CSI driver: context deadline exceeded
This is the log of container gce-pd-driver
I0219 10:32:06.282916 1 main.go:90] Sys info: NumCPU: 16 MAXPROC: 1
I0219 10:32:06.283058 1 main.go:95] Driver vendor version v1.8.16-gke.0
I0219 10:32:06.283255 1 mount_linux.go:208] Detected OS without systemd
I0219 10:32:06.285310 1 gce-pd-driver.go:96] Enabling volume access mode: SINGLE_NODE_WRITER
I0219 10:32:06.285322 1 gce-pd-driver.go:96] Enabling volume access mode: MULTI_NODE_READER_ONLY
I0219 10:32:06.285325 1 gce-pd-driver.go:96] Enabling volume access mode: MULTI_NODE_MULTI_WRITER
I0219 10:32:06.285328 1 gce-pd-driver.go:106] Enabling controller service capability: CREATE_DELETE_VOLUME
I0219 10:32:06.285332 1 gce-pd-driver.go:106] Enabling controller service capability: PUBLISH_UNPUBLISH_VOLUME
I0219 10:32:06.285334 1 gce-pd-driver.go:106] Enabling controller service capability: CREATE_DELETE_SNAPSHOT
I0219 10:32:06.285337 1 gce-pd-driver.go:106] Enabling controller service capability: LIST_SNAPSHOTS
I0219 10:32:06.285339 1 gce-pd-driver.go:106] Enabling controller service capability: PUBLISH_READONLY
I0219 10:32:06.285342 1 gce-pd-driver.go:106] Enabling controller service capability: EXPAND_VOLUME
I0219 10:32:06.285345 1 gce-pd-driver.go:106] Enabling controller service capability: LIST_VOLUMES
I0219 10:32:06.285352 1 gce-pd-driver.go:106] Enabling controller service capability: LIST_VOLUMES_PUBLISHED_NODES
I0219 10:32:06.285357 1 gce-pd-driver.go:106] Enabling controller service capability: CLONE_VOLUME
I0219 10:32:06.285360 1 gce-pd-driver.go:116] Enabling node service capability: STAGE_UNSTAGE_VOLUME
I0219 10:32:06.285363 1 gce-pd-driver.go:116] Enabling node service capability: EXPAND_VOLUME
I0219 10:32:06.285366 1 gce-pd-driver.go:116] Enabling node service capability: GET_VOLUME_STATS
I0219 10:32:06.285374 1 gce-pd-driver.go:167] Driver: pd.csi.storage.gke.io
I0219 10:32:06.285409 1 server.go:106] Start listening with scheme unix, addr /csi/csi.sock
I0219 10:32:06.285640 1 server.go:125] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0219 10:32:07.354375 1 utils.go:66] /csi.v1.Identity/GetPluginInfo called with request:
I0219 10:32:07.354472 1 utils.go:75] /csi.v1.Identity/GetPluginInfo returned with response: name:"pd.csi.storage.gke.io" vendor_version:"v1.8.16-gke.0"
I0219 10:32:07.997047 1 utils.go:66] /csi.v1.Node/NodeGetInfo called with request:
I0219 10:32:07.997153 1 utils.go:75] /csi.v1.Node/NodeGetInfo returned with response: node_id:"projects/platform-prod-f112f5ae/zones/asia-southeast1-b/instances/gke-platform-prod-sg-prod-node-pool-ca34813d-jmf9" max_volumes_per_node:127 accessible_topology:<segments:<key:"topology.gke.io/zone" value:"asia-southeast1-b" > >
This is the pod status
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2024-02-19T10:31:22Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2024-02-19T10:32:08Z"
status: "True"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2024-02-19T10:32:08Z"
status: "True"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2024-02-19T10:31:22Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: containerd://be82a448e9c769a6b6528b0b0aa74aecc29cc26991d8a901ec60c5522f914417
image: sha256:93a8638b2f16576ce0312a90508183f8176fb4c047cfa3c4fadf82dd909fd136
imageID: gke.gcr.io/csi-node-driver-registrar@sha256:cce2f1e7860a0996711f55c4bab363f0a1a7f3ef7083017471c08e65d9c9c6e1
lastState:
terminated:
containerID: containerd://0d7b5b05876edc763372039cfb5a3932d70a399ee037044667efa81943abbf0d
exitCode: 1
finishedAt: "2024-02-19T10:32:03Z"
reason: Error
startedAt: "2024-02-19T10:31:33Z"
name: csi-driver-registrar
ready: true
restartCount: 1
started: true
state:
running:
startedAt: "2024-02-19T10:32:07Z"
- containerID: containerd://f7bbaf294d1ee2bbdbbb5cc774389be7be947f13528369ce42f263f9d77dc347
image: sha256:238d43027da3411f1c9177be11d4e7b73e6f2dcaed31cbee69172386314e3899
imageID: gke.gcr.io/gcp-compute-persistent-disk-csi-driver@sha256:fde2f273b43b9a5e075276a1b33a4199468dd7e4475e107e9a06a2f049e7a2cd
lastState: {}
name: gce-pd-driver
ready: true
restartCount: 0
started: true
state:
running:
startedAt: "2024-02-19T10:32:06Z"
hostIP: 10.72.6.130
phase: Running
podIP: 10.72.6.130
podIPs:
- ip: 10.72.6.130
qosClass: Burstable
startTime: "2024-02-19T10:31:22Z"