Anton Lindholm
We get this every once in a while for a CronJob we are running, on vSphere 6.7 U3 with driver 2.2.1 on Kubernetes 1.20:

```
2021-06-22 15:08:19 E0622 12:08:19.031139 3413570 nestedpendingoperations.go:301]...
```
Happened again just now. Logs:

```
2021-07-05 18:05:28 I0705 15:05:28.023372 312924 operation_generator.go:565] MountVolume.WaitForAttach succeeded for volume "pvc-9dd26215-c898-4942-a9c5-8b218bd399e8" (UniqueName: "kubernetes.io/csi/csi.vsphere.vmware.com^73926cf3-8899-464e-a15f-25e9119fb1b9") pod "importer-daily-1625493660-w27sl" (UID: "c02445fc-ee87-4087-ae68-843760816a57") DevicePath "csi-3fbfa71984550731ed0b33a6863054fd0317e64677d964fd64e9a93f6ab414a8"
2021-07-05 18:05:27 I0705...
```
The pod had been stuck initializing for over an hour because its volumes were not mounted correctly.
Logs from the worker node where the job succeeded earlier:

```
2021-07-05 14:02:15 I0705 11:02:15.184802 2312548 reconciler.go:319] Volume detached for volume "pvc-9dd26215-c898-4942-a9c5-8b218bd399e8" (UniqueName: "kubernetes.io/csi/csi.vsphere.vmware.com^73926cf3-8899-464e-a15f-25e9119fb1b9") on node "prod-us-worker7" DevicePath "csi-31fd1cfccdcd792090abfb1c4ab05dd83673b5c7cd807ca4f2983a33d1e548f0"
```
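For anyone hitting the same thing, this is roughly how we confirm a pod is stuck on the mount; the pod and namespace names below are placeholders:

```
# Events on the stuck pod usually show FailedAttachVolume / FailedMount
kubectl -n <namespace> describe pod <stuck-pod>

# Which worker did the pod land on?
kubectl -n <namespace> get pod <stuck-pod> -o wide
```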
When this happens, a quick fix seems to be to cordon the worker where the workload fails to start, redeploy the workload, and then delete the stale VolumeAttachment; see the sketch below.
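Roughly the steps we run, using the names from the logs above (substitute your own node, pod, and attachment names):

```
# Keep new pods off the affected node
kubectl cordon prod-us-worker7

# Redeploy the workload (for our CronJob, delete the stuck pod so the job retries elsewhere)
kubectl -n <namespace> delete pod importer-daily-1625493660-w27sl

# Find and delete the VolumeAttachment still pinning the PV to the cordoned node
kubectl get volumeattachment | grep pvc-9dd26215
kubectl delete volumeattachment <attachment-name>

# Uncordon once the pod is running again
kubectl uncordon prod-us-worker7
```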
One thing a team member mentioned today was that this error only seems to affect certain worker nodes 🤔
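If anyone wants to check for the same pattern, listing attachments per node shows whether the stale ones cluster on specific workers (the column paths follow the standard VolumeAttachment spec):

```
kubectl get volumeattachments \
  -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,PV:.spec.source.persistentVolumeName,ATTACHED:.status.attached
```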
```
2021-07-05 18:05:25 time="2021-07-05T15:05:25Z" level=debug msg="/csi.v1.Node/NodeGetCapabilities: REQ 31276: XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
2021-07-05 18:05:25 time="2021-07-05T15:05:25Z" level=debug msg="/csi.v1.Node/NodeUnpublishVolume: REP 31275: rpc error: code = Internal desc...
```
Can do, though it needs to fail again first, since we "fixed" the failed pod by cordoning the node and restarting the job.
We just experienced this again for a deployment in another cluster. A quick fix is to cordon the node where the pod is failing to start up, delete the volume...
This is still happening in a cluster running driver 2.3.0. We will upgrade to 2.4.x now, and afterwards I will bump the log levels to see what is actually going on.
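For the log level bump, my plan is roughly the following; this assumes the 2.x manifests, where verbosity is controlled by the LOGGER_LEVEL env var (PRODUCTION vs. DEVELOPMENT), and that the driver runs in the vmware-system-csi namespace (older releases used kube-system):

```
# Switch the controller and the node plugin to debug logging
kubectl -n vmware-system-csi set env deployment/vsphere-csi-controller LOGGER_LEVEL=DEVELOPMENT
kubectl -n vmware-system-csi set env daemonset/vsphere-csi-node LOGGER_LEVEL=DEVELOPMENT
```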