aws-ebs-csi-driver

Controller pod volumemodifier container sometimes pegs CPU

Open danports opened this issue 3 months ago • 4 comments

/kind bug

What happened? Occasionally the volumemodifier container in one of the EBS CSI controller pods starts consuming all of the CPU it can get (stuck in some kind of busy wait, perhaps?). Here's a chart showing CPU usage for volumemodifier containers in a cluster over 3 days; the nodes have only 2 vCPUs, so the container is essentially using all of the CPU time available:

[chart: CPU usage of volumemodifier containers over 3 days]

You can see from the chart that the lease likely switched from one container to another a couple of times, but the CPU pegging continued regardless of which pod held it.
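In case it helps with triage, here's roughly how I check per-container CPU and which replica holds the lease. This assumes the Helm chart defaults (kube-system namespace, app=ebs-csi-controller label); adjust for other installs.

# Per-container CPU for the controller pods (requires metrics-server)
kubectl top pod -n kube-system -l app=ebs-csi-controller --containers

# Leases in the namespace; the HOLDER column shows the active replica
kubectl get lease -n kube-system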

What you expected to happen? The container should use a minuscule amount of CPU time, as it usually does.

How to reproduce it (as minimally and precisely as possible)? Good question! I've only noticed this problem fairly recently - I can try to track down when it started.

Anything else we need to know?: I have two replicas of the controller running, but usually only one of them has a volumemodifier container running out of control, likely the one holding the active lease. The logs for that container look like this:

2024-05-04 17:44:13.668	I0504 21:44:13.668099       1 reflector.go:351] Caches populated for *v1.PersistentVolumeClaim from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229
2024-05-04 17:44:13.666	I0504 21:44:13.665958       1 reflector.go:325] Listing and watching *v1.PersistentVolumeClaim from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229
2024-05-04 17:44:13.529	I0504 21:44:13.529067       1 reflector.go:351] Caches populated for *v1.PersistentVolume from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229
2024-05-04 17:44:13.526	I0504 21:44:13.526610       1 reflector.go:325] Listing and watching *v1.PersistentVolume from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229
2024-05-04 17:44:13.249	I0504 21:44:13.249540       1 reflector.go:351] Caches populated for *v1.Lease from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229
2024-05-04 17:44:13.229	I0504 21:44:13.228949       1 reflector.go:325] Listing and watching *v1.Lease from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229
2024-05-04 17:44:12.131	W0504 21:44:12.131235       1 reflector.go:462] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.Lease ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
2024-05-04 17:44:12.131	W0504 21:44:12.130947       1 reflector.go:462] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.PersistentVolumeClaim ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
2024-05-04 17:44:12.131	W0504 21:44:12.130939       1 reflector.go:462] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.PersistentVolume ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
2024-05-04 17:42:35.438	I0504 21:42:35.438669       1 reflector.go:800] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Watch close - *v1.PersistentVolumeClaim total 0 items received
2024-05-04 17:38:53.666	I0504 21:38:53.666134       1 reflector.go:800] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Watch close - *v1.PersistentVolume total 2 items received
2024-05-04 17:37:14.433	I0504 21:37:14.433435       1 reflector.go:800] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Watch close - *v1.PersistentVolumeClaim total 5 items received
2024-05-04 17:35:48.458	I0504 21:35:48.458834       1 reflector.go:800] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Watch close - *v1.Lease total 1247 items received
2024-05-04 17:35:18.092	I0504 21:35:18.092294       1 reflector.go:377] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: forcing resync
2024-05-04 17:35:18.054	I0504 21:35:18.054127       1 reflector.go:377] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: forcing resync
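
For reference, the excerpt above was pulled with something like this (pod name is a placeholder; the container name matches the chart's volumemodifier container):

kubectl logs -n kube-system <ebs-csi-controller-pod> -c volumemodifier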

It seems like killing the CPU-hogging pod is sufficient to resolve the problem until it recurs.
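For anyone else hitting this, the workaround is simply deleting the affected pod and letting the Deployment replace it (pod name is a placeholder):

kubectl delete pod -n kube-system <ebs-csi-controller-pod>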

Environment

  • Kubernetes version (use kubectl version): 1.29.4
  • Driver version: volumemodifier container is running public.ecr.aws/ebs-csi-driver/volume-modifier-for-k8s:v0.2.1@sha256:78c116f223997fa8d074846bf10e1a08cc0b723dc03c7a20900685442b5a3504
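For completeness, the image reference above came from something like the following; the label and container names assume the Helm chart defaults and may differ in other installs.

kubectl get pods -n kube-system -l app=ebs-csi-controller \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[?(@.name=="volumemodifier")].image}{"\n"}{end}'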

danports · May 04 '24 21:05