kubernetes-elastic-agents

BUG | File Descriptor leak issue in v4.1.0-643

Open Sathyam-Hotstar opened this issue 8 months ago • 1 comments

We have observed a bug in v4.1.0-643 where the plugin does not close the file descriptors that it opens as sockets. This exhausts the open files limit of gocd-server, and the server becomes unresponsive.

To mitigate, we downgraded to v4.0.0-505, which fixed the issue: the file descriptor count stayed stable throughout, rather than following the linearly increasing trend seen with the v4.1.0-643 plugin.

Sharing Observations for FD leak issue in v4.1.0-643:

  • The plugin regularly creates eventfd and eventpoll anonymous inodes.
  • The plugin creates a large number of vert.x-eventloop threads and does not release them even when the server is idle, leading to a continual increase in CPU and memory utilisation.
  • With v4.0.0-505 we do not see vert.x-eventloop threads, and file descriptors are both created and released.
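
For anyone wanting to check the thread observation on their own server, here is a minimal Python sketch that counts threads per name prefix in a JVM thread dump (for example, text saved from `jstack <pid>`). It assumes the usual HotSpot dump layout, where each thread header line starts with the thread name in double quotes. The function name and the sample dump are illustrative and not taken from this issue.

```python
import re
from collections import Counter

# Thread header lines in a HotSpot thread dump start with the quoted thread name.
THREAD_HEADER = re.compile(r'^"([^"]+)"')

# Illustrative sample only; not output captured from the server in this issue.
SAMPLE_DUMP = '''\
"main" #1 prio=5 os_prio=0 runnable
"vert.x-eventloop-thread-0" #41 prio=5 os_prio=0 runnable
"vert.x-eventloop-thread-1" #42 prio=5 os_prio=0 runnable
'''


def count_threads_by_prefix(dump_text: str) -> Counter:
    """Count threads per name, with any trailing numeric suffix removed."""
    counts = Counter()
    for line in dump_text.splitlines():
        match = THREAD_HEADER.match(line)
        if match:
            # "vert.x-eventloop-thread-12" becomes "vert.x-eventloop-thread"
            prefix = re.sub(r"[-#\s]*\d+$", "", match.group(1))
            counts[prefix] += 1
    return counts


if __name__ == "__main__":
    for name, count in count_threads_by_prefix(SAMPLE_DUMP).most_common():
        print(f"{count:5d}  {name}")
```

Comparing the counts from two dumps taken some time apart on an idle server should show whether the vert.x-eventloop group keeps growing.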

GoCD Server Logs:

2025-03-29 03:30:16,273 INFO  [151@MessageListener for MaterialUpdateListener] MaterialDatabaseUpdater:124 - [Material Update] Modification check failed for material: [[email protected]:ORG/REPO.git, path=gocd-agent-python3-eks] cause: java.lang.RuntimeException: The plugin sent a response that could not be understood by Go. Plugin returned with code '500' and the following response: '"latest-revisions-since failed due to [git failed: Cannot run program \"git\" (in directory \".\"): error=24, Too many open files (git clone --branch=master --no-checkout [email protected]:ORG/REPO.git /go-working-dir/pipelines/flyweight/0f3f009d-c29a-4ee1-b8df-bc144978d824)], root cause [IOException: error=24, Too many open files]"'
2025-03-29 03:30:16,277 INFO  [156@MessageListener for MaterialUpdateListener] MaterialDatabaseUpdater:124 - [Material Update] Modification check failed for material: [[email protected]:ORG/REPO.git, path=gocd-agent-infra-eks] cause: java.lang.RuntimeException: The plugin sent a response that could not be understood by Go. Plugin returned with code '500' and the following response: '"latest-revisions-since failed due to [git failed: Cannot run program \"git\" (in directory \".\"): error=24, Too many open files (git clone --branch=master --no-checkout [email protected]:ORG/REPO.git /go-working-dir/pipelines/flyweight/a79e628a-e44a-4e31-9898-2930fed6907e)], root cause [IOException: error=24, Too many open files]"'
2025-03-29 03:30:16,282 INFO  [153@MessageListener for MaterialUpdateListener] MaterialDatabaseUpdater:124 - [Material Update] Modification check failed for material: [[email protected]:ORG/REPO.git, path=gocd-agent-terraform-eks] cause: java.lang.RuntimeException: The plugin sent a response that could not be understood by Go. Plugin returned with code '500' and the following response: '"latest-revisions-since failed due to [git failed: Cannot run program \"git\" (in directory \".\"): error=24, Too many open files (git clone --branch=master --no-checkout [email protected]:ORG/REPO.git /go-working-dir/pipelines/flyweight/f5528b96-ffca-48a3-ab29-b0b0fec7ddb4)], root cause [IOException: error=24, Too many open files]"'

List of open files from the gocd server pod:

246	/gocd-jre/bin/java	876	anon_inode:[eventpoll]
246	/gocd-jre/bin/java	877	anon_inode:[eventfd]
246	/gocd-jre/bin/java	878	anon_inode:[eventpoll]
246	/gocd-jre/bin/java	879	anon_inode:[eventpoll]
246	/gocd-jre/bin/java	880	anon_inode:[eventfd]
246	/gocd-jre/bin/java	881	anon_inode:[eventfd]
246	/gocd-jre/bin/java	882	anon_inode:[eventpoll]
246	/gocd-jre/bin/java	883	anon_inode:[eventfd]
246	/gocd-jre/bin/java	884	anon_inode:[eventpoll]
246	/gocd-jre/bin/java	885	anon_inode:[eventfd]
246	/gocd-jre/bin/java	886	anon_inode:[eventpoll]
246	/gocd-jre/bin/java	887	anon_inode:[eventfd]
246	/gocd-jre/bin/java	888	anon_inode:[eventpoll]
246	/gocd-jre/bin/java	889	anon_inode:[eventfd]
246	/gocd-jre/bin/java	890	anon_inode:[eventpoll]
246	/gocd-jre/bin/java	891	anon_inode:[eventfd]
246	/gocd-jre/bin/java	892	anon_inode:[eventpoll]
246	/gocd-jre/bin/java	893	anon_inode:[eventfd]
246	/gocd-jre/bin/java	894	anon_inode:[eventpoll]
246	/gocd-jre/bin/java	895	anon_inode:[eventfd]
246	/gocd-jre/bin/java	896	anon_inode:[eventpoll]
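
A listing like the one above can be summarised by descriptor target to make the leak pattern easier to see. The following is a small Python sketch that assumes the four whitespace-separated columns shown here (PID, command, FD number, target); the function name is hypothetical.

```python
from collections import Counter


def tally_fd_targets(listing: str) -> Counter:
    """Count open file descriptors per target in a 'pid command fd target' listing."""
    counts = Counter()
    for line in listing.splitlines():
        # Split into at most four fields so targets containing spaces stay intact.
        fields = line.split(None, 3)
        if len(fields) == 4 and fields[2].isdigit():
            counts[fields[3].strip()] += 1
    return counts
```

Applied to the 21 lines above, this gives 11 `anon_inode:[eventpoll]` and 10 `anon_inode:[eventfd]` entries. Running it on listings captured at intervals would show which descriptor types account for the growth.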

Sathyam-Hotstar avatar Mar 31 '25 19:03 Sathyam-Hotstar

Thanks. This must be due to some change in the way the underlying Kubernetes client library works, as there was a major version change (v6 to v7) between these two plugin versions.

chadlwilson avatar Mar 31 '25 21:03 chadlwilson

Should be resolved with https://github.com/gocd/kubernetes-elastic-agents/releases/tag/v4.1.1-661

The way the plugin cached cluster connections has always been rather dodgy, but this became much worse with the newer Kubernetes client library. The effect would be worst if you're in a situation with many different cluster profiles, since the plugin only attempted to cache a single client at a time (rather than a pool), so it would keep creating new clients and leaking resources from the old ones.

chadlwilson avatar Apr 05 '25 08:04 chadlwilson
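
To illustrate the difference between caching a single client and keeping a pool, here is a minimal Python sketch of a keyed client cache that closes clients when it evicts them. It is not the plugin's code (the plugin is written in Java, and the actual fix in v4.1.1-661 may work differently); `FakeClient`, `ClientPool`, and `max_size` are hypothetical names used only to show the idea.

```python
import threading
from collections import OrderedDict


class FakeClient:
    """Hypothetical stand-in for a cluster client that owns sockets and threads."""

    def __init__(self, profile_key):
        self.profile_key = profile_key
        self.closed = False

    def close(self):
        self.closed = True


class ClientPool:
    """Keep one client per cluster profile; close the least recently used on eviction."""

    def __init__(self, factory, max_size=8):
        self._factory = factory      # callable: profile_key -> client
        self._max_size = max_size
        self._clients = OrderedDict()
        self._lock = threading.Lock()

    def get(self, profile_key):
        with self._lock:
            client = self._clients.get(profile_key)
            if client is not None:
                # Cache hit: mark as most recently used and reuse it.
                self._clients.move_to_end(profile_key)
                return client
            client = self._factory(profile_key)
            self._clients[profile_key] = client
            while len(self._clients) > self._max_size:
                # Evicted clients are closed so their resources are released.
                _, evicted = self._clients.popitem(last=False)
                evicted.close()
            return client

    def close_all(self):
        with self._lock:
            while self._clients:
                _, client = self._clients.popitem()
                client.close()
```

With a single cached slot, alternating between two cluster profiles would create a new client on every switch; if the replaced client is never closed, each switch leaks its descriptors and threads. Keying the cache by profile and closing on eviction avoids both the churn and the leak in this simplified model.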