kubernetes-elastic-agents
BUG | File Descriptor leak issue in v4.1.0-643
We have observed a bug in v4.1.0-643 where the plugin does not close the file descriptors that it opens as sockets. This exhausts the open-files limit of gocd-server, and the server becomes unresponsive.
To mitigate, we downgraded to v4.0.0-505, which fixed the issue: the file descriptor count stayed stable throughout, rather than increasing linearly as it did with the v4.1.0-643 plugin.
Observations on the FD leak in v4.1.0-643:
- The plugin regularly creates eventfd and eventpoll anonymous inodes.
- The plugin creates a large number of vert.x-eventloop threads and does not release them even when the server is idle, leading to continually increasing CPU/memory utilisation.
- With v4.0.0-505 we do not see vert.x-eventloop threads, and file descriptors are released as well as created.
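For anyone wanting to check whether they are affected, the observations above can be reproduced with a small script that groups a process's open file descriptors by kind. This is only a sketch, assuming a Linux host where `/proc/<pid>/fd` is readable for the GoCD server process; the function names and the script name `fd_summary.py` are made up for illustration.

```python
import os
import sys
from collections import Counter


def fd_kind(target):
    """Map a /proc/<pid>/fd symlink target to a coarse category."""
    if target.startswith("anon_inode:"):
        # e.g. "anon_inode:[eventpoll]" -> "eventpoll"
        return target[len("anon_inode:"):].strip("[]")
    if target.startswith("socket:"):
        return "socket"
    if target.startswith("pipe:"):
        return "pipe"
    return "file"


def summarize(targets):
    """Count symlink targets per category."""
    return Counter(fd_kind(t) for t in targets)


def read_fd_targets(pid):
    """Read the symlink target of every open fd of the given process."""
    fd_dir = f"/proc/{pid}/fd"
    targets = []
    for name in os.listdir(fd_dir):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            # The fd was closed between listdir() and readlink(); skip it.
            continue
    return targets


if __name__ == "__main__" and len(sys.argv) > 1:
    for kind, count in summarize(read_fd_targets(int(sys.argv[1]))).most_common():
        print(f"{kind:12} {count}")
```

Running it periodically, for example `python fd_summary.py <java-pid>` inside the server pod, and comparing the `eventpoll` and `eventfd` counts over time should show whether they keep growing while the server is idle.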
GoCD Server Logs:
2025-03-29 03:30:16,273 INFO [151@MessageListener for MaterialUpdateListener] MaterialDatabaseUpdater:124 - [Material Update] Modification check failed for material: [[email protected]:ORG/REPO.git, path=gocd-agent-python3-eks] cause: java.lang.RuntimeException: The plugin sent a response that could not be understood by Go. Plugin returned with code '500' and the following response: '"latest-revisions-since failed due to [git failed: Cannot run program \"git\" (in directory \".\"): error=24, Too many open files (git clone --branch=master --no-checkout [email protected]:ORG/REPO.git /go-working-dir/pipelines/flyweight/0f3f009d-c29a-4ee1-b8df-bc144978d824)], root cause [IOException: error=24, Too many open files]"'
2025-03-29 03:30:16,277 INFO [156@MessageListener for MaterialUpdateListener] MaterialDatabaseUpdater:124 - [Material Update] Modification check failed for material: [[email protected]:ORG/REPO.git, path=gocd-agent-infra-eks] cause: java.lang.RuntimeException: The plugin sent a response that could not be understood by Go. Plugin returned with code '500' and the following response: '"latest-revisions-since failed due to [git failed: Cannot run program \"git\" (in directory \".\"): error=24, Too many open files (git clone --branch=master --no-checkout [email protected]:ORG/REPO.git /go-working-dir/pipelines/flyweight/a79e628a-e44a-4e31-9898-2930fed6907e)], root cause [IOException: error=24, Too many open files]"'
2025-03-29 03:30:16,282 INFO [153@MessageListener for MaterialUpdateListener] MaterialDatabaseUpdater:124 - [Material Update] Modification check failed for material: [[email protected]:ORG/REPO.git, path=gocd-agent-terraform-eks] cause: java.lang.RuntimeException: The plugin sent a response that could not be understood by Go. Plugin returned with code '500' and the following response: '"latest-revisions-since failed due to [git failed: Cannot run program \"git\" (in directory \".\"): error=24, Too many open files (git clone --branch=master --no-checkout [email protected]:ORG/REPO.git /go-working-dir/pipelines/flyweight/f5528b96-ffca-48a3-ab29-b0b0fec7ddb4)], root cause [IOException: error=24, Too many open files]"'
List of open files from the gocd server pod:
246 /gocd-jre/bin/java 876 anon_inode:[eventpoll]
246 /gocd-jre/bin/java 877 anon_inode:[eventfd]
246 /gocd-jre/bin/java 878 anon_inode:[eventpoll]
246 /gocd-jre/bin/java 879 anon_inode:[eventpoll]
246 /gocd-jre/bin/java 880 anon_inode:[eventfd]
246 /gocd-jre/bin/java 881 anon_inode:[eventfd]
246 /gocd-jre/bin/java 882 anon_inode:[eventpoll]
246 /gocd-jre/bin/java 883 anon_inode:[eventfd]
246 /gocd-jre/bin/java 884 anon_inode:[eventpoll]
246 /gocd-jre/bin/java 885 anon_inode:[eventfd]
246 /gocd-jre/bin/java 886 anon_inode:[eventpoll]
246 /gocd-jre/bin/java 887 anon_inode:[eventfd]
246 /gocd-jre/bin/java 888 anon_inode:[eventpoll]
246 /gocd-jre/bin/java 889 anon_inode:[eventfd]
246 /gocd-jre/bin/java 890 anon_inode:[eventpoll]
246 /gocd-jre/bin/java 891 anon_inode:[eventfd]
246 /gocd-jre/bin/java 892 anon_inode:[eventpoll]
246 /gocd-jre/bin/java 893 anon_inode:[eventfd]
246 /gocd-jre/bin/java 894 anon_inode:[eventpoll]
246 /gocd-jre/bin/java 895 anon_inode:[eventfd]
246 /gocd-jre/bin/java 896 anon_inode:[eventpoll]
Thanks. This must be due to some change in the way the underlying Kubernetes client library works, as there was a major version change (v6 to v7) between these two plugin versions.
Should be resolved with https://github.com/gocd/kubernetes-elastic-agents/releases/tag/v4.1.1-661
The way the plugin cached cluster connections has always been rather dodgy, but this became much worse with the newer Kubernetes client library. It would be worse still if you have many different cluster profiles, since the plugin only attempts to cache a single client at a time (rather than a pool), so it keeps creating new clients and leaking resources from the old ones.
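To illustrate the difference between a single-slot cache and a pool, here is a language-agnostic sketch written in Python. It is not the plugin's code (the plugin is Java) and not necessarily how v4.1.1-661 fixes the problem; `ClientPool`, `FakeClient` and `max_size` are hypothetical names. The point is that each cluster profile gets its own cached client, and any client dropped from the cache is explicitly closed so its threads and descriptors are released.

```python
from collections import OrderedDict


class FakeClient:
    """Stand-in for a real cluster client that holds threads and sockets."""

    def __init__(self, profile_key):
        self.profile_key = profile_key
        self.closed = False

    def close(self):
        self.closed = True


class ClientPool:
    """Caches up to max_size clients, one per profile, closing evicted ones."""

    def __init__(self, factory, max_size=8):
        self._factory = factory
        self._max_size = max_size
        self._clients = OrderedDict()  # profile_key -> client, oldest first

    def get(self, profile_key):
        client = self._clients.get(profile_key)
        if client is not None:
            # Cache hit: mark as most recently used and reuse it.
            self._clients.move_to_end(profile_key)
            return client
        client = self._factory(profile_key)
        self._clients[profile_key] = client
        while len(self._clients) > self._max_size:
            # Evict the least recently used client and release its resources.
            _, evicted = self._clients.popitem(last=False)
            evicted.close()
        return client

    def close_all(self):
        while self._clients:
            _, client = self._clients.popitem()
            client.close()


pool = ClientPool(FakeClient, max_size=2)
a = pool.get("cluster-a")
b = pool.get("cluster-b")
same_a = pool.get("cluster-a")  # reused, no new client created
c = pool.get("cluster-c")       # evicts and closes the client for cluster-b
```

A real implementation would also need to be thread-safe and to avoid closing a client that is still in use by another request; both are left out here to keep the sketch short.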