MINOR: Capture heap dump after OOM on CI
We have seen a few OOM errors on trunk lately. This patch adds the ability to capture a heap dump when that happens, so we can better determine whether the error is due to something in Gradle or something in our own tests (like a memory leak).
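For context, the standard HotSpot flags `-XX:+HeapDumpOnOutOfMemoryError` and `-XX:HeapDumpPath=<dir>` make a JVM write an `.hprof` file when it hits an OOM, which is the kind of dump this change aims to collect from the test JVMs. The same kind of dump can also be triggered on demand through the HotSpot diagnostic MXBean, which can be handy when chasing a suspected leak locally. A minimal sketch, assuming a HotSpot JDK (the class name and output path below are just illustrative, not part of this patch):

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import com.sun.management.HotSpotDiagnosticMXBean;

public class HeapDumper {
    // Writes an .hprof file to the given path (the file must not already exist).
    // This is the same format the JVM produces with -XX:+HeapDumpOnOutOfMemoryError.
    public static void dump(String path, boolean liveObjectsOnly) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        HotSpotDiagnosticMXBean diagnostic = ManagementFactory.newPlatformMXBeanProxy(
            server, "com.sun.management:type=HotSpotDiagnostic", HotSpotDiagnosticMXBean.class);
        diagnostic.dumpHeap(path, liveObjectsOnly);
    }

    public static void main(String[] args) throws Exception {
        // "true" limits the dump to live (reachable) objects, which keeps the file smaller.
        dump("manual-dump.hprof", true);
    }
}
```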
It looks like the UserQuotaTest is the likely culprit.
From https://github.com/apache/kafka/actions/runs/13523225366/job/37791507296
Gradle Test Run :core:test > Gradle Test Executor 38 > UserQuotaTest > testQuotaOverrideDelete(String, String) > testQuotaOverrideDelete(String, String).quorum=kraft.groupProtocol=consumer STARTED
> Task :storage:compileTestJava
Unexpected exception thrown.
org.gradle.internal.remote.internal.MessageIOException: Could not read message from '/127.0.0.1:50402'.
at org.gradle.internal.remote.internal.inet.SocketConnection.receive(SocketConnection.java:99)
at org.gradle.internal.remote.internal.hub.MessageHub$ConnectionReceive.run(MessageHub.java:270)
at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:64)
at org.gradle.internal.concurrent.AbstractManagedExecutor$1.run(AbstractManagedExecutor.java:48)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
and from https://github.com/apache/kafka/actions/runs/13534291834/job/37823471305
Gradle Test Run :core:test > Gradle Test Executor 37 > UserQuotaTest > testQuotaOverrideDelete(String, String) > testQuotaOverrideDelete(String, String).quorum=kraft.groupProtocol=consumer STARTED
> Task :storage:checkstyleMain
> Task :shell:checkstyleMain
Unexpected exception thrown.
org.gradle.internal.remote.internal.MessageIOException: Could not read message from '/127.0.0.1:38838'.
at org.gradle.internal.remote.internal.inet.SocketConnection.receive(SocketConnection.java:99)
at org.gradle.internal.remote.internal.hub.MessageHub$ConnectionReceive.run(MessageHub.java:270)
at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:64)
at org.gradle.internal.concurrent.AbstractManagedExecutor$1.run(AbstractManagedExecutor.java:48)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
It appears that the Gradle worker is trying to send test results to the main process while nearly out of heap; it ends up spending most of its time in GC, which triggers this "GC overhead limit exceeded" error.
Seems to have reproduced here: https://github.com/apache/kafka/actions/runs/13550598471/job/37873138268?pr=19031
No activity for a while after
Wed, 26 Feb 2025 18:35:41 GMT > Task :streams:test-utils:copyDependantLibs
Wed, 26 Feb 2025 18:52:51 GMT > Task :streams:test-utils:jar
Wed, 26 Feb 2025 18:55:06 GMT > Task :connect:runtime:compileJava
This suggests that -XX:-UseGCOverheadLimit is working as expected: instead of failing fast, the worker keeps grinding through GC, which is why the build goes quiet. However, it also suggests that there is a real memory leak (or something similarly pathological). This run included a larger heap of 3GB.
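(For reference: "GC overhead limit exceeded" is thrown when the JVM spends nearly all of its time in GC while reclaiming almost no heap, so disabling the check with -XX:-UseGCOverheadLimit trades the fast failure for exactly this kind of long, silent stall. A tiny standalone repro of the mechanism, unrelated to Kafka:)

```java
import java.util.HashMap;
import java.util.Map;

// Run with a small heap, e.g. `java -Xmx64m GcOverheadDemo`. As the retained map
// fills the heap, each GC cycle recovers less and less memory. Depending on the
// collector this surfaces either as "GC overhead limit exceeded" (roughly: >98% of
// time spent in GC while freeing <2% of the heap) or as plain "Java heap space".
// With -XX:-UseGCOverheadLimit the overhead check is disabled, so the process
// grinds through ever-longer GC pauses before it finally fails.
public class GcOverheadDemo {
    public static void main(String[] args) {
        Map<Integer, String> retained = new HashMap<>();
        for (int i = 0; ; i++) {
            retained.put(i, "value-" + i); // small objects, all kept reachable
        }
    }
}
```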
I've not been able to reproduce the Gradle OOM that we're seeing on trunk; however, I saw a different OOM over on my fork.
https://github.com/mumrah/kafka/actions/runs/13639578283/job/38126853058
2025-03-03T21:58:41.1436187Z Gradle Test Run :connect:runtime:test > Gradle Test Executor 71 > ConnectWorkerIntegrationTest > testPauseStopResume() STARTED
2025-03-03T21:58:43.8433931Z
2025-03-03T21:58:43.8434407Z java.lang.OutOfMemoryError: Java heap space
2025-03-03T21:58:43.8434975Z Dumping heap to /home/runner/work/kafka/kafka/heap-dumps/java_pid471589.hprof ...
2025-03-03T21:58:46.0433855Z
2025-03-03T21:58:46.0434629Z > Task :connect:runtime:test
2025-03-03T21:58:46.0435151Z
(The heap dump was uploaded here: https://github.com/mumrah/kafka/actions/runs/13639578283/artifacts/2685005476)
This at least shows that the OOM arguments and heap dump archiving are working.
@chia7712 Even with the recent JUnit heap increase, we are still seeing OOMs on trunk: https://github.com/apache/kafka/actions/runs/13807500150/job/38622104924#step:6:616
Can you take a look at this one?
roger that
UserQuotaTest consumes over 1GB of memory on my local machine. I have opened https://issues.apache.org/jira/browse/KAFKA-18989 and https://issues.apache.org/jira/browse/KAFKA-18990 to optimize its memory usage. Based on my local testing, merging the changes from those tickets reduces memory consumption from 1GB to ~500MB.
However, a GitHub Actions runner offers 16GB of memory, and our CI runs four workers with four tests each in parallel, i.e. up to 16 tests at once. Consequently, if a PR happens to encounter a set of 'expensive' tests simultaneously, the combined usage may exceed that limit.
@chia7712 thanks for investigating. The metric creation inside a loop sounds like the likely culprit.
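For illustration only (this is not the actual Kafka code): the general shape of that kind of leak is a loop that creates and registers a new, uniquely named metric on every iteration in a long-lived registry that never drops it, so heap usage grows with the number of iterations instead of staying flat.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of a metrics-created-in-a-loop leak; all names and types are made up.
class SimpleMetricsRegistry {
    private final Map<String, double[]> metrics = new HashMap<>();

    // The registry keeps a strong reference to every metric ever registered.
    double[] register(String name) {
        return metrics.computeIfAbsent(name, n -> new double[1024]);
    }
}

class QuotaCheck {
    // Leaky pattern: a uniquely named metric per iteration is never removed.
    static void leaky(SimpleMetricsRegistry registry, int iterations) {
        for (int i = 0; i < iterations; i++) {
            double[] metric = registry.register("client-" + i + "-throttle-time");
            metric[0] += 1.0;
        }
    }

    // Fixed pattern: create (or look up) the metric once and update it inside the loop.
    static void fixed(SimpleMetricsRegistry registry, int iterations) {
        double[] metric = registry.register("client-throttle-time");
        for (int i = 0; i < iterations; i++) {
            metric[0] += 1.0;
        }
    }
}
```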
WDYT about merging this PR @chia7712? I think it would help us diagnose OOMs much faster if we had a heap dump (which of course includes the thread dump in it).
This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please leave a comment asking for a review. If the PR has merge conflicts, update it with the latest from the base branch.
If you are having difficulty finding a reviewer, please reach out on the [mailing list](https://kafka.apache.org/contact).
If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.
This PR has been closed since it has not had any activity in 120 days. If you feel like this was a mistake, or you would like to continue working on it, please feel free to re-open the PR and ask for a review.