
MINOR Capture heap dump after OOM on CI

mumrah opened this pull request on Feb 25, 2025 · 9 comments

We have seen a few OOM errors on trunk lately. This patch adds the ability to capture a heap dump when this happens so we can better determine if the error was due to something in Gradle or within our own tests (like a memory leak).
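For context, the usual way to do this for Gradle-forked test JVMs is with the standard HotSpot flags; a minimal sketch of the idea in a Gradle test block (illustrative only, not the exact change in this patch; the dump directory would also need to exist and be uploaded as a CI artifact):

```groovy
// Minimal sketch: ask every forked test JVM to write an .hprof file on OOM
// so CI can archive it for later analysis. The path here is illustrative.
tasks.withType(Test).configureEach {
    jvmArgs "-XX:+HeapDumpOnOutOfMemoryError",
            "-XX:HeapDumpPath=${rootDir}/heap-dumps"
}
```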

mumrah · Feb 25, 2025

It looks like the UserQuotaTest is the likely culprit.

From https://github.com/apache/kafka/actions/runs/13523225366/job/37791507296

```
Gradle Test Run :core:test > Gradle Test Executor 38 > UserQuotaTest > testQuotaOverrideDelete(String, String) > testQuotaOverrideDelete(String, String).quorum=kraft.groupProtocol=consumer STARTED

> Task :storage:compileTestJava
Unexpected exception thrown.
org.gradle.internal.remote.internal.MessageIOException: Could not read message from '/127.0.0.1:50402'.
    at org.gradle.internal.remote.internal.inet.SocketConnection.receive(SocketConnection.java:99)
    at org.gradle.internal.remote.internal.hub.MessageHub$ConnectionReceive.run(MessageHub.java:270)
    at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:64)
    at org.gradle.internal.concurrent.AbstractManagedExecutor$1.run(AbstractManagedExecutor.java:48)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
```

and from https://github.com/apache/kafka/actions/runs/13534291834/job/37823471305

```
Gradle Test Run :core:test > Gradle Test Executor 37 > UserQuotaTest > testQuotaOverrideDelete(String, String) > testQuotaOverrideDelete(String, String).quorum=kraft.groupProtocol=consumer STARTED

> Task :storage:checkstyleMain
> Task :shell:checkstyleMain
Unexpected exception thrown.
org.gradle.internal.remote.internal.MessageIOException: Could not read message from '/127.0.0.1:38838'.
    at org.gradle.internal.remote.internal.inet.SocketConnection.receive(SocketConnection.java:99)
    at org.gradle.internal.remote.internal.hub.MessageHub$ConnectionReceive.run(MessageHub.java:270)
    at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:64)
    at org.gradle.internal.concurrent.AbstractManagedExecutor$1.run(AbstractManagedExecutor.java:48)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
```

It appears that the Gradle worker is trying to send results to the main process, which causes a long GC pause and triggers the "GC overhead limit exceeded" error. That error is the JVM's fast-fail for the case where it is spending the vast majority of its time in GC while recovering very little heap, so it usually means the heap is nearly exhausted.

mumrah · Feb 26, 2025

Seems to have reproduced here: https://github.com/apache/kafka/actions/runs/13550598471/job/37873138268?pr=19031

No activity for a while after this point in the build log:

```
Wed, 26 Feb 2025 18:35:41 GMT > Task :streams:test-utils:copyDependantLibs
Wed, 26 Feb 2025 18:52:51 GMT > Task :streams:test-utils:jar
Wed, 26 Feb 2025 18:55:06 GMT > Task :connect:runtime:compileJava
```

This suggests that -XX:-UseGCOverheadLimit is working as expected: instead of failing fast with the "GC overhead limit exceeded" error, the build just stalls. However, it also suggests there is a real memory leak (or genuinely excessive memory usage) rather than a spurious GC heuristic. This run used a larger heap of 3 GB.
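For reference, both knobs discussed here are standard JVM/Gradle options; a minimal sketch of how they might be wired for the forked test JVMs (not necessarily how this patch or the CI scripts actually set them):

```groovy
// Sketch only: keep the JVM running (and able to dump the heap) instead of
// failing fast on long GC pauses, and raise the per-fork heap as in this run.
tasks.withType(Test).configureEach {
    maxHeapSize = "3g"                  // larger heap used in this run
    jvmArgs "-XX:-UseGCOverheadLimit"   // disable the GC-overhead fast-fail
}
```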

mumrah · Feb 26, 2025

I've not been able to reproduce the Gradle OOM that we're seeing on trunk; however, I saw a different OOM over on my fork.

https://github.com/mumrah/kafka/actions/runs/13639578283/job/38126853058

```
2025-03-03T21:58:41.1436187Z Gradle Test Run :connect:runtime:test > Gradle Test Executor 71 > ConnectWorkerIntegrationTest > testPauseStopResume() STARTED
2025-03-03T21:58:43.8433931Z
2025-03-03T21:58:43.8434407Z java.lang.OutOfMemoryError: Java heap space
2025-03-03T21:58:43.8434975Z Dumping heap to /home/runner/work/kafka/kafka/heap-dumps/java_pid471589.hprof ...
2025-03-03T21:58:46.0433855Z
2025-03-03T21:58:46.0434629Z > Task :connect:runtime:test
2025-03-03T21:58:46.0435151Z
```

(The heap dump was uploaded here: https://github.com/mumrah/kafka/actions/runs/13639578283/artifacts/2685005476)

This at least shows that the OOM arguments and heap dump archiving are working.

mumrah · Mar 3, 2025

@chia7712 Even with the recent JUnit heap increase, we are still seeing OOMs on trunk: https://github.com/apache/kafka/actions/runs/13807500150/job/38622104924#step:6:616

Can you take a look at this one?

mumrah · Mar 13, 2025

> Can you take a look at this one?

Roger that.

chia7712 · Mar 13, 2025

UserQuotaTest consumes over 1 GB of memory on my local machine. I have opened https://issues.apache.org/jira/browse/KAFKA-18989 and https://issues.apache.org/jira/browse/KAFKA-18990 to optimize its memory usage. Based on my local testing, merging the changes from those tickets reduces memory consumption from over 1 GB to ~500 MB.

However, the GitHub Actions runner only offers 16 GB of memory. Our CI executes four Gradle test workers, each running four tests in parallel, so up to 16 tests can be in flight at once. Consequently, if a PR happens to hit a set of "expensive" tests simultaneously, it may exceed the memory limit.
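To make the arithmetic concrete, here is a purely hypothetical Gradle sketch of how fork count, per-fork heap, and in-fork JUnit parallelism multiply; the values are illustrative, not Kafka's actual settings:

```groovy
// Hypothetical numbers for back-of-the-envelope math only.
tasks.withType(Test).configureEach {
    maxParallelForks = 4      // four Gradle test worker JVMs
    maxHeapSize = "4g"        // room for four ~1 GB tests running concurrently in one fork
    // JUnit 5 parallelism inside each fork:
    systemProperty "junit.jupiter.execution.parallel.enabled", "true"
    systemProperty "junit.jupiter.execution.parallel.config.strategy", "fixed"
    systemProperty "junit.jupiter.execution.parallel.config.fixed.parallelism", "4"
}
// Worst case: 4 forks x 4 GB of heap = 16 GB before counting metaspace,
// thread stacks, the Gradle daemon, and the OS itself, i.e. the entire
// memory of a hosted runner.
```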

chia7712 · Mar 14, 2025

@chia7712 thanks for investigating. The metric creation inside a loop sounds like the likely culprit.

mumrah · Mar 17, 2025

WDYT about merging this PR, @chia7712? I think it would help us diagnose OOMs much faster if we had a heap dump (which of course includes a thread dump in it).

mumrah · Mar 18, 2025

This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please leave a comment asking for a review. If the PR has merge conflicts, update it with the latest from the base branch.

If you are having difficulty finding a reviewer, please reach out on the [mailing list](https://kafka.apache.org/contact).

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

github-actions[bot] · Jun 18, 2025

This PR has been closed since it has not had any activity in 120 days. If you feel like this was a mistake, or you would like to continue working on it, please feel free to re-open the PR and ask for a review.

github-actions[bot] · Jul 18, 2025