kafka icon indicating copy to clipboard operation
kafka copied to clipboard

KAFKA-14242: do not init managers twice to avoid resource leak

Open showuon opened this issue 3 years ago • 9 comments

Recently, we got a lot of build failed (and terminated) with core:unitTest failure. The failed messages look like this:

FAILURE: Build failed with an exception.
[2022-09-14T09:51:52.190Z] 
[2022-09-14T09:51:52.190Z] * What went wrong:
[2022-09-14T09:51:52.190Z] Execution failed for task ':core:unitTest'.
[2022-09-14T09:51:52.190Z] > Process 'Gradle Test Executor 128' finished with non-zero exit value 1

After investigation, I found one reason of it (maybe there are other reasons). In BrokerMetadataPublisherTest#testReloadUpdatedFilesWithoutConfigChange test, we created logManager twice, but when cleanup, we only close one of them. So, there will be a log cleaner keeping running. But during this time, the temp log dirs are deleted, so it will Exit.halt(1), and got the error we saw in gradle, like this code did when we encounter IOException in all our log dirs:

fatal(s"Shutdown broker because all log dirs in ${logDirs.mkString(", ")} have failed")
Exit.halt(1)

And, why does it sometimes pass, sometimes failed? Because during test cluster close, we shutdown broker first, and then other components. And the log cleaner is triggered in an interval. So, if the cluster can close fast enough, and finish this test, it'll be passed. Otherwise, it'll exit with 1.

Fixed it by disable _firstPublish flag for mock publisher, to avoid resource leak.

Committer Checklist (excluded from commit message)

  • [ ] Verify design and implementation
  • [ ] Verify test coverage and CI build status
  • [ ] Verify documentation (including upgrade notes)

showuon avatar Sep 14 '22 13:09 showuon

@hachikuji @jsancio , please take a look. Thanks.

showuon avatar Sep 14 '22 13:09 showuon

Wow! Great deep dive to find the root cause here @showuon 👏

I am curious, how did you narrow down that Gradle Test Executor 128 is related to testReloadUpdatedFilesWithoutConfigChange. Did you check the the logs and search for exception stack traces and managed to get the IOException?

divijvaidya avatar Sep 14 '22 16:09 divijvaidya

@divijvaidya , thanks.

I am curious, how did you narrow down that Gradle Test Executor 128 is related to testReloadUpdatedFilesWithoutConfigChange. Did you check the the logs and search for exception stack traces and managed to get the IOException?

Since the test got terminated, not closed gracefully, so there's no exception stack traces unfortunately, which is why it is not easy to identify the problem. For me, I don't have any trick here. I analyze some of the build logs like here, and found the "core:unitTest" mostly got terminated at ClusterTestExtensionsTest suite. Starting from it, I check the test code, found nothing special, then, find the tests earlier, since I believe there could be some earlier test affect it or somewhat. So, in the end, I found the issue in this PR in the BrokerMetadataPublisherTest test suite. Hope this is the only issue we need to fix!

2022-09-15T01:16:27.075Z] BrokerMetadataPublisherTest > testReloadUpdatedFilesWithoutConfigChange() STARTED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] BrokerMetadataPublisherTest > testReloadUpdatedFilesWithoutConfigChange() PASSED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testCreateAndClose() STARTED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testCreateAndClose() PASSED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testCreateSnapshot() STARTED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testCreateSnapshot() PASSED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testCreateSnapshotMultipleReasons() STARTED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testCreateSnapshotMultipleReasons() PASSED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testRecordListConsumer() STARTED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testRecordListConsumer() PASSED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] MockConfigRepositoryTest > testEmptyRepository() STARTED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] MockConfigRepositoryTest > testEmptyRepository() PASSED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] MockConfigRepositoryTest > testSetTopicConfig() STARTED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] MockConfigRepositoryTest > testSetTopicConfig() PASSED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] ClusterTestExtensionsTest > testClusterTest(ClusterConfig, ClusterInstance) > kafka.test.ClusterTestExtensionsTest.testClusterTest(ClusterConfig, ClusterInstance)[1] STARTED
[2022-09-15T01:16:28.050Z] 
[2022-09-15T01:16:28.050Z] ClusterTestExtensionsTest > testClusterTest(ClusterConfig, ClusterInstance) > kafka.test.ClusterTestExtensionsTest.testClusterTest(ClusterConfig, ClusterInstance)[1] PASSED
[2022-09-15T01:16:28.050Z] 
[2022-09-15T01:16:28.050Z] ClusterTestExtensionsTest > testNoAutoStart() > kafka.test.ClusterTestExtensionsTest.testNoAutoStart()[1] STARTED
[2022-09-15T01:16:32.108Z] 
[2022-09-15T01:16:32.108Z] ClusterTestExtensionsTest > testNoAutoStart() > kafka.test.ClusterTestExtensionsTest.testNoAutoStart()[1] PASSED
[2022-09-15T01:16:32.108Z] 
[2022-09-15T01:16:32.108Z] ClusterTestExtensionsTest > testClusterTests() > kafka.test.ClusterTestExtensionsTest.testClusterTests()[1] STARTED
[2022-09-15T01:16:34.943Z] 
[2022-09-15T01:16:34.943Z] ClusterTestExtensionsTest > testClusterTests() > kafka.test.ClusterTestExtensionsTest.testClusterTests()[1] PASSED
[2022-09-15T01:16:34.943Z] 
[2022-09-15T01:16:34.943Z] ClusterTestExtensionsTest > testClusterTests() > kafka.test.ClusterTestExtensionsTest.testClusterTests()[2] STARTED
[2022-09-15T01:16:38.147Z] 
[2022-09-15T01:16:38.147Z] ClusterTestExtensionsTest > testClusterTests() > kafka.test.ClusterTestExtensionsTest.testClusterTests()[2] PASSED
[2022-09-15T01:16:38.147Z] 
[2022-09-15T01:16:38.147Z] ClusterTestExtensionsTest > testClusterTests() > kafka.test.ClusterTestExtensionsTest.testClusterTests()[3] STARTED
[2022-09-15T01:16:40.319Z] 
[2022-09-15T01:16:40.319Z] ClusterTestExtensionsTest > testClusterTests() > kafka.test.ClusterTestExtensionsTest.testClusterTests()[3] SKIPPED
[2022-09-15T01:16:41.474Z] 
[2022-09-15T01:16:41.474Z] > Task :core:unitTest FAILED

showuon avatar Sep 15 '22 07:09 showuon

There's been 3 straight builds not get terminated in the :core:unitTest.

image

showuon avatar Sep 15 '22 10:09 showuon

cc @ijuma , I think we should merge this fix soon to help our build turn red back to yellow/green light.

image

showuon avatar Sep 15 '22 12:09 showuon

@dajac @hachikuji , FYI

showuon avatar Sep 18 '22 23:09 showuon

@showuon Great find. I am not sure about the fix though, it seems a bit odd to update the mock in this way. To get things back into a stable state, we should perhaps disable this test while we figure out the right way to fix it.

ijuma avatar Sep 19 '22 00:09 ijuma

@showuon Great find. I am not sure about the fix though, it seems a bit odd to update the mock in this way. To get things back into a stable state, we should perhaps disable this test while we figure out the right way to fix it.

Sure, I've opened https://github.com/apache/kafka/pull/12658 to disable the test first. Thanks.

showuon avatar Sep 19 '22 01:09 showuon

@showuon @ijuma I've taken a stab at a safeguard to prevent tests from accidentally terminating the JVM, and to accurately report where those calls for termination come from. If you're interested, the PR can be found here.

C0urante avatar Sep 20 '22 15:09 C0urante

@hachikuji , please take a look. Thanks.

showuon avatar Sep 29 '22 12:09 showuon

@hachikuji @mumrah , please take a look. Thanks.

showuon avatar Oct 31 '22 07:10 showuon