KAFKA-14242: do not init managers twice to avoid resource leak
Recently, we got a lot of build failed (and terminated) with core:unitTest failure. The failed messages look like this:
FAILURE: Build failed with an exception.
[2022-09-14T09:51:52.190Z]
[2022-09-14T09:51:52.190Z] * What went wrong:
[2022-09-14T09:51:52.190Z] Execution failed for task ':core:unitTest'.
[2022-09-14T09:51:52.190Z] > Process 'Gradle Test Executor 128' finished with non-zero exit value 1
After investigation, I found one reason of it (maybe there are other reasons). In BrokerMetadataPublisherTest#testReloadUpdatedFilesWithoutConfigChange test, we created logManager twice, but when cleanup, we only close one of them. So, there will be a log cleaner keeping running. But during this time, the temp log dirs are deleted, so it will Exit.halt(1), and got the error we saw in gradle, like this code did when we encounter IOException in all our log dirs:
fatal(s"Shutdown broker because all log dirs in ${logDirs.mkString(", ")} have failed")
Exit.halt(1)
And, why does it sometimes pass, sometimes failed? Because during test cluster close, we shutdown broker first, and then other components. And the log cleaner is triggered in an interval. So, if the cluster can close fast enough, and finish this test, it'll be passed. Otherwise, it'll exit with 1.
Fixed it by disable _firstPublish flag for mock publisher, to avoid resource leak.
Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)
@hachikuji @jsancio , please take a look. Thanks.
Wow! Great deep dive to find the root cause here @showuon 👏
I am curious, how did you narrow down that Gradle Test Executor 128 is related to testReloadUpdatedFilesWithoutConfigChange. Did you check the the logs and search for exception stack traces and managed to get the IOException?
@divijvaidya , thanks.
I am curious, how did you narrow down that Gradle Test Executor 128 is related to testReloadUpdatedFilesWithoutConfigChange. Did you check the the logs and search for exception stack traces and managed to get the IOException?
Since the test got terminated, not closed gracefully, so there's no exception stack traces unfortunately, which is why it is not easy to identify the problem. For me, I don't have any trick here. I analyze some of the build logs like here, and found the "core:unitTest" mostly got terminated at ClusterTestExtensionsTest suite. Starting from it, I check the test code, found nothing special, then, find the tests earlier, since I believe there could be some earlier test affect it or somewhat. So, in the end, I found the issue in this PR in the BrokerMetadataPublisherTest test suite. Hope this is the only issue we need to fix!
2022-09-15T01:16:27.075Z] BrokerMetadataPublisherTest > testReloadUpdatedFilesWithoutConfigChange() STARTED
[2022-09-15T01:16:27.075Z]
[2022-09-15T01:16:27.075Z] BrokerMetadataPublisherTest > testReloadUpdatedFilesWithoutConfigChange() PASSED
[2022-09-15T01:16:27.075Z]
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testCreateAndClose() STARTED
[2022-09-15T01:16:27.075Z]
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testCreateAndClose() PASSED
[2022-09-15T01:16:27.075Z]
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testCreateSnapshot() STARTED
[2022-09-15T01:16:27.075Z]
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testCreateSnapshot() PASSED
[2022-09-15T01:16:27.075Z]
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testCreateSnapshotMultipleReasons() STARTED
[2022-09-15T01:16:27.075Z]
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testCreateSnapshotMultipleReasons() PASSED
[2022-09-15T01:16:27.075Z]
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testRecordListConsumer() STARTED
[2022-09-15T01:16:27.075Z]
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testRecordListConsumer() PASSED
[2022-09-15T01:16:27.075Z]
[2022-09-15T01:16:27.075Z] MockConfigRepositoryTest > testEmptyRepository() STARTED
[2022-09-15T01:16:27.075Z]
[2022-09-15T01:16:27.075Z] MockConfigRepositoryTest > testEmptyRepository() PASSED
[2022-09-15T01:16:27.075Z]
[2022-09-15T01:16:27.075Z] MockConfigRepositoryTest > testSetTopicConfig() STARTED
[2022-09-15T01:16:27.075Z]
[2022-09-15T01:16:27.075Z] MockConfigRepositoryTest > testSetTopicConfig() PASSED
[2022-09-15T01:16:27.075Z]
[2022-09-15T01:16:27.075Z] ClusterTestExtensionsTest > testClusterTest(ClusterConfig, ClusterInstance) > kafka.test.ClusterTestExtensionsTest.testClusterTest(ClusterConfig, ClusterInstance)[1] STARTED
[2022-09-15T01:16:28.050Z]
[2022-09-15T01:16:28.050Z] ClusterTestExtensionsTest > testClusterTest(ClusterConfig, ClusterInstance) > kafka.test.ClusterTestExtensionsTest.testClusterTest(ClusterConfig, ClusterInstance)[1] PASSED
[2022-09-15T01:16:28.050Z]
[2022-09-15T01:16:28.050Z] ClusterTestExtensionsTest > testNoAutoStart() > kafka.test.ClusterTestExtensionsTest.testNoAutoStart()[1] STARTED
[2022-09-15T01:16:32.108Z]
[2022-09-15T01:16:32.108Z] ClusterTestExtensionsTest > testNoAutoStart() > kafka.test.ClusterTestExtensionsTest.testNoAutoStart()[1] PASSED
[2022-09-15T01:16:32.108Z]
[2022-09-15T01:16:32.108Z] ClusterTestExtensionsTest > testClusterTests() > kafka.test.ClusterTestExtensionsTest.testClusterTests()[1] STARTED
[2022-09-15T01:16:34.943Z]
[2022-09-15T01:16:34.943Z] ClusterTestExtensionsTest > testClusterTests() > kafka.test.ClusterTestExtensionsTest.testClusterTests()[1] PASSED
[2022-09-15T01:16:34.943Z]
[2022-09-15T01:16:34.943Z] ClusterTestExtensionsTest > testClusterTests() > kafka.test.ClusterTestExtensionsTest.testClusterTests()[2] STARTED
[2022-09-15T01:16:38.147Z]
[2022-09-15T01:16:38.147Z] ClusterTestExtensionsTest > testClusterTests() > kafka.test.ClusterTestExtensionsTest.testClusterTests()[2] PASSED
[2022-09-15T01:16:38.147Z]
[2022-09-15T01:16:38.147Z] ClusterTestExtensionsTest > testClusterTests() > kafka.test.ClusterTestExtensionsTest.testClusterTests()[3] STARTED
[2022-09-15T01:16:40.319Z]
[2022-09-15T01:16:40.319Z] ClusterTestExtensionsTest > testClusterTests() > kafka.test.ClusterTestExtensionsTest.testClusterTests()[3] SKIPPED
[2022-09-15T01:16:41.474Z]
[2022-09-15T01:16:41.474Z] > Task :core:unitTest FAILED
There's been 3 straight builds not get terminated in the :core:unitTest.
cc @ijuma , I think we should merge this fix soon to help our build turn red back to yellow/green light.
@dajac @hachikuji , FYI
@showuon Great find. I am not sure about the fix though, it seems a bit odd to update the mock in this way. To get things back into a stable state, we should perhaps disable this test while we figure out the right way to fix it.
@showuon Great find. I am not sure about the fix though, it seems a bit odd to update the mock in this way. To get things back into a stable state, we should perhaps disable this test while we figure out the right way to fix it.
Sure, I've opened https://github.com/apache/kafka/pull/12658 to disable the test first. Thanks.
@showuon @ijuma I've taken a stab at a safeguard to prevent tests from accidentally terminating the JVM, and to accurately report where those calls for termination come from. If you're interested, the PR can be found here.
@hachikuji , please take a look. Thanks.
@hachikuji @mumrah , please take a look. Thanks.