Flaky-test: PulsarWorkerRebalanceDrainTest.testRebalanceWorkers. Function worker leader changes unpredictably causing flakiness.

Open ShivrajJ opened this issue 7 months ago • 1 comments

Search before reporting

[x] I searched in the issues and found nothing similar.

Example failure

https://gist.github.com/ShivrajJ/134d23d79a6e122677fd5b300c4de3fa

Exception stacktrace

I added a debug-level log of the topic stats, so line numbers in the stack trace may be slightly offset. See https://github.com/cognitree/pulsar/pull/22 for the changes.

Expected :true
Actual   :false

java.lang.AssertionError:
	at org.testng.Assert.fail(Assert.java:110)
	at org.testng.Assert.failNotEquals(Assert.java:1577)
	at org.testng.Assert.assertTrue(Assert.java:56)
	at org.testng.Assert.assertTrue(Assert.java:66)
	at org.apache.pulsar.tests.integration.functions.java.PulsarWorkerRebalanceDrainTest.testRebalance(PulsarWorkerRebalanceDrainTest.java:350)
	at org.apache.pulsar.tests.integration.functions.java.PulsarWorkerRebalanceDrainTest.testRebalanceWorkers(PulsarWorkerRebalanceDrainTest.java:70)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:139)
	at org.testng.internal.invokers.TestInvoker.invokeMethod(TestInvoker.java:677)
	at org.testng.internal.invokers.TestInvoker.invokeTestMethod(TestInvoker.java:221)
	at org.testng.internal.invokers.MethodRunner.runInSequence(MethodRunner.java:50)
	at org.testng.internal.invokers.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:969)
	at org.testng.internal.invokers.TestInvoker.invokeTestMethods(TestInvoker.java:194)
	at org.testng.internal.invokers.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:148)
	at org.testng.internal.invokers.TestMethodWorker.run(TestMethodWorker.java:128)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
	at org.testng.TestRunner.privateRun(TestRunner.java:829)
	at org.testng.TestRunner.run(TestRunner.java:602)
	at org.testng.SuiteRunner.runTest(SuiteRunner.java:437)
	at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:431)
	at org.testng.SuiteRunner.privateRun(SuiteRunner.java:391)
	at org.testng.SuiteRunner.run(SuiteRunner.java:330)
	at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
	at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:95)
	at org.testng.TestNG.runSuitesSequentially(TestNG.java:1256)
	at org.testng.TestNG.runSuitesLocally(TestNG.java:1176)
	at org.testng.TestNG.runSuites(TestNG.java:1099)
	at org.testng.TestNG.run(TestNG.java:1067)
	at com.intellij.rt.testng.IDEARemoteTestNG.run(IDEARemoteTestNG.java:65)
	at com.intellij.rt.testng.RemoteTestNGStarter.main(RemoteTestNGStarter.java:105)

Are you willing to submit a PR?

[ ] I'm willing to submit a PR!

May 21 '25 13:05 ShivrajJ

Description:

In the tests module, the integration test PulsarWorkerRebalanceDrainTest#testRebalanceWorkers is flaky. Sometimes, after adding more function workers as part of the test, it fails because the function worker leader changes unexpectedly.

Expected Behaviour: The original function worker leader should remain the same after adding more workers.

Actual Behaviour: The leader changes unexpectedly in some cases even though the connectedSince field in the topic stats for the coordination topic doesn't change, and there are no disconnection messages in the logs before the leadership changes.
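
The check behind this observation can be expressed as a comparison of two stats snapshots taken before and after the extra workers are added. The following is only a minimal sketch: it assumes the topic stats have already been reduced to the active consumer name and a connectedSince value per consumer. The class, record, and worker names are hypothetical and not part of Pulsar or of the test.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Hypothetical helper: compares two snapshots of the coordination subscription. */
final class LeaderSnapshotDiff {

    /** Assumed shape: the active consumer plus a connectedSince value per consumer name. */
    public record Snapshot(String activeConsumer, Map<String, String> connectedSince) {}

    /** Returns human-readable findings; an empty list means nothing changed. */
    public static List<String> diff(Snapshot before, Snapshot after) {
        List<String> findings = new ArrayList<>();
        if (!Objects.equals(before.activeConsumer(), after.activeConsumer())) {
            findings.add("leader changed: " + before.activeConsumer()
                    + " -> " + after.activeConsumer());
        }
        // A missing or different connectedSince means the consumer dropped or reconnected.
        for (Map.Entry<String, String> e : before.connectedSince().entrySet()) {
            String now = after.connectedSince().get(e.getKey());
            if (now == null) {
                findings.add(e.getKey() + " disconnected");
            } else if (!now.equals(e.getValue())) {
                findings.add(e.getKey() + " reconnected at " + now);
            }
        }
        return findings;
    }

    /** True when leadership moved although the old leader's connection is unchanged. */
    public static boolean leaderMovedWithoutReconnect(Snapshot before, Snapshot after) {
        String oldLeader = before.activeConsumer();
        String was = before.connectedSince().get(oldLeader);
        return !Objects.equals(oldLeader, after.activeConsumer())
                && was != null
                && was.equals(after.connectedSince().get(oldLeader));
    }
}
```

If leaderMovedWithoutReconnect returns true for a failing run, that matches the behaviour described above: the leader moved while the original leader's consumer stayed connected.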

Steps to Reproduce:

1. Build the Docker images (pulsar, pulsar-all, pulsar-test-latest-version).
2. Run the PulsarWorkerRebalanceDrainTest#testRebalanceWorkers test (reproducing the failure sometimes requires multiple runs).
3. Observe the test logs (Cluster leader before..., Cluster leader after...).

Notes:

From my analysis, the problem is more frequent with the Thread runtime, but it occurs intermittently with both the Process and Thread runtimes.

The function workers decide their leader based on a failover subscription to the public/functions/coordinate topic (non-partitioned), so there is no apparent reason for the leader to change, since the connectedSince field for the original leader doesn't change in the topic stats.
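
To pin down when leadership actually moves, each active/inactive transition could be recorded with a timestamp and compared against the broker and worker logs. The sketch below assumes transitions arrive through callbacks shaped like the Pulsar Java client's ConsumerEventListener (becameActive / becameInactive) on the failover subscription. The LeadershipTimeline class and the worker ids are hypothetical and do not exist in Pulsar.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical recorder for leadership transitions; not part of Pulsar. */
final class LeadershipTimeline {

    /** One transition: which worker, whether it gained leadership, and when. */
    public record Event(String workerId, boolean becameLeader, long timestampMillis) {}

    private final List<Event> events = new ArrayList<>();

    /** Intended to be called when a worker's consumer becomes the active one. */
    public synchronized void becameActive(String workerId, long timestampMillis) {
        events.add(new Event(workerId, true, timestampMillis));
    }

    /** Intended to be called when a worker's consumer stops being the active one. */
    public synchronized void becameInactive(String workerId, long timestampMillis) {
        events.add(new Event(workerId, false, timestampMillis));
    }

    /** Immutable copy of everything recorded so far, in arrival order. */
    public synchronized List<Event> events() {
        return List.copyOf(events);
    }

    /** The worker that last became active and has not gone inactive since, or null. */
    public synchronized String currentLeader() {
        String leader = null;
        for (Event e : events) {
            if (e.becameLeader()) {
                leader = e.workerId();
            } else if (e.workerId().equals(leader)) {
                leader = null;
            }
        }
        return leader;
    }

    /** Number of times leadership passed from one worker to a different one. */
    public synchronized int handovers() {
        int count = 0;
        String lastLeader = null;
        for (Event e : events) {
            if (!e.becameLeader()) {
                continue;
            }
            if (lastLeader != null && !lastLeader.equals(e.workerId())) {
                count++;
            }
            lastLeader = e.workerId();
        }
        return count;
    }
}
```

For this test, a passing run would be expected to show zero handovers after the additional workers join.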

I also see no disconnection messages from the original leader in the logs until after the leadership changes. The original leader's producers on the assignment and metadata topics disconnect after it loses leadership, which seems to be the expected behaviour.

I can add more logs from the containers to the gist if needed.

https://gist.github.com/ShivrajJ/134d23d79a6e122677fd5b300c4de3fa

May 21 '25 14:05 ShivrajJ