KAFKA-18981: Fix flaky test QuorumControllerTest.testMinIsrUpdateWithElr
[KAFKA-18981] Deflake testMinIsrUpdateWithElr by maintaining broker heartbeats during waits
What
This PR removes nondeterminism in testMinIsrUpdateWithElr by ensuring broker sessions do not expire while the test is waiting/sleeping, through a lightweight background heartbeat pumper.
JIRA
Reproduction of flakiness
The test is stable until you insert a sleep equal to the controller session timeout immediately after topic creation. With the original code:
-
In
testMinIsrUpdateWithElr, add the line right after:CreateTopicsResponseData createTopicsResponseData = active.createTopics( ANONYMOUS_CONTEXT, createTopicsRequestData, Set.of("foo", "bar")).get();Insert:
Thread.sleep(sessionTimeoutMillis); -
Run the test loop a few times:
pass=0; fail=0; for i in {1..5}; do \ echo "=== Run $i ==="; \ ./gradlew :metadata:test \ --tests 'org.apache.kafka.controller.QuorumControllerTest' \ -Pkafka.test.run.flaky=true \ -PmaxParallelForks=1 -Dorg.gradle.workers.max=1 -Dtest.forkEvery=1 \ --rerun-tasks --no-daemon --no-build-cache && pass=$((pass+1)) || fail=$((fail+1)); \ done; echo "Summary: $pass pass / $fail fail"
Why this flakes
- The controller fences brokers that miss heartbeats for
sessionTimeoutMillis. - After initially unfencing all brokers, the test performs a sleep/wait before beginning the fencing phase.
- That pause allows the survivor broker (id
1) to lapse its session and get fenced transiently. - Subsequent ISR/ELR assertions sometimes observe this unintended state, causing nondeterministic failures.
How (the fix)
Add a minimal background heartbeat pumper inside the test:
- Schedule rate:
periodMs = max(50ms, sessionTimeoutMillis / 3)to guarantee multiple heartbeats per timeout window. - Phase 1 (stabilize): heartbeat all brokers while creating topics and applying the initial broker-level config.
- Phase 2 (fencing): set a flag to heartbeat only
brokersToKeepUnfencedright beforewaitForConditionthat fences brokers 2 and 3.
This preserves the test’s intent (only broker 1 remains unfenced; brokers 2,3 fence) while eliminating timing races introduced by sleeps/awaits.
Verification of stability
With the heartbeat pumper in place and the flakiness sleep enabled, the loop below completed without failures:
pass=0; fail=0; for i in {1..50}; do \
echo "=== Run $i ==="; \
./gradlew :metadata:test \
--tests 'org.apache.kafka.controller.QuorumControllerTest' \
-Pkafka.test.run.flaky=true \
-PmaxParallelForks=1 -Dorg.gradle.workers.max=1 -Dtest.forkEvery=1 \
--rerun-tasks --no-daemon --no-build-cache && pass=$((pass+1)) || fail=$((fail+1)); \
done; echo "Summary: $pass pass / $fail fail"
Note: I ran the loop 50 times (compared to the 5 times in reproducing the flakiness case) to increase trust in the fix
Scope
- Test-only change.
- No production behavior/APIs altered.
- Keeps existing assertions and ISR/ELR expectations intact.
A label of 'needs-attention' was automatically added to this PR in order to raise the
attention of the committers. Once this issue has been triaged, the triage label
should be removed to prevent this automation from happening again.