kafka icon indicating copy to clipboard operation
kafka copied to clipboard

KAFKA-18981: Fix flaky test QuorumControllerTest.testMinIsrUpdateWithElr

Open Anshul-creator opened this issue 2 months ago • 1 comments

[KAFKA-18981] Deflake testMinIsrUpdateWithElr by maintaining broker heartbeats during waits

What

This PR removes nondeterminism in testMinIsrUpdateWithElr by ensuring broker sessions do not expire while the test is waiting/sleeping, through a lightweight background heartbeat pumper.

JIRA

KAFKA-18981

Reproduction of flakiness

The test is stable until you insert a sleep equal to the controller session timeout immediately after topic creation. With the original code:

  1. In testMinIsrUpdateWithElr, add the line right after:

    CreateTopicsResponseData createTopicsResponseData = active.createTopics(
        ANONYMOUS_CONTEXT, createTopicsRequestData, Set.of("foo", "bar")).get();
    

    Insert:

    Thread.sleep(sessionTimeoutMillis);
    
  2. Run the test loop a few times:

    pass=0; fail=0; for i in {1..5}; do \
     echo "=== Run $i ==="; \
     ./gradlew :metadata:test \
     --tests 'org.apache.kafka.controller.QuorumControllerTest' \
     -Pkafka.test.run.flaky=true \
     -PmaxParallelForks=1 -Dorg.gradle.workers.max=1 -Dtest.forkEvery=1 \
     --rerun-tasks --no-daemon --no-build-cache && pass=$((pass+1)) || fail=$((fail+1)); \
    done; echo "Summary: $pass pass / $fail fail"
    

Why this flakes

  • The controller fences brokers that miss heartbeats for sessionTimeoutMillis.
  • After initially unfencing all brokers, the test performs a sleep/wait before beginning the fencing phase.
  • That pause allows the survivor broker (id 1) to lapse its session and get fenced transiently.
  • Subsequent ISR/ELR assertions sometimes observe this unintended state, causing nondeterministic failures.

How (the fix)

Add a minimal background heartbeat pumper inside the test:

  • Schedule rate: periodMs = max(50ms, sessionTimeoutMillis / 3) to guarantee multiple heartbeats per timeout window.
  • Phase 1 (stabilize): heartbeat all brokers while creating topics and applying the initial broker-level config.
  • Phase 2 (fencing): set a flag to heartbeat only brokersToKeepUnfenced right before waitForCondition that fences brokers 2 and 3.

This preserves the test’s intent (only broker 1 remains unfenced; brokers 2,3 fence) while eliminating timing races introduced by sleeps/awaits.

Verification of stability

With the heartbeat pumper in place and the flakiness sleep enabled, the loop below completed without failures:

pass=0; fail=0; for i in {1..50}; do \
 echo "=== Run $i ==="; \
 ./gradlew :metadata:test \
 --tests 'org.apache.kafka.controller.QuorumControllerTest' \
 -Pkafka.test.run.flaky=true \
 -PmaxParallelForks=1 -Dorg.gradle.workers.max=1 -Dtest.forkEvery=1 \
 --rerun-tasks --no-daemon --no-build-cache && pass=$((pass+1)) || fail=$((fail+1)); \
done; echo "Summary: $pass pass / $fail fail"

Note: I ran the loop 50 times (compared to the 5 times in reproducing the flakiness case) to increase trust in the fix

Scope

  • Test-only change.
  • No production behavior/APIs altered.
  • Keeps existing assertions and ISR/ELR expectations intact.

Anshul-creator avatar Oct 06 '25 21:10 Anshul-creator

A label of 'needs-attention' was automatically added to this PR in order to raise the attention of the committers. Once this issue has been triaged, the triage label should be removed to prevent this automation from happening again.

github-actions[bot] avatar Oct 14 '25 03:10 github-actions[bot]