
SOLR-18025 Test LeaderTragicEventTest flaky

Open • janhoy opened this issue 1 month ago • 1 comment

Based on Develocity data from last 28 days:

  • 42 flaky test occurrences (4% flaky rate)
  • 1,110 passed occurrences (96%)
  • Mean execution time: 8.0 seconds

https://issues.apache.org/jira/browse/SOLR-18025

I first tagged this as @AwaitsFix but then tasked Claude Code with hunting down the root cause. It claims to have found something, although I don't understand the tragic-event stuff here at all. The last commit contains the fix; here is the analysis:

Root Cause

The test is flaky due to LUCENE-8692: IndexWriter.getTragicException() may not reliably reflect all corrupting exceptions, particularly NoSuchFileException.

  1. Test sets TestInjection.leaderTragedy = "true:100" and sends an update
  2. TestInjection.injectLeaderTragedy() calls writer.get().onTragicEvent() to inject a tragic event
  3. The update throws a SolrException (expected behavior)
  4. RequestHandlerBase catches this and calls coreContainer.checkTragicException(core)
  5. Problem: getTragicException() sometimes returns null due to the Lucene bug
  6. If null, the leader doesn't call giveupLeadership()
  7. Test waits for a new leader that never gets elected
  8. Test fails when that wait times out
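
To make steps 5 and 6 concrete, here is a minimal sketch of the failure mode. It is not Solr's actual code: the class `TragicCheckSketch`, the method `checkAndGiveUpLeadership`, and its two parameters are hypothetical stand-ins for the check done via coreContainer.checkTragicException(core) and the subsequent giveupLeadership() call. It only illustrates that a null result from getTragicException() means leadership is never given up.

```java
import java.nio.file.NoSuchFileException;
import java.util.function.Supplier;

// Hypothetical illustration only; names and structure are assumptions,
// not the real Solr implementation.
class TragicCheckSketch {

    // Gives up leadership only when a tragic exception is reported.
    // Returns true if leadership was given up.
    static boolean checkAndGiveUpLeadership(Supplier<Throwable> getTragicException,
                                            Runnable giveUpLeadership) {
        Throwable tragic = getTragicException.get();
        if (tragic == null) {
            // The flaky path: the tragedy happened but is not reported,
            // so nothing triggers a new leader election.
            return false;
        }
        giveUpLeadership.run();
        return true;
    }

    public static void main(String[] args) {
        Runnable giveUp = () -> System.out.println("giving up leadership");

        // Tragedy not reflected by the writer: leadership is kept.
        System.out.println(checkAndGiveUpLeadership(() -> null, giveUp));

        // Tragedy reflected by the writer: leadership is given up.
        System.out.println(checkAndGiveUpLeadership(
            () -> new NoSuchFileException("segments_1"), giveUp));
    }
}
```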

The Fix

Strategy: Add retry logic in corruptLeader() to ensure the tragic event is reliably detected and leadership is given up.

Changes Made:

  1. Retry mechanism (up to 3 attempts with 100ms delays):

    • If getTragicException() fails to detect the tragedy on first attempt, subsequent update requests trigger additional checks
    • Each attempt uses a unique document ID to avoid conflicts
    • Stops immediately when tragedy is successfully triggered
  2. Better exception handling:

    • Catches both RemoteSolrException (500/404) and AlreadyClosedException
    • AlreadyClosedException indicates the core is already closing due to leadership handoff
  3. Enhanced logging:

    • Logs each attempt and outcome
    • Helps diagnose failures in CI
  4. Changed annotation:

    • Changed from @AwaitsFix to @BadApple since we're attempting to fix it
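
The retry mechanism from item 1 could look roughly like the sketch below. This is not the code from the commit: `RetrySketch`, `retryUntilTriggered`, the `sendUpdate` predicate, and the document ID prefix are all hypothetical, and the predicate stands in for "send an update and report whether it failed in the way that indicates the tragedy was detected". Only the attempt count, the delay, the unique ID per attempt, and the early exit are taken from the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of the retry loop; not the actual test code.
class RetrySketch {

    // Tries up to maxAttempts times, using a unique document ID per attempt,
    // and stops as soon as sendUpdate reports that the tragedy was triggered.
    // Returns the 1-based attempt that succeeded, or -1 if none did.
    static int retryUntilTriggered(int maxAttempts, long delayMs,
                                   Predicate<String> sendUpdate,
                                   List<String> idsTried) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String docId = "tragedy-attempt-" + attempt; // unique per attempt
            idsTried.add(docId);
            if (sendUpdate.test(docId)) {
                return attempt; // stop immediately on success
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMs); // pause before the next attempt
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return -1;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        // Simulated update: the tragedy is only detected on the second attempt.
        int attempt = retryUntilTriggered(3, 100L, id -> id.endsWith("-2"), ids);
        System.out.println("triggered on attempt " + attempt + ", ids tried: " + ids);
    }
}
```

In the simulated run, detection succeeds on the second attempt, so the loop exits after two IDs and the third attempt is never made.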

Why This Fix Works

  • Even if getTragicException() intermittently returns null, retrying gives multiple chances for it to be properly detected
  • The 100ms delay between attempts allows the system to stabilize
  • Three attempts provide a good balance between reliability and test execution time
  • The unique document IDs per attempt prevent conflicts and don't affect the final document count (these docs fail before being committed)

Files Modified

  • solr/core/src/test/org/apache/solr/cloud/LeaderTragicEventTest.java (+44/-10 lines)

janhoy • Dec 10 '25 13:12

This test has a long history of issues: https://issues.apache.org/jira/issues/?jql=summary%20~%20%22LeaderTragicEventTest%22

janhoy • Dec 10 '25 14:12