SOLR-18025 Test LeaderTragicEventTest flaky
Based on Develocity data from the last 28 days:
- 42 flaky test occurrences (4% flaky rate)
- 1,110 passed occurrences (96%)
- Mean execution time: 8.0 seconds
https://issues.apache.org/jira/browse/SOLR-18025
I first tagged this as @AwaitsFix, but then tasked Claude Code with hunting down the root cause. It claims to have found something, although I don't understand the tragic-event handling here at all. The last commit contains the fix; here is the analysis:
Root Cause
The test is flaky due to LUCENE-8692: IndexWriter.getTragicException() may not reliably reflect all corrupting exceptions, particularly NoSuchFileException.
- Test sets TestInjection.leaderTragedy = "true:100" and sends an update
- TestInjection.injectLeaderTragedy() calls writer.get().onTragicEvent() to inject a tragic event
- The update throws a SolrException (expected behavior)
- RequestHandlerBase catches this and calls coreContainer.checkTragicException(core)
- Problem: getTragicException() sometimes returns null due to the Lucene bug
- If null, the leader doesn't call giveupLeadership()
- Test waits for a new leader that never gets elected
- Test fails with a timeout
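The failure path above can be modeled in a few lines. This is a simplified sketch and not Solr's actual code: the class TragicCheckSketch, the Leader type, and the Supplier-based signature are hypothetical stand-ins, chosen only to show that a null result from getTragicException() means leadership is never given up.

```java
import java.util.function.Supplier;

public class TragicCheckSketch {
    /** Hypothetical stand-in for the leader; records whether leadership was given up. */
    static final class Leader {
        boolean gaveUpLeadership = false;

        void giveUpLeadership() {
            gaveUpLeadership = true;
        }
    }

    /**
     * Simplified model of the check described above (assumption: the real logic
     * is more involved). Leadership is only given up when the writer reports a
     * non-null tragic exception.
     */
    static boolean checkTragicException(Supplier<Throwable> getTragicException, Leader leader) {
        Throwable tragic = getTragicException.get();
        if (tragic == null) {
            // Nothing reported: no handoff, so the test keeps waiting for a new leader.
            return false;
        }
        leader.giveUpLeadership();
        return true;
    }

    public static void main(String[] args) {
        Leader reported = new Leader();
        checkTragicException(() -> new java.io.IOException("injected tragedy"), reported);
        System.out.println("tragedy reported, gave up leadership: " + reported.gaveUpLeadership);

        Leader missed = new Leader();
        checkTragicException(() -> null, missed);
        System.out.println("tragedy missed, gave up leadership: " + missed.gaveUpLeadership);
    }
}
```

In the second case the leader stays in place, which matches the timeout seen in the flaky runs.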
The Fix
Strategy: Add retry logic in corruptLeader() to ensure the tragic event is reliably detected and leadership is given up.
Changes Made:
- Retry mechanism (up to 3 attempts with 100ms delays):
  - If getTragicException() fails to detect the tragedy on first attempt, subsequent update requests trigger additional checks
  - Each attempt uses a unique document ID to avoid conflicts
  - Stops immediately when tragedy is successfully triggered
- Better exception handling:
  - Catches both RemoteSolrException (500/404) and AlreadyClosedException
  - AlreadyClosedException indicates the core is already closing due to leadership handoff
- Enhanced logging:
  - Logs each attempt and outcome
  - Helps diagnose failures in CI
- Changed annotation:
  - Changed from @AwaitsFix to @BadApple since we're attempting to fix it
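For readers who want the shape of the retry without opening the diff, here is a minimal sketch of the loop described above. It is not the code from the commit: RetrySketch, corruptWithRetry, and the Predicate callback are hypothetical names, and the callback stands in for "send an update and report whether the tragedy was triggered" (in the real test that would include catching RemoteSolrException and AlreadyClosedException).

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

public class RetrySketch {
    static final int MAX_ATTEMPTS = 3;   // from the description above
    static final long DELAY_MS = 100;    // from the description above

    /**
     * Tries up to MAX_ATTEMPTS times, using a unique document id per attempt.
     * sendUpdate is a hypothetical callback that returns true once the tragedy
     * was triggered. Returns the 1-based attempt that succeeded, or -1 if none did.
     */
    static int corruptWithRetry(Predicate<String> sendUpdate) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            String docId = "tragic-doc-" + attempt; // unique id avoids conflicts
            if (sendUpdate.test(docId)) {
                return attempt; // stop immediately on success
            }
            if (attempt < MAX_ATTEMPTS) {
                try {
                    Thread.sleep(DELAY_MS); // pause only between attempts
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    return -1;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Simulate a tragedy that is only detected on the second update.
        int result = corruptWithRetry(id -> calls.incrementAndGet() == 2);
        System.out.println("succeeded on attempt " + result);
    }
}
```

Returning the attempt number rather than a boolean makes it easy to log how often the first attempt misses, which is the kind of signal the enhanced logging is meant to surface in CI.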
Why This Fix Works
- Even if getTragicException() intermittently returns null, retrying gives multiple chances for it to be properly detected
- The 100ms delay between attempts allows the system to stabilize
- Three attempts provide a good balance between reliability and test execution time
- The unique document IDs per attempt prevent conflicts and don't affect the final document count (these docs fail before being committed)
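As a rough back-of-the-envelope check of the "three attempts" trade-off, assume (and this is only an assumption) that each attempt misses the tragedy independently with probability p equal to the observed 4% flaky rate, and that the 100ms delay occurs only between attempts:

```latex
P(\text{all 3 attempts miss}) = p^{3} = 0.04^{3} = 6.4 \times 10^{-5}
\qquad
t_{\text{extra}} \le (3 - 1) \times 100\,\text{ms} = 200\,\text{ms}
```

Under those assumptions the added worst-case delay is small next to the 8.0 second mean execution time. If misses are correlated across attempts (for example, if the writer stays in the same state after the first miss), the real improvement would be smaller than this estimate.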
Files Modified
- solr/core/src/test/org/apache/solr/cloud/LeaderTragicEventTest.java (+44/-10 lines)
This test has a long history of issues: https://issues.apache.org/jira/issues/?jql=summary%20~%20%22LeaderTragicEventTest%22