SOLR-18025 Test LeaderTragicEventTest flaky

Open janhoy opened this issue 1 month ago • 1 comments

Based on Develocity data from last 28 days:

42 flaky test occurrences (4% flaky rate)
1,110 passed occurrences (96%)
Mean execution time: 8.0 seconds

https://issues.apache.org/jira/browse/SOLR-18025

I first tagged this as @AwaitsFix but then tasked Claude Code with hunting down the root cause. It claims to have found something, although I don't understand the tragic-event handling here at all. The last commit contains the fix; here is the analysis:

Root Cause

The test is flaky due to LUCENE-8692: IndexWriter.getTragicException() may not reliably reflect all corrupting exceptions, particularly NoSuchFileException.

Test sets TestInjection.leaderTragedy = "true:100" and sends an update
TestInjection.injectLeaderTragedy() calls writer.get().onTragicEvent() to inject a tragic event
The update throws a SolrException (expected behavior)
RequestHandlerBase catches this and calls coreContainer.checkTragicException(core)
Problem: getTragicException() sometimes returns null because of LUCENE-8692
If null, the leader doesn't call giveupLeadership()
Test waits for a new leader that never gets elected
Test fails when that wait times out
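
The failure path above can be sketched with a small stand-alone model. This is only an illustration: FakeWriter, FakeLeader, and the boolean gaveUpLeadership flag are hypothetical stand-ins, not Solr or Lucene classes, and the real checkTragicException() in CoreContainer may do more than this.

```java
import java.io.IOException;

// Simplified model of the described failure path. None of these types are
// Solr's or Lucene's real classes; they only mirror the steps listed above.
class TragicCheckSketch {

    /** Hypothetical stand-in for a writer whose tragic exception may be missing. */
    static final class FakeWriter {
        private final Throwable tragedy; // null models the LUCENE-8692 case

        FakeWriter(Throwable tragedy) {
            this.tragedy = tragedy;
        }

        Throwable getTragicException() {
            return tragedy;
        }
    }

    /** Hypothetical stand-in for the leader; records whether it gave up leadership. */
    static final class FakeLeader {
        boolean gaveUpLeadership = false;

        // Only a non-null tragic exception leads to a leadership handoff.
        boolean checkTragicException(FakeWriter writer) {
            if (writer.getTragicException() == null) {
                return false; // nothing detected, so the leader keeps its role
            }
            gaveUpLeadership = true;
            return true;
        }
    }

    public static void main(String[] args) {
        FakeLeader detected = new FakeLeader();
        detected.checkTragicException(new FakeWriter(new IOException("injected")));
        System.out.println("tragedy recorded, gave up: " + detected.gaveUpLeadership);

        FakeLeader missed = new FakeLeader();
        missed.checkTragicException(new FakeWriter(null));
        System.out.println("tragedy missing, gave up: " + missed.gaveUpLeadership);
    }
}
```

In the second case the leader never steps down, which corresponds to the test waiting for an election that does not happen.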

The Fix

Strategy: Add retry logic in corruptLeader() to ensure the tragic event is reliably detected and leadership is given up.

Changes Made:

Retry mechanism (up to 3 attempts with 100ms delays):
- If getTragicException() fails to detect the tragedy on the first attempt, subsequent update requests trigger additional checks
- Each attempt uses a unique document ID to avoid conflicts
- Stops immediately when tragedy is successfully triggered
Better exception handling:
- Catches both RemoteSolrException (HTTP 500/404) and AlreadyClosedException
- AlreadyClosedException indicates the core is already closing due to leadership handoff
Enhanced logging:
- Logs each attempt and outcome
- Helps diagnose failures in CI
Changed annotation:
- Changed from @AwaitsFix to @BadApple since we're attempting to fix it
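
The retry mechanism described above can be sketched roughly as follows. This is a minimal stand-alone sketch, not the code from the commit: retryUntilTriggered, docIdFor, and the "tragic-doc-" ID format are hypothetical names, and the attempt callback stands in for sending an update to the leader and checking whether the tragedy was triggered.

```java
import java.util.function.IntPredicate;

// Generic sketch of "up to N attempts with a fixed delay, stop on first success".
// All names here are hypothetical; the real change lives in corruptLeader().
class RetrySketch {

    /**
     * Calls attempt.test(i) for i = 1..maxAttempts, sleeping delayMillis after
     * each failed attempt except the last. Returns the 1-based number of the
     * attempt that succeeded, or -1 if none did or the thread was interrupted.
     */
    static int retryUntilTriggered(int maxAttempts, long delayMillis, IntPredicate attempt) {
        for (int i = 1; i <= maxAttempts; i++) {
            if (attempt.test(i)) {
                return i; // stop immediately once the tragedy is triggered
            }
            if (i < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    return -1;
                }
            }
        }
        return -1;
    }

    /** Unique document ID per attempt; the format is an assumption. */
    static String docIdFor(int attemptNumber) {
        return "tragic-doc-" + attemptNumber;
    }

    public static void main(String[] args) {
        int succeededOn = retryUntilTriggered(3, 100, i -> {
            System.out.println("attempt " + i + " with id " + docIdFor(i));
            return i == 2; // pretend detection is missed once, then succeeds
        });
        System.out.println("triggered on attempt " + succeededOn);
    }
}
```

Returning the attempt number rather than a boolean makes it easy to log which attempt succeeded, in line with the logging change listed above.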

Why This Fix Works

Even if getTragicException() intermittently returns null, retrying gives multiple chances for it to be properly detected
The 100ms delay between attempts allows the system to stabilize
Three attempts provide a good balance between reliability and test execution time
The unique document IDs per attempt prevent conflicts and don't affect the final document count (these docs fail before being committed)
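
As a rough back-of-the-envelope illustration of the first point, assume (and this is only an assumption, not something measured) that each attempt misses the tragedy independently with probability p, and take the 4% flaky rate reported above as a stand-in for p:

```latex
P(\text{all 3 attempts miss}) = p^{3} = 0.04^{3} = 6.4 \times 10^{-5}
```

If misses are correlated across attempts, for example because the writer stays in the same state, the real improvement would be smaller than this estimate suggests.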

Files Modified

solr/core/src/test/org/apache/solr/cloud/LeaderTragicEventTest.java (+44/-10 lines)

Dec 10 '25 13:12 janhoy

This test has a long history of issues: https://issues.apache.org/jira/issues/?jql=summary%20~%20%22LeaderTragicEventTest%22

Dec 10 '25 14:12 janhoy