SOLR-18024 Flaky test CloudHttp2SolrClientTest.testHttpCspPerf
Based on Develocity data from last 28 days:
- 85 flaky test occurrences (7% flaky rate)
- 1,060 passed occurrences (93%)
- Mean execution time: 32 seconds
https://issues.apache.org/jira/browse/SOLR-18024
I first marked this test @AwaitsFix but then tasked Claude Code with tracking down the flakiness, and he claims to have fixed it in the last commit. Description from AI analysis:
Analysis Summary
Root Cause
The test was flaky due to a race condition involving multiple BaseHttpClusterStateProvider instances:
-
The Problem Chain:
- @BeforeClass setupCluster() creates a MiniSolrCloudCluster with a shared CloudSolrClient
- That shared client's BaseHttpClusterStateProvider is initialized with the default cacheTimeout=5 seconds
- This provider schedules a background job to refresh live nodes every 2.5 seconds (half the cache timeout)
-
The Test's Intent:
- Line 304: Sets SYS_PROP_CACHE_TIMEOUT_SECONDS to Integer.MAX_VALUE
- Intent: Prevent background jobs from polluting the admin call counts
-
What Actually Happened:
- Line 308: Uses cluster.getSolrClient() (the SHARED client with default timeout) to create collection
- Line 312: Creates a NEW CloudSolrClient that respects Integer.MAX_VALUE
- Both clients are now active!
- The shared client's background job fires ~2.5 seconds later, adding unexpected CLUSTERSTATUS calls
- LogListener counts these extra calls, causing assertion failures
-
Evidence from Failure Log: 53625ms: CLUSTERSTATUS (expected - from test client) 53627ms: CLUSTERSTATUS (expected - from test client) 53653ms: CLUSTERSTATUS (UNEXPECTED - from shared cluster client's background job!)
The Fix
Changed: Create collection using the test's own client instead of the shared cluster client
Key improvements:
- Create test's CloudSolrClient BEFORE creating the collection (line 311)
- Use that client to create the collection (line 313-314) instead of cluster.getSolrClient()
- Move LogListener creation to AFTER collection creation (line 318) to avoid counting collection setup calls
This ensures only ONE CloudSolrClient exists during the test with the correct Integer.MAX_VALUE timeout, eliminating spurious background refresh calls.
I am a bit ambivalent... I believe Claude, however I really dislike We avoid using cluster.getSolrClient() because it may have a background refresh job because how would any human being understand that? There isn't anything to help protect the poor human being writing a test from using the cluster.getSolrClient() method... I really wish there were some strong patterns in our tests (not fifteen different weak patterns!) that helped us write tests that would not be flaky. Without the level of deep deep Solr knowledge that some folks have.
Or, do we need a pattern that says "When writing tests, make sure to run Claude to review it because you can't possibly keep all the things in your head and there is no pattern".
I am a bit ambivalent... I believe Claude, however I really dislike
We avoid using cluster.getSolrClient() because it may have a background refresh jobbecause how would any human being understand that? There isn't anything to help protect the poor human being writing a test from using thecluster.getSolrClient()method... I really wish there were some strong patterns in our tests (not fifteen different weak patterns!) that helped us write tests that would not be flaky. Without the level of deep deep Solr knowledge that some folks have.Or, do we need a pattern that says "When writing tests, make sure to run Claude to review it because you can't possibly keep all the things in your head and there is no pattern".
Also, let me just say that if this fixes the bug, great and LGTM.
There isn't anything to help protect the poor human being writing a test from using the
cluster.getSolrClient()method...
I think there is nothing wrong in general in using the cluster client. But the flaky pattern being used in this particular test is imo the reliance of LogListener, i.e. counting number of admin commands in the logs, see comment on top of original test method:
// This ensures CH2SC is caching cluster status by counting the number of logged calls to the
// admin endpoint. Too many calls to CLUSTERSTATUS might mean insufficient caching and
// performance regressions!
// BaseHttpClusterStateProvider has a background job that pre-fetches data from CLUSTERSTATUS
// on timed intervals. This can pollute this test, so we set the interval very high to
// prevent it from running.
The comment says explicitly that it is brittle. I don't know for sure whether Claude made it less brittle, but it sounds plausible that the SYS_PROP_CACHE_TIMEOUT_SECONDS property must be set to Integer.MAX_VALUE before we instantiate the client. But if the cluster level client is already instantiated before the test method starts, it won't work and you get more pollution, and then the exact check assertEquals(2, adminLogs.getCount()); cannot be trusted, and you get flakiness.
I'll let Copilot judge Claude :)
Thanks for the explanation!
If the root cause is what it says it is (seems very plausible to me!), then why isn't this PR nothing more than simply moving this line with it's preceding comments:
System.setProperty(SYS_PROP_CACHE_TIMEOUT_SECONDS, "" + Integer.MAX_VALUE);
To the beginning of setupCluster? I would approve that, but the change here seems needlessly more involved and I'm unsure if it's actually as comprehensive as what I suggest. What I suggest covers all clients, including the several ones created at the start.
If the root cause is what it says it is (seems very plausible to me!), then why isn't this PR nothing more than simply moving this line with it's preceding comments:
System.setProperty(SYS_PROP_CACHE_TIMEOUT_SECONDS, "" + Integer.MAX_VALUE);To the beginning of setupCluster? I would approve that, but the change here seems needlessly more involved and I'm unsure if it's actually as comprehensive as what I suggest. What I suggest covers all clients, including the several ones created at the start.
Well spotted and for zooming out from the context of the failing test. This solution is more elegant. I added collection, and ran tests.iters many times, as well as the whole suite with tests.dups (some tests don't support tests.iters), all looks goodl