Flakiness in LB2SolrClientTest.testTwoServers
On Crave runs, about one failure per week.
> Task :solr:core:wipeTaskTemp
ERROR: The following test(s) have failed:
- org.apache.solr.client.solrj.impl.LB2SolrClientTest.testTwoServers (:solr:solrj)
Test history:
https://develocity.apache.org/scans/tests?search.rootProjectNames=solr-root&tests.container=org.apache.solr.client.solrj.impl.LB2SolrClientTest&tests.test=testTwoServers
http://fucit.org/solr-jenkins-reports/history-trend-of-recent-failures.html#series/org.apache.solr.client.solrj.impl.LB2SolrClientTest.testTwoServers
Test output: /tmp/src/solr/solr/solrj/build/test-results/test/outputs/OUTPUT-org.apache.solr.client.solrj.impl.LB2SolrClientTest.txt
Reproduce with: ./gradlew :solr:solrj:test --tests "org.apache.solr.client.solrj.impl.LB2SolrClientTest.testTwoServers" "-Ptests.jvmargs=-XX:TieredStopAtLevel=1
-XX:+UseParallelGC -XX:ActiveProcessorCount=1 -XX:ReservedCodeCacheSize=120m" -Ptests.seed=BFC8AC5F327E55CE -Ptests.timeoutSuite=600000! -Ptests.useSecurityManager=true
-Ptests.file.encoding=US-ASCII
The failure did not reproduce locally even after beasting, which makes it likely a timing issue that may be more visible on the ultra-fast Crave hardware. The failure was analyzed by an AI, and a possible (though unproven) fix is to add a small sleep between observing the new Jetty start up (being added to liveNodes) and actually sending requests to it.
Adding you @dsmiley as you have touched SolrJ a lot lately, but I don't think any of the recent work causes these test failures.
Due to recent renames, the Develocity history for this test is split across two class names. Older history is available under LBHttp2SolrClientIntegrationTest here https://develocity.apache.org/scans/tests?search.relativeStartTime=P28D&search.rootProjectNames=solr-root&search.timeZoneId=Europe%2FOslo&tests.container=org.apache.solr.client.solrj.impl.LBHttp2SolrClientIntegrationTest which confirms that this test has been flaky for a long time.
More of a general question... How can we make our test platform not require "magic" per-class changes like this, and instead solve the problem globally? Could this have been solved in our core code? Is it possible that in real life someone would hit the same issue of spinning up a Jetty and then sending requests to it? Could we have a built-in retry, or a liveness probe, instead?
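To make the "built-in probe" idea concrete, here is a minimal, self-contained sketch. It is not Solr code: awaitReady and the simulated health check are illustrative stand-ins for a real client pinging the new node before routing requests to it.

```java
import java.util.function.BooleanSupplier;

public class LivenessProbe {
    /**
     * Polls a health check until it passes or attempts run out. Callers only
     * send real requests once the node has proven it can actually serve them,
     * instead of trusting its mere presence in liveNodes.
     */
    static boolean awaitReady(BooleanSupplier healthCheck, int maxAttempts, long pauseMs)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (healthCheck.getAsBoolean()) {
                return true;
            }
            Thread.sleep(pauseMs); // back off briefly before probing again
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulated node: healthy from the third probe onward, mimicking a
        // Jetty that is already listed in liveNodes but not yet serving.
        int[] probes = {0};
        boolean ready = awaitReady(() -> ++probes[0] >= 3, 10, 1);
        System.out.println("ready=" + ready + " probes=" + probes[0]); // ready=true probes=3
    }
}
```

Unlike a fixed sleep, this bounds the wait on the happy path: a fast server passes the first probe immediately, and only a slow one pays the pauses.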
And the failing precommit due to Antora / Node is also quite annoying; I feel like I'm seeing it all the time.
* What went wrong:
Execution failed for task ':solr:solr-ref-guide:buildLocalAntoraSite'.
> Process 'command '/home/runner/work/solr/solr/solr/solr-ref-guide/.gradle/node/nodejs/node-v22.18.0-linux-x64/bin/npx'' finished with non-zero exit value 1
<BEGIN RANT> Flaky tests are a productivity drain on the whole community. Seeing a bunch of solrbot PRs red due to test flakiness delays dependency upgrades and reduces velocity.
Also, could we somehow split the test suite into two tiers, where a "core" tier is what runs on every PR and with a normal gradle test invocation, and the next tier contains the long-running and flaky tests? We already have a nightly tier; perhaps we can just move more tests to nightly, dunno.
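Much of the machinery for this may already exist: the Lucene/Solr test framework gates tests marked @Nightly behind a build property. Treat the exact flags below as illustrative rather than authoritative, but the tiering could look roughly like:

```shell
# Tier 1 (every PR): plain run; tests annotated @Nightly are skipped by default
./gradlew test

# Tier 2 (scheduled job): also run the long-running/flaky tests moved under @Nightly
./gradlew test -Ptests.nightly=true
```

Moving known-flaky tests under the annotation would keep PR checks green without deleting coverage outright.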
But there is an elephant in this room: we have been lazy, and almost all our tests are integration tests, where perhaps half of them could have been written as unit tests with mocks etc. Our test suite should be runnable by a normal drive-by contributor in 5-10 minutes. Not, as today, 1.5 hours with half the runs failing. </END RANT>
I looked at this for 15 minutes just now. This may help a little, but I'm not optimistic; it's obviously just a bandaid and a test anti-pattern -- what I tell new engineers not to do. I wish we had better means of reproducing flaky tests other than beasting. In particular, I wonder if there's a technique and/or tool that can simulate the JVM slowing down a lot without actually burdening one's machine.
I suspect the average test on Crave takes longer, but the suite is massively parallelized, running a huge number of tests at once.
A better compromise, without adding more sleeps on the happy path, would be to execute the check that follows this call with org.apache.solr.common.util.RetryUtil. That, I'd get behind.
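For readers unfamiliar with the pattern, it looks roughly like the sketch below. This is a self-contained stand-in, not Solr's actual helper: the real org.apache.solr.common.util.RetryUtil may have different method names and signatures.

```java
import java.util.concurrent.TimeUnit;

public class RetryUtilSketch {
    interface Check { void run() throws Exception; }

    /** Re-runs a failing check until it passes or the attempts are exhausted. */
    static void retryUntil(Check check, int attempts, long pause, TimeUnit unit)
            throws Exception {
        Throwable last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                check.run();
                return; // check passed; no extra waiting on the happy path
            } catch (AssertionError | Exception e) {
                last = e;
                unit.sleep(pause); // give the slow server time to catch up
            }
        }
        throw new AssertionError("check never passed", last);
    }

    public static void main(String[] args) throws Exception {
        int[] responses = {0};
        // The assertion fails until the (simulated) new server starts answering.
        retryUntil(() -> {
            if (++responses[0] < 3) throw new AssertionError("server not answering yet");
        }, 10, 1, TimeUnit.MILLISECONDS);
        System.out.println("passed after " + responses[0] + " attempts"); // passed after 3 attempts
    }
}
```

The key property is that a fast environment passes on the first attempt with zero added latency, while a slow one retries instead of failing the build.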
Thanks for bringing attention to RetryUtil; I hadn't seen it before. I switched to that strategy and improved RetryUtil's docs at the same time.