SOLR-17294: The stall detection in the ConcurrentUpdateSolrClients easily detects false positives.
The current stall detection mechanism in the ConcurrentUpdateSolrClients is prone to generating false positives, especially under load. As pointed out by Jason, the existing approach simply intermittently monitors the queue size over time to detect stalls. However, this method is insufficient because the queue can report being full for extended periods when the system is under load, even if no actual stall has occurred.
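To make the false positive concrete, here is a minimal sketch of a queue-size-only check. It is an illustration under assumptions, not the actual ConcurrentUpdateSolrClient code: the class names, the abstract "tick" time unit, and the simulated producer/consumer loop are all hypothetical. The consumer processes one item per tick and the producer refills the slot immediately, so the queue is full at every sample even though work is progressing.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical queue-size-only stall check; not the Solr implementation. */
final class NaiveQueueSizeStallCheck {
    private final long stallTicks;
    private long fullSinceTick = -1; // -1: the queue was not full at the last sample

    NaiveQueueSizeStallCheck(long stallTicks) {
        this.stallTicks = stallTicks;
    }

    /** Returns true once the queue has looked full at every sample for stallTicks. */
    boolean looksStalled(BlockingQueue<?> queue, long nowTick) {
        if (queue.remainingCapacity() > 0) {
            fullSinceTick = -1; // saw free space, so restart the window
            return false;
        }
        if (fullSinceTick < 0) {
            fullSinceTick = nowTick; // first sample at which the queue was full
        }
        return nowTick - fullSinceTick >= stallTicks;
    }
}

final class NaiveQueueSizeStallCheckDemo {
    /**
     * Simulates a busy but healthy client. Returns how many items had been
     * processed when the check reported a stall, or -1 if it never did.
     */
    static int processedWhenFlagged() {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(2);
        queue.add(0);
        queue.add(1);
        NaiveQueueSizeStallCheck check = new NaiveQueueSizeStallCheck(10);
        int processed = 0;
        for (long tick = 0; tick <= 20; tick++) {
            queue.poll();               // consumer makes progress
            processed++;
            queue.add((int) tick + 2);  // producer refills the freed slot at once
            if (check.looksStalled(queue, tick)) {
                return processed;       // flagged despite steady progress
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("flagged after processing " + processedWhenFlagged() + " items");
    }
}
```

The check fires even though an item was processed on every tick, which is the behaviour described above: a full queue alone says nothing about whether the consumers are stuck.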
Strictly speaking, there is no need to check the queue size as well, but I kept that check anyway.
Leaving this one in Draft state for now, but the last commit brings it closer to what I had in mind. I still have to review it and run the tests.
Was just reminded about this recently when I tried to do some bulk indexing. To me, this Band-Aid ended up worse than what it was trying to cover. When it works, it just hides the underlying issue from developers, making it unlikely to ever be fixed. But it has never really worked whenever you are indexing fast enough to keep the queue full for the length of the stall timeout. And at least with HTTP/2, when it kicks in, it totally botches the indexing job from what I've seen. Stall exceptions lead to cancelled stream errors, which lead to the whole indexing job grinding down until you're using all the CPU you had been using when properly indexing, but with no progress.
Regardless, I've removed the Draft label and run the tests.
I guess the one gap here is tests - is there any way to refactor the stall-detection logic into its own class in a way that makes it easier to validate when stalls get detected, when the stall-timer gets reset, etc?
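One possible shape for such a refactor, as a rough sketch only: the class below is hypothetical and is not the code in this PR. It assumes a stall means "the queue is non-empty and the processed counter has not moved for the stall time", and it takes the clock as a `LongSupplier` so a test can advance time by hand instead of sleeping. The names `ProgressStallDetector`, `recordProcessed`, and `isStalled` are made up for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical progress-based stall detector with an injectable clock. */
final class ProgressStallDetector {
    private final long stallTimeNanos;
    private final LongSupplier nanoClock; // e.g. System::nanoTime in production
    private final AtomicLong processed = new AtomicLong();
    private long lastSeenProcessed = 0;
    private long stallStartNanos = -1; // -1: no stall window currently open

    ProgressStallDetector(long stallTimeNanos, LongSupplier nanoClock) {
        this.stallTimeNanos = stallTimeNanos;
        this.nanoClock = nanoClock;
    }

    /** Called by a consumer each time it finishes an item. */
    void recordProcessed() {
        processed.incrementAndGet();
    }

    /** True when work is queued and nothing was processed for stallTimeNanos. */
    synchronized boolean isStalled(int queueSize) {
        long current = processed.get();
        long now = nanoClock.getAsLong();
        if (queueSize == 0 || current != lastSeenProcessed) {
            lastSeenProcessed = current; // progress (or no work): reset the timer
            stallStartNanos = -1;
            return false;
        }
        if (stallStartNanos < 0) {
            stallStartNanos = now; // first check without progress opens the window
            return false;
        }
        return now - stallStartNanos >= stallTimeNanos;
    }
}

final class ProgressStallDetectorDemo {
    public static void main(String[] args) {
        AtomicLong fakeNow = new AtomicLong(); // hand-driven clock for the demo
        ProgressStallDetector detector = new ProgressStallDetector(100, fakeNow::get);

        System.out.println(detector.isStalled(5)); // opens the stall window
        detector.recordProcessed();                // progress resets the timer
        fakeNow.set(150);
        System.out.println(detector.isStalled(5));
        fakeNow.set(200);
        System.out.println(detector.isStalled(5)); // window reopens at t=200
        fakeNow.set(300);
        System.out.println(detector.isStalled(5)); // no progress for 100ns
    }
}
```

With the clock injected, a unit test can check each transition directly (window opened, timer reset on progress, stall reported at the timeout) without depending on real elapsed time, which should also help avoid timing-sensitive flakiness.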
@markrmiller this is creating some flaky failures, such as: https://jenkins.thetaphi.de/view/Solr/job/Solr-main-Linux/24834/
I rewrote that test method. Will put it up tomorrow.