
SOLR-17294: The stall detection in the ConcurrentUpdateSolrClients easily detects false positives.

Open markrmiller opened this issue 1 year ago • 2 comments

The current stall detection mechanism in the ConcurrentUpdateSolrClients is prone to false positives, especially under load. As pointed out by Jason, the existing approach only samples the queue size periodically to detect stalls. That is insufficient: when the system is under load, the queue can stay full for extended periods even though requests are still being processed and no actual stall has occurred.
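To illustrate the failure mode, here is a minimal simulation, not the actual ConcurrentUpdateSolrClient code. The class and method names are hypothetical. A consumer drains one item per tick and a fast producer refills the slot immediately, so a sampler that looks only at "is the queue full?" sees a full queue on every sample, while a sampler that compares a processed-item counter sees progress on every sample.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical demo class; it does not mirror Solr's internals.
public class QueueSizeFalsePositive {

    /**
     * Runs a single-threaded simulation for the given number of ticks.
     * Returns {samplesWhereQueueWasFull, samplesWithNoProgress}.
     */
    public static int[] simulate(int ticks) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(4);
        for (int i = 0; i < 4; i++) {
            queue.offer(i); // start with a full queue, as under heavy load
        }

        long processed = 0;      // items the consumer has taken so far
        long lastProcessed = 0;  // counter value seen at the previous sample
        int fullSamples = 0;
        int noProgressSamples = 0;

        for (int tick = 0; tick < ticks; tick++) {
            // The consumer makes progress...
            if (queue.poll() != null) {
                processed++;
            }
            // ...and the producer refills the free slot right away.
            queue.offer(100 + tick);

            // Queue-size-only check: looks like a stall on every sample.
            if (queue.remainingCapacity() == 0) {
                fullSamples++;
            }
            // Progress-based check: the counter moved, so no stall.
            if (processed == lastProcessed) {
                noProgressSamples++;
            }
            lastProcessed = processed;
        }
        return new int[] {fullSamples, noProgressSamples};
    }

    public static void main(String[] args) {
        int[] result = simulate(10);
        System.out.println("full samples: " + result[0]
            + ", no-progress samples: " + result[1]);
    }
}
```

In this simulation a detector keyed on queue fullness alone would eventually hit its timeout, even though every tick processed an item.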

markrmiller avatar May 15 '24 01:05 markrmiller

There is no real reason you need to check the size of the queue as well, but I just kept it anyway.

markrmiller avatar May 15 '24 01:05 markrmiller

Leaving this one in Draft for now, but the last commit brings it closer to what I had in mind. I still have to review it and run the tests.

markrmiller avatar May 30 '24 16:05 markrmiller

Was just reminded about this recently when I tried to do some bulk indexing. To me, this Band-Aid ended up worse than what it was trying to cover. When it works, it just hides the underlying issue from developers, making it unlikely to ever be fixed. But it has never really worked as long as you are indexing fast enough to keep the queue full for the stall timeout. And at least with http2, when it kicks in, it totally botches the indexing job from what I've seen. Stall exceptions lead to cancelled stream errors, which lead to the whole indexing job grinding down until you're using all the CPU you had been using when properly indexing, but with no progress.

Regardless, I've removed the Draft label and run the tests.

markrmiller avatar Apr 08 '25 16:04 markrmiller

I guess the one gap here is tests - is there any way to refactor the stall-detection logic into its own class in a way that makes it easier to validate when stalls get detected, when the stall-timer gets reset, etc?
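One possible shape for such a class, as a rough sketch only and not the code in this PR: the detector takes its time source as a `LongSupplier`, so a test can advance a fake clock and assert exactly when the timer starts, resets, and fires. The class name `StallDetectorSketch`, the `check` method, and its parameters are all hypothetical.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch; names and signatures are assumptions, not Solr API.
public class StallDetectorSketch {
    private final long stallTimeoutNanos;
    private final LongSupplier nanoClock; // System::nanoTime in production
    private long lastProcessedCount = Long.MIN_VALUE;
    private boolean timing = false;       // true while a stall timer runs
    private long timerStartNanos;

    public StallDetectorSketch(long stallTimeoutNanos, LongSupplier nanoClock) {
        this.stallTimeoutNanos = stallTimeoutNanos;
        this.nanoClock = nanoClock;
    }

    /**
     * Call periodically. Returns true only when the queue has been full and
     * the processed counter has not changed for at least the stall timeout.
     */
    public boolean check(long processedCount, boolean queueFull) {
        long now = nanoClock.getAsLong();
        if (!queueFull || processedCount != lastProcessedCount) {
            // Progress was made (or the queue has room): reset the timer.
            lastProcessedCount = processedCount;
            timing = false;
            return false;
        }
        if (!timing) {
            // First sample with no progress: start the timer.
            timing = true;
            timerStartNanos = now;
            return false;
        }
        // Subtraction keeps the comparison safe if nanoTime wraps around.
        return now - timerStartNanos >= stallTimeoutNanos;
    }

    public static void main(String[] args) {
        long[] fakeNow = {0};
        StallDetectorSketch detector =
            new StallDetectorSketch(100, () -> fakeNow[0]);
        detector.check(5, true);   // establishes the baseline count
        fakeNow[0] = 50;
        detector.check(5, true);   // no progress: timer starts at 50
        fakeNow[0] = 150;
        System.out.println("stalled: " + detector.check(5, true));
    }
}
```

With the clock injected, a unit test needs no sleeps or real threads, which should also help avoid timing-dependent flakiness.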

gerlowskija avatar Apr 08 '25 17:04 gerlowskija

@markrmiller this is creating some flaky failures. Such as: https://jenkins.thetaphi.de/view/Solr/job/Solr-main-Linux/24834/

HoustonPutman avatar Apr 22 '25 16:04 HoustonPutman

I rewrote that test method. Will put up tomorrow.

markrmiller avatar Apr 23 '25 08:04 markrmiller