solr icon indicating copy to clipboard operation
solr copied to clipboard

SOLR-17158 Terminate distributed processing quickly when query limit is reached

Open gus-asf opened this issue 1 year ago • 1 comments

In the case where partial results are not desirable, there is no reason to await any remaining requests, or to return any partial results. This change causes the &partialResults=false query parameter to short circuit processing and omit any results. This condition is indicated with a partialResults="omitted" value in the response (prior behavior was a value of true or absent)

https://issues.apache.org/jira/browse/SOLR-17158

This PR replaces https://github.com/apache/solr/pull/2379 which had become onerously stale.

gus-asf avatar Aug 26 '24 00:08 gus-asf

looking back at https://github.com/apache/solr/pull/2493 I'm exploring if there are ways to also shorten the case where partial results are desired, but possibly due to super short CpuTimeAllowed in some tests, I'm occasionally (but not consistently) hitting cases where response builder hasn't done any work yet so various bits (such as requestIds) are null or otherwise uninitialized when we get to handleRequestBody. Exactly why is unclear, but some bits of responsebuilder maybe should be initialized by default. (i.e requestIds can be set to Collections.emptyMap() initially, since dozens of places don't check it for null)

gus-asf avatar Aug 26 '24 14:08 gus-asf

Also, I've decided to presume that the failure to call transform in the ParallelHttpShardHandler was an oversight.

gus-asf avatar Sep 25 '24 16:09 gus-asf

This is now feeling complete, I'll be looking to commit it tomorrow afternoon or perhaps Friday morning to main if nobody objects or requests more time to review.

I've had to back off any changes that mimic https://github.com/apache/solr/pull/2493

I had code for that, but I got lucky and hit a rare exception on one of my test runs, I found that ~1 in 20 times I was getting a CancellationException, and something less than 1 in 100 times (once it ran 800 iterations before failing) I was seeing a test fail where "omitted" was not properly reported with partialResults=false. I think I solved the CancellationException but ran out of time/energy to chase down whatever scenario was causing the rare test fail. Without that code (as the PR is now) I ran the CPU timeout test > 1500 times with no failures, so I'm confident that the problem is not transmitted in this PR, and additional enhancement can be done later. (I'll probably post a branch off of this branch showing what I had there for future reference).

gus-asf avatar Sep 25 '24 20:09 gus-asf

As a sanity check I ran the jmh benchmarks on main (at point last merged in) and this PR branch. Here's the comparison:

https://docs.google.com/spreadsheets/d/1goRvpW1XEuLMrN_1BkB0OGHFYRDLUqo7E8XTCVvp9sU/edit?usp=sharing

Looks like most changes are noise, and the biggest changes are improvements I'm comfortable with this.

gus-asf avatar Sep 30 '24 19:09 gus-asf

Ok I think this is ready, and I think I've answered the questions/feedback so far, so I'm putting it in main, but if issues arise or there's a strong feeling something needs to be reworked let me know, and we can fix it or revert if necessary. I want to move on to SOLR-17151. If things remain quiet until this ticket on Friday, I'll start looking at pushing a version of it to 9x

gus-asf avatar Oct 02 '24 00:10 gus-asf