SOLR-14935: Solr can forward a request (remoteQuery) even if local cores are present
Description
When querying through an alias with SolrJ today, this is what happens:
- CloudSolrClient resolves the alias, builds a list of all replicas of its collections, shuffles the list, and picks one Solr node to query.
- When Solr receives the request, it looks only at the first collection in the alias and tries to find a local core for it.
- Whenever the node has no local core for that first collection, it makes a remoteQuery, proxying the request to another node (machineX) that has a local core of the first collection, even if the receiving node hosts cores of other collections in the alias.
Solution
The solution is to look for a local core across all the collections in the list, not just the first one. This is a cheap operation and saves the overhead of an extra network hop.
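The difference between the two lookups can be sketched as follows. This is a minimal illustration, not the code in this PR: the class name, method names, and the map from collection name to local core name are all hypothetical stand-ins for the node's real core registry.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class LocalCoreLookup {

    // Behavior described above: only the first collection of the alias is checked.
    // localCores is a hypothetical map of collection name -> core hosted on this node.
    static Optional<String> findLocalCoreFirstOnly(List<String> collections,
                                                   Map<String, String> localCores) {
        if (collections.isEmpty()) {
            return Optional.empty();
        }
        return Optional.ofNullable(localCores.get(collections.get(0)));
    }

    // Proposed behavior: check every collection of the alias, in order,
    // and return the first local core found.
    static Optional<String> findLocalCoreAny(List<String> collections,
                                             Map<String, String> localCores) {
        for (String collection : collections) {
            String core = localCores.get(collection);
            if (core != null) {
                return Optional.of(core);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical alias with three collections; this node hosts only the third.
        List<String> alias = List.of("c1", "c2", "c3");
        Map<String, String> localCores = Map.of("c3", "c3_shard1_replica_n1");

        // An empty result stands for "fall back to remoteQuery".
        System.out.println(findLocalCoreFirstOnly(alias, localCores).isPresent());
        System.out.println(findLocalCoreAny(alias, localCores).isPresent());
    }
}
```

In this sketch the first lookup comes back empty, which is the case that triggers a remoteQuery, while the second finds the local core with at most one map lookup per collection in the alias.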
Tests
Performance
I was lucky enough to have a cluster that had exact production characteristics but was in dark mode, so we were able to test this change internally and check its effectiveness.
After a rolling restart of the cluster, we saw the CPU idle percentage increase by roughly 10%.

This improved the end-to-end latency as well. The graph purposely omits the Y axis, but it is a stacked graph that buckets latency (for example, what percentage of queries fall below 100ms). The base green and yellow lines increasing means we are able to serve queries faster.

I don't expect the latency improvements to be this drastic for everyone. In our specific case, two things really improved with this change:
- We have ~15 collections in an alias, so roughly 14 out of 15 searches would end up being a remote query.
- These remoteQueries were going through PKI authentication and not BasicAuth (we use forwardCredentials=true). In our flame graphs we had seen PKI authentication take a significant percentage of wall clock time (30%).
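The 14-out-of-15 figure can be reproduced with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that each node hosts a core of exactly one collection in the alias and that the shuffled replica list makes every collection equally likely to be the one hosted by the chosen node. Real clusters may not match either assumption, and the class and method names are hypothetical.

```java
import java.util.Locale;

public class RemoteQueryOdds {

    // Fraction of requests that land on a node without a core of the FIRST
    // collection in the alias, under the uniform assumptions stated above.
    static double remoteFraction(int collectionsInAlias) {
        if (collectionsInAlias < 1) {
            throw new IllegalArgumentException("alias must contain at least one collection");
        }
        return (collectionsInAlias - 1) / (double) collectionsInAlias;
    }

    public static void main(String[] args) {
        // ~15 collections in the alias, as in the cluster described above.
        double fraction = remoteFraction(15);
        System.out.println(String.format(Locale.ROOT, "%.1f%%", fraction * 100)); // prints 93.3%
    }
}
```

Under these assumptions only about 1 in 15 requests avoided the extra hop before the change, which is why the effect was so visible in this cluster.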

When we made the change to use forwardCredentials=true we saw this reduce to

It was when we saw PKIAuth still show up in the flamegraph that we noticed remoteQuery was the reason.
After the rollout of this PR, PKI isn't visible in the flamegraph at all 🎉

Checklist
Please review the following and check all that apply:
- [ ] I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
- [ ] I have created a Jira issue and added the issue ID to my pull request title.
- [ ] I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
- [ ] I have developed this patch against the master branch.
- [ ] I have run ./gradlew check.
- [ ] I have added tests for my changes.
- [ ] I have added documentation for the Ref Guide (for Solr changes only).
@vthacker any update?