SOLR-14935: Solr can forward request ( remoteQuery ) even if there are local cores present

Open vthacker opened this issue 5 years ago • 1 comments

Description

When querying using SolrJ today this is what happens -

  • CloudSolrClient resolves the alias, builds a list of all replicas across its collections, shuffles the list, and picks one Solr node to query
  • When Solr receives the request, it only looks at the first collection from the alias and tries to find a local core for it
  • Whenever the node has no local core for that first collection, it makes a remoteQuery, proxying the request to machineX, which does have a local core of the first collection from the alias. This happens even if the node has a local core for one of the other collections in the alias.

Solution

The solution is to look for a local core across all the collections in the list, not just the first one. This is a cheap operation and saves the overhead of an extra network hop.
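A minimal sketch of the idea, not the actual patch: the class and method names below are hypothetical, and the set of locally hosted collections is passed in as a plain `Set<String>` rather than taken from Solr's core container. It contrasts checking only the first collection with checking every collection in the resolved alias.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical illustration only; these names do not exist in Solr.
public class LocalCoreLookup {

    // Behavior described above: only the first collection of the resolved alias is checked.
    public static Optional<String> firstCollectionOnly(List<String> aliasCollections,
                                                       Set<String> localCollections) {
        if (aliasCollections.isEmpty()) {
            return Optional.empty();
        }
        String first = aliasCollections.get(0);
        return localCollections.contains(first) ? Optional.of(first) : Optional.empty();
    }

    // Proposed behavior: walk the whole list and return the first collection
    // that has a core on this node.
    public static Optional<String> anyCollection(List<String> aliasCollections,
                                                 Set<String> localCollections) {
        return aliasCollections.stream()
                .filter(localCollections::contains)
                .findFirst();
    }

    public static void main(String[] args) {
        List<String> alias = List.of("c1", "c2", "c3");
        Set<String> local = Set.of("c3"); // this node only hosts a core of c3

        // Empty result: the request would be proxied to another node.
        System.out.println(firstCollectionOnly(alias, local).isPresent()); // false
        // Non-empty result: the request can be served locally.
        System.out.println(anyCollection(alias, local).isPresent());       // true
    }
}
```

An empty result is the case where a remoteQuery would still be needed, because the node hosts none of the alias's collections.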

Tests

Performance

I was lucky enough to have a cluster which had exact production characteristics but was in dark mode. So we were able to test this change out internally and check its effectiveness.

After a rolling restart of the cluster we saw the CPU idle percentage increase by roughly 10%

image

This improved the end-to-end latency as well. The graph purposely doesn't have the Y axis, but it's a stacked graph that buckets latency (for example, what % of queries fall below 100ms, etc.). So the base green and yellow lines increasing means we are able to serve queries faster.

image

I don't expect the latency improvements to be this drastic for everyone. In our specific case we had 2 things that really improved with this change:

  1. We have ~15 collections in an alias, so roughly 14 out of 15 times a search would end up being a remote query
  2. These remoteQueries were going through PKI authentication and not BasicAuth (we use forwardCredentials=true). In our flame graphs we had seen this take a significant percentage of wall clock time (30%).
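The 14/15 figure in item 1 can be reproduced with a rough estimate. This is a sketch under assumptions that are not stated in the PR: the receiving node hosts a core of exactly one collection in the alias, and each collection is equally likely to be that one. The class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical back-of-the-envelope estimate, not a measurement.
public class RemoteHopEstimate {

    // Fraction of requests proxied when only the first alias collection is checked,
    // assuming the node hosts exactly one of n equally likely collections.
    public static double remoteFraction(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        return (n - 1) / (double) n;
    }

    public static void main(String[] args) {
        // With 15 collections in the alias: 14/15
        System.out.println(String.format(Locale.ROOT, "%.3f", remoteFraction(15))); // 0.933
    }
}
```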

image

When we made the change to use forwardCredentials=true we saw this reduce to:

image

It was when we saw PKIAuth still show up in the flamegraph that we noticed remoteQuery was the reason.

After the rollout of this PR, PKI isn't visible in the flamegraph at all 🎉

image

Checklist

Please review the following and check all that apply:

  • [ ] I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • [ ] I have created a Jira issue and added the issue ID to my pull request title.
  • [ ] I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • [ ] I have developed this patch against the master branch.
  • [ ] I have run ./gradlew check.
  • [ ] I have added tests for my changes.
  • [ ] I have added documentation for the Ref Guide (for Solr changes only).

vthacker avatar Oct 21 '20 00:10 vthacker

@vthacker any update?

LifeIsStrange avatar Nov 19 '21 00:11 LifeIsStrange