[BUG] Total hits count mismatch in Hybrid Query

Open vibrantvarun opened this issue 9 months ago • 1 comments

What is the bug?

The total hit count is wrong when size is given in the search request. For example, when I send a search request with a hybrid query to the URL below

http://localhost:9200/my-nlp-index/_search?search_pipeline=nlp-search-pipeline

I get

{
    "took": 16,
    "timed_out": false,
    "_shards": {
        "total": 3,
        "successful": 3,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 41,
            "relation": "eq"
        },
        "max_score": 0.6666667,
        "hits": [
            {
                "_index": "my-nlp-index",
                "_id": "fgtUoo8BrrBUnpkM6gYl",
                "_score": 0.6666667,
                "_source": {
                    "id": "s55",
                    "stock": 55,

But when I send the same search request with size=1, I get a different hit count

 http://localhost:9200/my-nlp-index/_search?size=1&search_pipeline=nlp-search-pipeline

the response is

{
    "took": 21,
    "timed_out": false,
    "_shards": {
        "total": 3,
        "successful": 3,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 6,
            "relation": "eq"
        },
        "max_score": 0.33333334,
        "hits": [
            {
                                

How can one reproduce the bug?

  1. Add more than 50 documents to an index that has multiple shards.
  2. Run a hybrid query search with size set to any value from 1 to 5. The same mismatch appears.

What is the expected behavior?

Number of total hits should be consistent.

Do you have any additional context?

This bug occurs because we calculate total hits per shard by counting the unique doc IDs in the result here. The query result per shard is based on numHits here. numHits is simply the size sent in the search request, so with size=1 each shard keeps only 1 doc per subquery in its query result. The collector, however, collects the complete set of matching results regardless of the size given in the search request. The size parameter (aka numHits) is only meant to trim the result down to the size the user wants.

Let's try to understand this with an example

I have created my-nlp-index which has 3 shards. I have given 2 subqueries and size=1 in the search request.

Shard 1: Suppose the collector collects the matching results (40 in total). When topDocs is determined from those matching results, its size will be 1 per subquery, so the remaining 39 documents are not added to topDocs. Later, when we calculate total hits, we only count the unique doc IDs on the shard across both subqueries. Suppose that unique doc ID count is 2.

Shard 2: Suppose we get the same count (40) of results on shard 2 as well, and the unique doc ID count is again 2.

Therefore, with the current workflow the search response reports 4 as the total hits, but the actual total is 80.
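
The arithmetic in this example can be reproduced with a small simulation. This is only a sketch of the counting logic as described above, not the neural-search implementation: the function names (`top_docs`, `shard_total_hits`, `build_shard`), the doc ID ranges, and the scores are all made up for illustration, chosen so that each shard has 40 matching docs and the two subqueries have different top-1 docs.

```python
def top_docs(matches, size):
    """Keep the `size` best-scoring (doc_id, score) pairs of one subquery."""
    return sorted(matches, key=lambda m: m[1], reverse=True)[:size]


def shard_total_hits(subquery_matches, size):
    """Return (reported, actual) total hits for a single shard.

    reported: unique doc IDs counted AFTER each subquery is cut to `size`
              (the behaviour described in this issue).
    actual:   unique doc IDs across everything the collector matched.
    """
    truncated_ids = set()
    collected_ids = set()
    for matches in subquery_matches:
        truncated_ids.update(doc_id for doc_id, _ in top_docs(matches, size))
        collected_ids.update(doc_id for doc_id, _ in matches)
    return len(truncated_ids), len(collected_ids)


def build_shard():
    """Hypothetical shard: 2 subqueries that together match 40 unique docs."""
    # Subquery A matches docs 0..29, best score on doc 0.
    subquery_a = [(doc_id, 1.0 - doc_id / 100) for doc_id in range(0, 30)]
    # Subquery B matches docs 10..39, best score on doc 39.
    subquery_b = [(doc_id, doc_id / 100) for doc_id in range(10, 40)]
    return [subquery_a, subquery_b]


size = 1
shards = [build_shard(), build_shard()]  # the two shards from the example

per_shard = [shard_total_hits(shard, size) for shard in shards]
reported = sum(r for r, _ in per_shard)
actual = sum(a for _, a in per_shard)

print(f"reported total hits: {reported}")  # 4
print(f"actual total hits:   {actual}")    # 80
```

Counting unique doc IDs only after truncation makes the reported total depend on size, which is why the two requests above return different totals for the same query.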

vibrantvarun · May 23 '24 05:05