[FEATURE] Hybrid request does not return inner_hits for nested objects.

Open Kovsonq opened this issue 1 year ago • 17 comments

Is your feature request related to a problem?

Yes, I'm experiencing a problem when I use the hybrid search plugin in OpenSearch v2.11.0. Specifically, when I include the "inner_hits" parameter in my query for nested objects, I do not receive any inner hits in the response. This is causing frustration as my system requires this level of detail for optimal operation.

What solution would you like?

I would like the hybrid search plugin to be updated to include the functionality to correctly return inner hits from nested queries. Ideally, this would function seamlessly as it does in standard OpenSearch queries. This improvement would allow me and other users to fully utilize the power of the hybrid search plugin.

Kovsonq avatar Apr 30 '24 10:04 Kovsonq

Can you please share more details for us to understand your request better: index mapping, query example, expected response?

martin-gaievski avatar May 01 '24 00:05 martin-gaievski

I removed the vector values; do you need them as well?

Index mapping :

{
  "mappings": {
    "properties": {
      "chunks": {
        "type": "nested",
        "properties": {
          "embedding": {
            "type": "knn_vector",
            "dimension": 1536,
            "method": {
              "name": "hnsw",
              "space_type": "cosinesimil",
              "engine": "nmslib",
              "parameters": {
                "ef_construction": 128,
                "m": 24
              }
            }
          },
          "payload": {
            "index": "true",
            "norms": "false",
            "store": "true",
            "type": "text"
          },
          "length": {
            "type": "integer"
          },
          "id": {
            "type": "text"
          }
        }
      }
    }
  },
  "settings": {
    "index": {
      "knn": true,
      "number_of_shards": 5,
      "number_of_replicas": 1
    }
  }
}

Document example:

{
    "chunks": [
        {
            "id": 1,
            "length": 173,
            "payload": "Text 1 example",
            "tokens": 256,
            "embedding": [...]
        },
        {
            "id": 2,
            "length": 173,
            "payload": "Text 2 example",
            "tokens": 256,
            "embedding": [...]
        },
        {
            "id": 3,
            "length": 173,
            "payload": "Text 3 example",
            "tokens": 256,
            "embedding": [...]
        }
    ]
}

request:

{
    "_source": false,
    "query": {
        "hybrid": {
            "queries": [
                {
                    "nested": {
                        "path": "chunks",
                        "query": {
                            "knn": {
                                "chunks.embedding": {
                                    "vector": [...],
                                    "k": 10
                                }
                            }
                        },
                        "inner_hits": {
                            "size": 10,
                            "_source": {
                                "includes": [
                                    "chunks.payload",
                                    "chunks.id"
                                ]
                            }
                        }
                    }
                },
                {
                    "bool": {
                        "must": [
                            {
                                "nested": {
                                    "path": "chunks",
                                    "query": {
                                        "simple_query_string": {
                                            "query": "*",
                                            "fields": [
                                                "chunks.payload"
                                            ],
                                            "default_operator": "and"
                                        }
                                    }
                                }
                            }
                        ]
                    }
                }
            ]
        }
    }
}

response:

{
    "took": 18,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": "index_name",
                "_id": "doc_id_1",
                "_score": 1.0
            }
        ]
    }
}

expected response:

{
    "took": 17,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1,
        "hits": [
            {
                "_index": "index_name",
                "_id": "doc_id_1",
                "_score": 1,
                "inner_hits": {
                    "chunks": {
                        "hits": {
                            "total": {
                                "value": 3,
                                "relation": "eq"
                            },
                            "max_score": 0.7954481,
                            "hits": [
                                {
                                    "_index": "index_name",
                                    "_id": "doc_id_1",
                                    "_nested": {
                                        "field": "chunks",
                                        "offset": 0
                                    },
                                    "_score": 0.7954481,
                                    "_source": {
                                        "payload": "Text 1 example",
                                        "id": 1
                                    }
                                },
                                {
                                    "_index": "index_name",
                                    "_id": "doc_id_1",
                                    "_nested": {
                                        "field": "chunks",
                                        "offset": 1
                                    },
                                    "_score": 0.7949572,
                                    "_source": {
                                        "payload": "Text 2 example",
                                        "id": 2
                                    }
                                },
                                {
                                    "_index": "index_name",
                                    "_id": "doc_id_1",
                                    "_nested": {
                                        "field": "chunks",
                                        "offset": 2
                                    },
                                    "_score": 0.75225127,
                                    "_source": {
                                        "payload": "Text 3 example",
                                        "id": 3
                                    }
                                }
                            ]
                        }
                    }
                }
            }
        ]
    }
}

Kovsonq avatar May 01 '24 09:05 Kovsonq

This issue is also biting me.

We have a nested property that stores attachments on a document. We use inner_hits today to show when the query matched one of the attachments. However, in trying to implement a hybrid search that combines a simple_query_string with a neural_sparse search, we're losing the inner_hits, which means we cannot identify when a match came from our nested search.

dswitzer avatar May 06 '24 20:05 dswitzer

@dswitzer can we try two text queries with hybrid search and see whether inner hits are returned? The reason I ask is that improvements related to nested fields with vectors are being made for vector search in versions 2.12 and 2.13. Ref: https://github.com/opensearch-project/k-NN/issues/1447 Ref: https://github.com/opensearch-project/k-NN/issues/1065

navneet1v avatar May 14 '24 18:05 navneet1v

@navneet1v The issue persists even if the query contains only non-vector fields. The problem with hybrid search and inner_hits is that the innerHit result is not generated at all.

heemin32 avatar May 17 '24 23:05 heemin32

@heemin32 thanks for confirming. Can you please share an example in this issue showing what you tested and how?

navneet1v avatar May 18 '24 00:05 navneet1v

Create Index

PUT /my-hybrid
{
  "mappings": {
    "properties": {
      "chunks": {
        "type": "nested",
        "properties": {
          "embedding": {
            "type": "knn_vector",
            "dimension": 3,
            "method": {
              "name": "hnsw",
              "space_type": "cosinesimil",
              "engine": "nmslib",
              "parameters": {
                "ef_construction": 128,
                "m": 24
              }
            }
          },
          "payload": {
            "index": "true",
            "norms": "false",
            "store": "true",
            "type": "text"
          },
          "length": {
            "type": "integer"
          },
          "id": {
            "type": "text"
          }
        }
      }
    }
  },
  "settings": {
    "index": {
      "knn": true
    }
  }
}

Add doc

PUT /_bulk?refresh=true
{ "index": { "_index": "my-hybrid", "_id": "1" } }
{ "chunks": [{"id": 1, "length": 173, "payload": "Text 1 example", "tokens": 256, "embedding": [1, 1, 1]}, {"id": 2, "length": 173, "payload": "Text 2 example", "tokens": 256, "embedding": [2, 2, 2]},{"id": 3,"length": 173,"payload": "Text 3 example","tokens": 256,"embedding": [3, 3, 3]}]}

Query

GET /my-hybrid/_search
{
  "_source": false,
  "query": {
    "hybrid": {
      "queries": [
        {
          "nested": {
            "path": "chunks",
            "query": {
              "simple_query_string": {
                "query": "*",
                "fields": [
                  "chunks.payload"
                ],
                "default_operator": "and"
              }
            },
            "inner_hits": {
              "size": 10,
              "_source": {
                "includes": [
                  "chunks.payload",
                  "chunks.id"
                ]
              }
            }
          }
        }
      ]
    }
  }
}

Response

The innerHit field is expected in the result, but no innerHit appears.

{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "my-hybrid",
        "_id": "1",
        "_score": -9549512000
      },
      {
        "_index": "my-hybrid",
        "_id": "1",
        "_score": -4422440400
      },
      {
        "_index": "my-hybrid",
        "_id": "1",
        "_score": 1
      },
      {
        "_index": "my-hybrid",
        "_id": "1",
        "_score": -9549512000
      }
    ]
  }
}

heemin32 avatar May 20 '24 17:05 heemin32

@Kovsonq @dswitzer what is the main use case for those inner hits returned in the result? How critical is the score information for that use case?

I spent some time checking what can be done for inner hits and our limitations. We can include an inner hits section in the response, similar to what's done for other queries in OpenSearch. The only limitation I'm seeing is with the scores. Inner hits have their own logic for retrieving scores; at a high level, they run a light version of the search again during the Fetch phase. At this point, the score normalization process for the hybrid query has been completed, and scores are updated in the query result section of the response. Scores added for inner hits will not be normalized but will be in raw form and scale. This means that, depending on the query, scores can be unbounded and will not correlate with the main hits in the query results (as those are normalized).
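To make the scale mismatch concrete, here is a minimal Python sketch. It assumes min-max normalization for the main hits; the function name and all score values are hypothetical and chosen only for illustration, not taken from the plugin.

```python
def min_max_normalize(scores):
    """Scale a list of raw scores into [0, 1] using min-max normalization."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores are equal; mapping them to 1.0 is an assumption of this sketch.
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


# Hypothetical raw scores of three parent documents from one sub-query.
raw_parent_scores = [12.4, 7.1, 3.3]

# Main hits are normalized before they are returned, so they land in [0, 1].
normalized_parent_scores = min_max_normalize(raw_parent_scores)

# Hypothetical inner hit scores. They are computed during the fetch phase,
# after normalization has finished, so they stay in raw form.
raw_inner_hit_scores = [9.8, 4.2]

print("main hits :", normalized_parent_scores)
print("inner hits:", raw_inner_hit_scores)
```

The two printed lists are on different scales, so an inner hit score cannot be compared directly with the `_score` of its parent hit.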

martin-gaievski avatar May 30 '24 17:05 martin-gaievski

@martin-gaievski,

My primary use case is to just be able to highlight the matching terms. The score of the inner hits does not matter much to me, because I'm just using it to highlight keyword matches.

dswitzer avatar May 30 '24 18:05 dswitzer

@martin-gaievski,

The primary use case for inner_hits in OpenSearch is to retrieve detailed matching information from nested objects within documents. This is particularly useful in scenarios where documents have complex structures with nested fields, and there is a need to understand which specific parts of these documents match the query criteria.

In the context of nested objects, score information for inner hits is important because it allows users to identify the most relevant chunks or sub-documents within a larger document. When a hybrid search is performed, having access to the scores of inner hits enables users to rank and prioritize these nested sections effectively.

Scenario: we need to return the top 20 most relevant nested documents (not parent documents) for the query.
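As a rough illustration of that scenario, the Python sketch below flattens inner hits from a search response and keeps the best-scoring nested documents. It assumes a response shaped like the expected response earlier in this thread, with inner hits keyed by the nested path `chunks`; the helper name `top_nested_hits` is hypothetical. It only works if the inner hit scores are comparable across parent documents.

```python
def top_nested_hits(response, inner_hits_name, limit=20):
    """Collect inner hits from every parent hit and return the best ones by score."""
    nested = []
    for parent in response.get("hits", {}).get("hits", []):
        inner = parent.get("inner_hits", {}).get(inner_hits_name, {})
        for hit in inner.get("hits", {}).get("hits", []):
            nested.append({
                "parent_id": parent["_id"],
                "offset": hit["_nested"]["offset"],
                "score": hit["_score"],
                "source": hit.get("_source", {}),
            })
    # Highest score first; ties keep their original order.
    nested.sort(key=lambda h: h["score"], reverse=True)
    return nested[:limit]


# Trimmed-down response using the values from the expected response above.
response = {
    "hits": {
        "hits": [
            {
                "_id": "doc_id_1",
                "inner_hits": {
                    "chunks": {
                        "hits": {
                            "hits": [
                                {"_nested": {"field": "chunks", "offset": 0},
                                 "_score": 0.7954481,
                                 "_source": {"payload": "Text 1 example", "id": 1}},
                                {"_nested": {"field": "chunks", "offset": 1},
                                 "_score": 0.7949572,
                                 "_source": {"payload": "Text 2 example", "id": 2}},
                                {"_nested": {"field": "chunks", "offset": 2},
                                 "_score": 0.75225127,
                                 "_source": {"payload": "Text 3 example", "id": 3}},
                            ]
                        }
                    }
                },
            }
        ]
    }
}

top_two = top_nested_hits(response, "chunks", limit=2)
```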

Kovsonq avatar May 31 '24 08:05 Kovsonq

@Kovsonq I still don't fully understand why you need normalized scores in the final list of results. If we enable inner_hits without normalized scores, the relative order of child documents will still be present in the final result list. Because inner_hits is passed at the sub-query level, those hits for child documents will be local to that sub-query anyway, not global to the whole hybrid query. If you need to retrieve information about child documents with normalized scores, then I feel those child documents should be modeled as top-level (parent) documents.

martin-gaievski avatar Jun 02 '24 01:06 martin-gaievski

After a deep dive into this request, I conclude that we need more time and some additional mechanisms (most likely including changes in core OpenSearch) to implement this feature correctly. The simplistic approach, where inner hits are given per sub-query, doesn't work and may produce false positives. Example scenario:

  • The hybrid query has two sub-queries: one text match query and one neural query. The user specifies inner hits for the match query.
  • One document has a low score in the match query, say position 12. At the same time, the same document has a much better score in the neural query: something like 0.95, position 2.
  • After normalization, the final position of that document is 3.
  • Inner hits for the document will contain information collected for the match query.

As a result, the user may get the false impression that the document's high final position is due to hits in the match query, when in reality the neural query contributed the most. In other words, we need an inner hits concept at the level of the whole hybrid query, not at the level of a sub-query.
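The scenario can be sketched in a few lines of Python. This is only an illustration: it assumes an arithmetic-mean combination of already normalized sub-query scores, the function names are hypothetical, and the match score of 0.10 is made up (only the 0.95 neural score comes from the scenario).

```python
def combine(normalized, weights=None):
    """Weighted arithmetic mean of per-sub-query normalized scores."""
    names = list(normalized)
    if weights is None:
        # Equal weights are an assumption of this sketch.
        weights = {name: 1.0 for name in names}
    total_weight = sum(weights[name] for name in names)
    return sum(weights[name] * normalized[name] for name in names) / total_weight


def dominant_sub_query(normalized):
    """Name of the sub-query with the highest normalized score for a document."""
    return max(normalized, key=normalized.get)


# One document: weak in the match sub-query, strong in the neural sub-query.
doc_scores = {"match": 0.10, "neural": 0.95}

combined = combine(doc_scores)
top = dominant_sub_query(doc_scores)

print(f"combined score: {combined:.3f}, driven mostly by: {top}")
```

Inner hits requested on the match sub-query would be attached to this document even though the neural sub-query accounts for most of its combined score.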

I've created an issue in core OpenSearch for possible extension mechanisms: https://github.com/opensearch-project/OpenSearch/issues/14546

martin-gaievski avatar Jun 28 '24 16:06 martin-gaievski

I'm also trying to do the same. It also seems that normalization isn't being applied correctly for hybrid search on nested fields. I've checked normalizing using all of the values of the nested field, using the highest value of the nested field for each doc, and using the sum of the values of the nested field. The normalization just doesn't come out correctly.

For context, my use case is to run hybrid search on chunks of documents, and ideally I wouldn't need to create a new document in OpenSearch for every chunk that I want to index.

I believe this is a common use case; it would be super AMAZING if we could get this support!

yuhongsun96 avatar Jul 07 '24 20:07 yuhongsun96

Is there any blocking issue to support this feature? cc: @martin-gaievski @vibrantvarun

yuye-aws avatar Sep 09 '24 09:09 yuye-aws

@yuye-aws yes, there are fundamental blockers for inner hits. The process is split into two parts: the first runs at the shard level and doesn't have access to normalized scores or the combined order of documents; the second is in the fetch phase and is also at the shard level. The second part has the additional problem that the query and fetch phases do not communicate with each other directly.

martin-gaievski avatar Sep 09 '24 15:09 martin-gaievski

@martin-gaievski Thanks for your prompt reply. Although I do not have much context on inner hits and the hybrid query, it really seems to be a tricky problem to resolve. Is there any existing material I could read to learn more? (For example, PR https://github.com/opensearch-project/neural-search/pull/776)

yuye-aws avatar Sep 10 '24 02:09 yuye-aws

Still really excited to have this support! We're waiting for this before switching over to OpenSearch. It has everything else we need, but hacking around this by creating our own implementation using just the top-level docs is too messy.

yuhongsun96 avatar Sep 10 '24 03:09 yuhongsun96

Closing this issue as the PR is merged.

vibrantvarun avatar Apr 07 '25 23:04 vibrantvarun