[FEATURE] Implement pagination for Hybrid Search

Open martin-gaievski opened this issue 2 years ago • 21 comments

Is your feature request related to a problem?

The current implementation of Hybrid search doesn't support pagination, meaning all results are returned "at once". Pagination is standard for many queries in OpenSearch, and it's expected that Hybrid Search supports it: https://opensearch.org/docs/latest/search-plugins/searching-data/paginate/.

What solution would you like?

It should be possible to define "from" and "size" as part of the Hybrid query, and the response should contain "size" records, starting from the "from" position. Standard syntax should be fine in this case:

{
   "from": 20,
   "size": 10,
   "query": {
       "hybrid": [
           {},// First Query
           {} // Second Query
           ..... // Other Queries
       ] 
   }
}
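
As a rough illustration of the request shape proposed above, the sketch below maps a 1-based page number to "from" and "size". The helper name `build_hybrid_page_request` is hypothetical, and wrapping the sub-queries in a "queries" array follows the documented hybrid query syntax rather than the abbreviated form shown in this issue.

```python
def build_hybrid_page_request(sub_queries, page, per_page):
    """Build a search body for one page of a hybrid query (hypothetical helper).

    page is 1-based; "from" is the zero-based offset of the first hit to return.
    """
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be >= 1")
    return {
        "from": (page - 1) * per_page,
        "size": per_page,
        # Assumption: sub-queries go under "queries", as in the documented syntax.
        "query": {"hybrid": {"queries": list(sub_queries)}},
    }


# Page 3 with 10 hits per page corresponds to "from": 20, "size": 10.
body = build_hybrid_page_request(
    [{"match": {"text": "hurricane"}}, {"match_all": {}}], page=3, per_page=10
)
```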

What alternatives have you considered?

It's possible to set a higher size and then throw away the first X elements, but that requires extra processing logic on the client side and is not very optimal.
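
A minimal sketch of that client-side alternative: request "from + size" hits in one call, then discard the leading hits. The function names are illustrative, and the 10,000 cap assumes the default `index.max_result_window` setting.

```python
MAX_RESULT_WINDOW = 10_000  # assumption: default index.max_result_window


def overfetch_size(from_, size, limit=MAX_RESULT_WINDOW):
    """Return the "size" to request so that the wanted page is included."""
    if from_ < 0 or size < 0:
        raise ValueError("from_ and size must be non-negative")
    needed = from_ + size
    if needed > limit:
        raise ValueError(f"from + size = {needed} exceeds the result window {limit}")
    return needed


def slice_page(hits, from_, size):
    """Drop the first from_ hits and keep at most size of the rest."""
    return hits[from_:from_ + size]
```

Every page after the first transfers and discards hits the client never shows, which is the overhead the issue describes.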

martin-gaievski avatar Sep 04 '23 22:09 martin-gaievski

When can we expect this in OpenSearch?

ankitas3 avatar Apr 01 '24 08:04 ankitas3

@martin-gaievski @ankitas3 Just double checking, but from and size already seem to be working for me with Hybrid search while using OpenSearch 2.11. Has it been implemented since this issue was raised?

jackh-ncl avatar Apr 18 '24 10:04 jackh-ncl

@jackh-ncl it has not been implemented yet. With the existing code it may work, but not for every scenario, and when it does work it is not optimal.

martin-gaievski avatar Apr 18 '24 16:04 martin-gaievski

Aha thank you for confirming, yeah I have now noticed that I seem to consistently end up with a total hits value of 5 * the size parameter with the query I'm running.

jackh-ncl avatar Apr 18 '24 18:04 jackh-ncl

Any chance we can get this in 2.14 or the version after that?

benmcginnis avatar Apr 26 '24 18:04 benmcginnis

Is anyone from OpenSearch able to provide any broad details around when pagination will be made available?

brandon-carag avatar May 07 '24 17:05 brandon-carag

+1 for this feature request

qmauret avatar May 20 '24 12:05 qmauret

@martin-gaievski Can you please clarify why it doesn't work optimally?

@jackh-ncl it has not been implemented yet. With the existing code it may work, but not for every scenario, and when it does work it is not optimal.

mkerimyilmaz avatar May 30 '24 06:05 mkerimyilmaz

@martin-gaievski Any idea on when this feature will be available? Thanks

JPSoteloSilva avatar Jun 04 '24 11:06 JPSoteloSilva

Hi @vamshin and @martin-gaievski,

Just to briefly restate the case here, pagination is a critical feature for many users adopting hybrid search. There has been significant community interest in this thread (at least 20 distinct users), which also points to strong user demand.

I'm sure there are several competing dev priorities here, but this seems like a core item that shouldn't fall by the wayside. At the very least, can we get a very rough approximate timeline here? I suspect there are many people on this thread whose downstream roadmaps are reliant on what happens here.

Thanks in advance

brandon-carag avatar Jul 16 '24 17:07 brandon-carag

@brandon-carag Thank you for your interest in pagination functionality. We understand the importance of this feature, and it's on our roadmap. However, at the moment, we're focusing our efforts on enhancing the sorting and explain (raw scores for debugging) capabilities.

While we don't have a definite timeline for implementing pagination, we estimate it could be available towards the end of the year. Please note that this is a rough estimate, and the actual timeline may vary depending on our priorities and resource availability.

We're always open to collaboration and would be delighted if someone from the community is willing to contribute to this feature. If you or anyone else is interested in contributing, please feel free to reach out to us. We can provide guidance and support to ensure a smooth integration of the pagination functionality.

As a workaround, is it possible to use a "size" parameter with a large value? I think we can get up to 10K results.

vamshin avatar Jul 16 '24 19:07 vamshin

@vamshin Thanks for the prompt response. The rough timeline you mentioned is useful to know. I believe Martin mentioned above: "it has not been implemented yet. With the existing code it may work, but not for every scenario, and when it does work it is not optimal." I'm not exactly clear on where the pagination logic breaks down in the existing implementation.

Even for a relatively small index, specifying a large size and processing client-side doesn't seem like it will scale, particularly if OpenSearch is being used to service API requests and apply pagination. As such, I'm hoping this feature can emerge sooner rather than later.

brandon-carag avatar Jul 16 '24 19:07 brandon-carag

Has anyone tried the scroll API as an alternative? Does it work?

sonic182 avatar Aug 22 '24 09:08 sonic182

My understanding is that the scroll API won't solve this issue. Documentation states, "Because search contexts consume a lot of memory, we suggest you don’t use the scroll operation for frequent user queries. Instead, use the sort parameter with the search_after parameter to scroll responses for user queries." https://opensearch.org/docs/latest/api-reference/scroll/

brandon-carag avatar Aug 22 '24 17:08 brandon-carag

While we wait for pagination support in Hybrid queries, has anyone found some other way (even if it takes 2+ separate queries) to normalize results of neural+lexical query combo?

Really struggling to figure out the best way to combine semantic with lexical search, as the scores of neural queries are in [0..1] and don't influence the results much, if at all.

If someone could point me in the right direction I would really appreciate it. Thank you
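
One client-side option, sketched here under stated assumptions, is to run the lexical and neural queries separately, min-max normalize each score list into [0, 1], and merge them with a weighted sum. This is only an illustration of the general technique, not the plugin's implementation; the function names, the weights, and the choice to let a document missing from one list contribute 0 for that list are all assumptions.

```python
from collections import defaultdict


def min_max_normalize(scores):
    """Scale a {doc_id: score} mapping into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal, so treat every hit as a top hit.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def combine(result_sets, weights):
    """Weighted sum of normalized scores, sorted best first."""
    combined = defaultdict(float)
    for scores, weight in zip(result_sets, weights):
        for doc_id, s in min_max_normalize(scores).items():
            combined[doc_id] += weight * s
    # Sort by descending score; break ties by doc id for a stable order.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))


lexical = {"a": 12.0, "b": 6.0, "c": 3.0}  # unbounded BM25-style scores
neural = {"b": 0.9, "c": 0.8, "d": 0.5}    # scores already in [0..1]
ranking = combine([lexical, neural], weights=[0.5, 0.5])
```

Normalizing both lists first keeps the lexical scores from dominating simply because their raw values are larger.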

Romasato avatar Sep 06 '24 17:09 Romasato

Is there any update on this feature? It’s currently listed in the 'Upcoming release (TBD)' section, but I’d like to know which version it will be released under.

safakkbilici avatar Sep 24 '24 14:09 safakkbilici

@brandon-carag

Even for a relatively small index, specifying a large size and processing client-side doesn't seem like it will scale, particularly if OpenSearch is being used to service API requests and apply pagination. As such, I'm hoping this feature can emerge sooner rather than later.

Could you share some info on the number of documents per page, the maximum number of pages, a sample query, and the acceptable latency?

When you say that specifying a large size and processing client-side won't scale, is the issue on the OpenSearch side or the client side?

Could you do something like this: 1. retrieve only the document IDs and cache them on the client side, 2. get the documents for each page using the doc IDs?
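
The two-step idea above could look roughly like the following sketch. It assumes the ranked IDs from the hybrid query are already cached by the client, uses the standard `ids` query to fetch one page, and restores the cached order afterwards because an `ids` lookup does not preserve the hybrid ranking. All helper names are hypothetical.

```python
def ids_for_page(ranked_ids, page, per_page):
    """Pick the cached doc IDs that belong to a 1-based page."""
    start = (page - 1) * per_page
    return ranked_ids[start:start + per_page]


def build_ids_request(page_ids):
    """Search body that fetches exactly the documents of one page."""
    return {"size": len(page_ids), "query": {"ids": {"values": list(page_ids)}}}


def reorder(hits, page_ids):
    """Put fetched hits back into the cached ranking order."""
    by_id = {hit["_id"]: hit for hit in hits}
    # IDs deleted since the cache was filled are skipped.
    return [by_id[doc_id] for doc_id in page_ids if doc_id in by_id]
```

The cache itself (storage, expiry, invalidation on reindex) is left out and would be the main operational cost of this approach.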

heemin32 avatar Oct 09 '24 19:10 heemin32

Heemin, we appreciate you reaching out for our feedback. We use OpenSearch to power both our web UI and API responses. We limit the max results to 10k documents with a max per page size of 2k. However, our default response size is 20 results with the ability to paginate up to 50 pages (page size is not a requirement). This allows us to keep the typical query responsive while supporting use cases that need larger data sets.

We'd like to move from our current use of the standard lexical BM25 search to hybrid search in order to improve the quality of our responses. Without pagination, this transition is more difficult as it would require our API users to make changes to their requests in order to maintain the same behavior.

While we could implement a cache and retrieval mechanism outside of OpenSearch as you mentioned, it would add a lot of infrastructure overhead as we serve a very diverse set of queries with varying response sizes and a long tail.

Our document corpus is roughly 1 million documents ranging from 1 to ~400 pages. Large response size/pagination latency is less of a concern as these are generally automated API requests in our system. Additionally new documents are only indexed a few times a day so we are also less affected by inconsistent ordering due to index changes (we are aware the OpenSearch docs warn about results not being frozen in time).

A typical query might look like:
UI generated: https://www.federalregister.gov/api/v1/documents.json?per_page=20&conditions%5Bterm%5D=hurricane
API: https://www.federalregister.gov/api/v1/documents.json?per_page=1000&conditions%5Bterm%5D=hurricane

We're happy to answer any additional questions as needed. Thanks!

brandon-carag avatar Oct 09 '24 23:10 brandon-carag

@brandon-carag Thanks for the reply. Just to confirm: when you say latency is less of a concern, do you mean that even if the latency of the first page is the same as the latency of the last page, that is okay for your use case?

heemin32 avatar Oct 10 '24 00:10 heemin32

@heemin32 Sure. Could you give us a very rough sense of what the latency difference might be for BM25-based pagination vs. hybrid pagination? Additionally, it would be helpful if you could clarify whether that would mean a substantial increase in latency for the first page of results for queries returning larger sets of results (e.g. > 10k). If the overall latency increase wasn't substantial, that would probably be fine for our use case. Thanks!

brandon-carag avatar Oct 10 '24 00:10 brandon-carag

@brandon-carag Regardless of which page it is, the latency could be the same as loading the last page. For example, with your BM25-based pagination, the latency for the first page could be the same as the latency for the last page.

heemin32 avatar Oct 10 '24 00:10 heemin32

Please check out the RFC on pagination: 933

vibrantvarun avatar Oct 15 '24 04:10 vibrantvarun