OpenSearch icon indicating copy to clipboard operation
OpenSearch copied to clipboard

[Feature Request] Support more efficient encoding of filters over many terms

Open msfroh opened this issue 1 year ago • 1 comments

Is your feature request related to a problem? Please describe

I recently read this blog post, where the author claims a 10x speedup on a large terms query.

I believe that part of the improvement comes from the use of doc values to post-filter hits that come from a lead iterator, which is now the OpenSearch (starting with 2.12) thanks to @harshavamsi's changes in https://github.com/opensearch-project/OpenSearch/pull/11209 to support IndexOrDocValuesQuery for all numeric query types. (Behind the scenes, Lucene implements the DV query using a LongHashSet, which I think should perform similarly to RoaringBitmap.)

The more interesting part (IMO) is that the roaring bitmap of numeric terms gets created on the client and sent as a base64-encoded bitset, where it's used as the doc value filter.

Describe the solution you'd like

I'm wondering if we can similarly add some kind of query syntax to more efficiently receive large sets of numbers to match on. Could we apply something similar for large sets of text fields (e.g. supporting prefix-coded terms)?

Even if we don't do it between client and coordinator node, could we precompute a more efficient representation on the coordinator node before passing the query out to the shards?

Related component

Search:Query Capabilities

Describe alternatives you've considered

No response

Additional context

No response

msfroh avatar Feb 15 '24 20:02 msfroh

Of course, if we're concerned about sending massive queries from the client to the server, we can always use the "terms lookup" feature for a terms query.

More broadly, I wonder if there are any other query types where people might send very large sets of data that we could encode more efficiently from the client to the coordinator. Vector search, maybe?

msfroh avatar Feb 15 '24 23:02 msfroh

[Triage - attendees 1 2 3 4 5] @msfroh Thanks for creating this issue; however, it isn't being accepted due to not having a clear outcome - seems more like an RFC would be a clearer starting point. Please feel free to open a new issue after addressing the reason.

peternied avatar Feb 21 '24 16:02 peternied

I added a clearer outcome to the description and assigned the issue to myself.

msfroh avatar May 20 '24 22:05 msfroh

This would be a great enhancement (and differentiator). We, like others, have a use case which does not scale using terms or terms lookup, and joins don't perform well enough either. The ability to quickly test whether an integer record field value is included in a large list of integers (10K..10M in extreme cases) would be extremely valuable and avoid custom plugins. An approach using Roaring Bitmaps with client-side encoded query bitmaps would be even more flexible since one could express arbitrary bitset operations as query constraints and the data would be encoded and transported more efficiently.

ochrist-eis avatar Jun 07 '24 10:06 ochrist-eis

I'm starting to dive into this issue and taking guidance from @msfroh.

bowenlan-amzn avatar Jun 12 '24 17:06 bowenlan-amzn