OpenSearch
OpenSearch copied to clipboard
[Feature Request] Provide capability for not adding top docs collector in the query search path
Is your feature request related to a problem? Please describe
Right now as part of search code path both Default and Concurrent QueryPhaseSearchers are adding TopDocsCollector as first collector in the chain (code ref1 DefaultQueryPhaseSearcher, code ref2 ConcurrentQueryPhaseSearcher).
In case of certain types of extensions that may lead to a performance degradation of search. In particular that's true if custom collector and collector manager are added and those collectors are used in a way that scores are collected with a custom logic. In such case some compute (like filtering) will be done twice and depending on the extension logic that can be a thrown away work.
Example of such extension is hybrid query. Please check collector manager and collector added in neural-search plugin.
Describe the solution you'd like
I suggest core add an interface or method in existing QueryPhaseSearcher interface (or public core class QueryPhaseSearcherWrapper) that allows extension or plugin to set a flag and depending on that flag the top docs collector will be added or skipped. Default implementation should allow adding of topdocscollector so system is backward compatible.
Related component
Search
Describe alternatives you've considered
Right now there isn't much of alternatives, custom collector is executed after the top docs collector. This leads to performance degradation of search queries, that has been submitted to us by multiple customers.
Additional context
Custom query collector and collector manager were added as part of the neural-search plugin in 2.13 release (original PR)
@martin-gaievski I don't think we need to modify QueryPhaseSearcher
but if needed you could (and probably should) provide your own implementation of QueryPhaseSearcher::searchWithCollector
@reta if we modify the QueryPhaseSearcher::searchWithCollector
then we will miss out on the optimizations/other functionalities happening in the SearchWithCollector functions or any new functionality that is getting added there.
I suggest core add an interface or method in existing QueryPhaseSearcher interface (or public core class QueryPhaseSearcherWrapper) that allows extension or plugin to set a flag and depending on that flag the top docs collector will be added or skipped. Default implementation should allow adding of topdocscollector so system is backward compatible.
On this I would suggest we can have a function returns the topDocsCollector instance, rather than a flag which says topdocscollector should be added or not.
On this I would suggest we can have a function returns the topDocsCollector instance, rather than a flag which says topdocscollector should be added or not.
@navneet1v we may have an overloaded version:
protected boolean searchWithCollector(
QueryCollectorContext queryCollectorContext,
SearchContext searchContext,
ContextIndexSearcher searcher,
Query query,
LinkedList<QueryCollectorContext> collectors,
boolean hasFilterCollector,
boolean hasTimeout
) throws IOException {
...
}
The current searchWithCollector(...)
would delegate to new searchWithCollector(...)
with TopDocsCollector
instance.
@navneet1v we may have an overloaded version:
@reta when you say we may have an overload version, are you saying we already have it? or we can build such a version? Little confused here. I think its the later one right?
@navneet1v sorry, I should have been more clear: we could add the overloaded version (no flags or functions).
@navneet1v sorry, I should have been more clear: we could add the overloaded version (no flags or functions).
Thanks. I myself don't agree with flags. But we can do override functions. :D Thanks @reta for helping out here.
@martin-gaievski can we start making the change.
It would be great if you folks can confirm my understanding of the approach with new overloaded function with additional param QueryCollectorContext
:
- the way we can skip top docs collector in custom implementation is if return
null
forTopDocsCollectorContext
, in such case it will not be added to the list of collector context objectsLinkedList<QueryCollectorContext> collectors
- function
searchWithCollector(...)
is part of both ConcurrentQPS (QPS stands for QueryPhaseSearcher) and DefaultQPS, not the QueryPhaseSearcher interface. To be able to take advantage of that overloaded function we need to extend both implementation of core QPS - 2 means that we need to use our implementations instead of those used in core today. For that we need to override
searchWith
from QueryPhaseSearcherWrapper in the plugin.
the way we can skip top docs collector in custom implementation is if return null for TopDocsCollectorContext, in such case it will not be added to the list of collector context objects LinkedList<QueryCollectorContext> collectors
:+1
function searchWithCollector(...) is part of both ConcurrentQPS (QPS stands for QueryPhaseSearcher) and DefaultQPS, not the QueryPhaseSearcher interface. To be able to take advantage of that overloaded function we need to extend both implementation of core QPS
That is correct, QueryPhaseSearcher
is unchanged, the idea is to allow reusing the implementation only
2 means that we need to use our implementations instead of those used in core today. For that we need to override searchWith from QueryPhaseSearcherWrapper in the plugin.
I think no, the implementation would follow this chain: searchWithCollector -> searchWithCollector -> new overloaded searchWithCollector
, so you would only need to override searchWithCollector
to pass null
as queryCollectorContext
@martin-gaievski i know you did some deep-dive on the impact of having topdocs collector and hybrid search collector both running. Can you paste the flame graph and benchmark results here for visibility.
Sure Navneet, after we've implemented one optimization on plugin side I got following numbers :
One sub-query that selects 11M documents
Bool: 84.7201
Hybrid: 256.799
One sub-query that selects 1.6K documents
Bool: 89.6258
Hybrid: 85.4563
Three sub-query that select 15M documents
Bool: 90.1481
Hybrid: 326.331
Based on profiling info (flamegraphs are attached) 45 to 80% percent of CPU time taken by the TopDocsCollector related methods, mainly by collect
method.
For reference: benchmark is done for 2.13 using noaa OSB workload. Following search queries used for benchmark
Bool queries:
Query 1
"size": 100,
"query": {
"bool": {
"should": [
{
"term": {
"station.country_code": "JA"
}
},
{
"range": {
"TRANGE": {
"gte": 0,
"lte": 30
}
}
},
{
"range": {
"date": {
"gte": "2016-06-04",
"format":"yyyy-MM-dd"
}
}
}
]
}
}
Query 2
"size": 100,
"query": {
"bool": {
"should": [
{
"range": {
"TRANGE": {
"gte": -100,
"lte": -50
}
}
}
]
}
}
Query 3
"size": 100,
"query": {
"bool": {
"should": [
{
"range": {
"TRANGE": {
"gte": 1,
"lte": 35
}
}
}
]
}
equivalent hybrid queres are:
Query 1
"size": 100,
"query": {
"hybrid": {
"queries": [
{
"term": {
"station.country_code": "JA"
}
},
{
"range": {
"TRANGE": {
"gte": 0,
"lte": 30
}
}
},
{
"range": {
"date": {
"gte": "2016-06-04",
"format":"yyyy-MM-dd"
}
}
}
]
}
}
Query 2
"size": 100,
"query": {
"hybrid": {
"queries": [
{
"range": {
"TRANGE": {
"gte": -100,
"lte": -50
}
}
}
]
}
Query 3
"size": 100,
"query": {
"hybrid": {
"queries": [
{
"range": {
"TRANGE": {
"gte": 1,
"lte": 35
}
}
}
]
}
Thanks martin for providing the details. Let start working on the POC to remove the topdocsCollector and see how much improvement we can get here.
Is there a scope of making this work little more generic?
I mean like to disable or enable any collector?
I mean like to disable or enable any collector?
@vibrantvarun I don't think the idea is to enable or disable any collector, instead the QueryCollectorContext
could be provided w/o query collectors. In that sense, the solution is generic.