[Bug]: Intermittent UnsupportedOperationException errors with nested queries
Describe the bug
We use a boolean query that involves a nested field, like this:
"query": {
  "bool": {
    "filter": [
      {
        "terms": {
          "name.keyword": ["my-test-name"]
        }
      },
      {
        "nested": {
          "path": "foo",
          "query": {
            "match_all": {}
          },
          "inner_hits": {
            "size": 256,
            "sort": [
              {
                "foo.bar": "desc"
              }
            ]
          }
        }
      }
    ]
  }
},
We seem to be hitting the error described in this forum post, where this query gives us intermittent UnsupportedOperationException errors. Have others run into this? Does anybody have more information about how to avoid or debug these errors?
Here is the stack trace from the OpenSearch logs:
Caused by: NotSerializableExceptionWrapper[unsupported_operation_exception: null]
    at org.opensearch.index.fielddata.AbstractNumericDocValues.advance(AbstractNumericDocValues.java:60)
    at org.apache.lucene.search.comparators.NumericComparator$NumericLeafComparator$2.advance(NumericComparator.java:416)
    at org.apache.lucene.search.ConjunctionBulkScorer.score(ConjunctionBulkScorer.java:162)
    at org.opensearch.search.internal.CancellableBulkScorer.score(CancellableBulkScorer.java:71)
    at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:38)
    at org.opensearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:338)
    at org.opensearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:289)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:560)
    at org.opensearch.search.query.QueryPhase.searchWithCollector(QueryPhase.java:361)
    at org.opensearch.search.query.QueryPhase$DefaultQueryPhaseSearcher.searchWithCollector(QueryPhase.java:468)
    at org.opensearch.search.query.QueryPhase$DefaultQueryPhaseSearcher.searchWithCollector(QueryPhase.java:456)
    at org.opensearch.search.query.QueryPhase$DefaultQueryPhaseSearcher.searchWith(QueryPhase.java:438)
    at org.opensearch.search.query.QueryPhaseSearcherWrapper.searchWith(QueryPhaseSearcherWrapper.java:60)
    at org.opensearch.neuralsearch.search.query.HybridQueryPhaseSearcher.searchWith(HybridQueryPhaseSearcher.java:61)
    at org.opensearch.search.query.QueryPhase.executeInternal(QueryPhase.java:284)
    at org.opensearch.search.query.QueryPhase.execute(QueryPhase.java:157)
    at org.opensearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:643)
    at org.opensearch.search.SearchService.executeQueryPhase(SearchService.java:707)
    at org.opensearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:676)
    at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:74)
    at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89)
    at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
    at org.opensearch.threadpool.TaskAwareRunnable.doRun(TaskAwareRunnable.java:78)
    at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
    at org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:59)
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1023)
    at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.lang.Thread.run(Thread.java:1583)
To reproduce
Use a boolean query with a nested path like the query above.
Expected behavior
No intermittent failures
Interesting ... that implementation of advance on AbstractNumericDocValues should theoretically never get called, because every possible subclass should either override it or guarantee that it doesn't get called (by only getting used for "safe" cases).
The Javadoc says:
* Base implementation that throws an {@link IOException} for the
* {@link DocIdSetIterator} APIs. This impl is safe to use for sorting and
* aggregations, which only use {@link #advanceExact(int)} and
* {@link #longValue()}.
*
* In case when optimizations based on point values are used, the {@link #advance(int)}
* and, optionally, {@link #cost()} have to be implemented as well.
In this case, the doc values are clearly being used in the query (calling advance), related to finding a competitive value.
I tried removing the "implementation" of advance from AbstractNumericDocValues to see what fails to compile and got the following output:
server/src/main/java/org/opensearch/search/MultiValueMode.java:555: error: <anonymous org.opensearch.search.MultiValueMode$6> is not abstract and does not override abstract method advance(int) in DocIdSetIterator
return new AbstractNumericDocValues() {
^
server/src/main/java/org/opensearch/search/MultiValueMode.java:609: error: <anonymous org.opensearch.search.MultiValueMode$7> is not abstract and does not override abstract method advance(int) in DocIdSetIterator
return new AbstractNumericDocValues() {
^
server/src/main/java/org/opensearch/search/aggregations/bucket/sampler/DiversifiedBytesHashSamplerAggregator.java:128: error: <anonymous org.opensearch.search.aggregations.bucket.sampler.DiversifiedBytesHashSamplerAggregator$DiverseDocsDeferringCollector$ValuesDiversifiedTopDocsCollector$1> is not abstract and does not override abstract method advance(int) in DocIdSetIterator
return new AbstractNumericDocValues() {
^
server/src/main/java/org/opensearch/search/aggregations/bucket/sampler/DiversifiedMapSamplerAggregator.java:138: error: <anonymous org.opensearch.search.aggregations.bucket.sampler.DiversifiedMapSamplerAggregator$DiverseDocsDeferringCollector$ValuesDiversifiedTopDocsCollector$1> is not abstract and does not override abstract method advance(int) in DocIdSetIterator
return new AbstractNumericDocValues() {
^
server/src/main/java/org/opensearch/search/aggregations/bucket/sampler/DiversifiedNumericSamplerAggregator.java:125: error: <anonymous org.opensearch.search.aggregations.bucket.sampler.DiversifiedNumericSamplerAggregator$DiverseDocsDeferringCollector$ValuesDiversifiedTopDocsCollector$1> is not abstract and does not override abstract method advance(int) in DocIdSetIterator
return new AbstractNumericDocValues() {
^
server/src/main/java/org/opensearch/search/aggregations/bucket/sampler/DiversifiedOrdinalsSamplerAggregator.java:123: error: <anonymous org.opensearch.search.aggregations.bucket.sampler.DiversifiedOrdinalsSamplerAggregator$DiverseDocsDeferringCollector$ValuesDiversifiedTopDocsCollector$1> is not abstract and does not override abstract method advance(int) in DocIdSetIterator
return new AbstractNumericDocValues() {
^
server/src/main/java/org/opensearch/search/aggregations/bucket/sampler/DiversifiedOrdinalsSamplerAggregator.java:141: error: <anonymous org.opensearch.search.aggregations.bucket.sampler.DiversifiedOrdinalsSamplerAggregator$DiverseDocsDeferringCollector$ValuesDiversifiedTopDocsCollector$2> is not abstract and does not override abstract method advance(int) in DocIdSetIterator
return new AbstractNumericDocValues() {
^
server/src/main/java/org/apache/lucene/search/grouping/CollapsingDocValuesSource.java:119: error: <anonymous org.apache.lucene.search.grouping.CollapsingDocValuesSource$Numeric$1> is not abstract and does not override abstract method advance(int) in DocIdSetIterator
values = new AbstractNumericDocValues() {
^
My hunch is that the problem is coming from one of the first two implementations (the anonymous classes in MultiValueMode), since the request doesn't involve diversified sampler aggregators and there's no field collapsing.
In particular, I'm looking at the implementation on line 609, because that includes a parent bitset and child DocIdSetIterator, which are used to evaluate nested queries. I think a possible advance method should delegate to advanceExact. Since advanceExact always returns true, I think advance can return its input (since somehow this implementation always has the thing we're trying to advance to).
@lizjackson-toast -- which OpenSearch version are you using? Are you able to apply that fix to MultiValueMode and see if it eliminates the problem at your end? Thanks!
Incidentally, this looks related to https://github.com/opensearch-project/OpenSearch/pull/12089, which was released in 2.12.
Thanks @msfroh for the quick response! I appreciate that. We are using OpenSearch version 2.17.0.
You mention line 609, but here in the latest commit on MultiValueMode, line 609 is just int count = 0. Can you clarify which line(s) need to be updated to allow advance to delegate to advanceExact in the fix you have in mind?
Thanks again!
Oh, I think you may mean line 690 and not 609 – is that right?
If so, if I interpret correctly, you're suggesting that we change this:
@Override
public int advance(int target) throws IOException {
return values.advance(target);
}
To this:
@Override
public int advance(int target) throws IOException {
return values.advanceExact(target);
}
Is that right? Do you have any docs about how we can test this? We appreciate your help!
I was looking at the 2.13 branch initially, to try to see if I could connect things to the stack trace in the linked forum post -- though I now notice that the logs point to 2.16.
Anyway, it looks like the AbstractNumericDocValues implementation that I was talking about yesterday has moved down to line 812.
My concern is around the advance logic and how it should interact with parent documents, when there's nesting. I see that @reta handled a similar case around NumericDoubleValues here. Now I'm wondering if that implementation is correct.
Specifically, for each of the select() methods where there's a parentDoc bitset passed in, I feel like the returned doc values should implement advance() like:
@Override
public int advance(int target) throws IOException {
if (advanceExact(target)) {
return target;
}
throw new IllegalStateException("advanceExact should always return true");
}
(That will work for all of the select() implementations in that class except the SortedDocValues version at line 1183, but that's okay. We only need advance() for numeric iterators, and SortedDocValues is for strings.)
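To make the proposal above concrete, here is a minimal, self-contained sketch with no Lucene or OpenSearch dependency. SketchNumericDocValues, AlwaysPresentValues and AdvanceSketch are hypothetical stand-ins invented for illustration; they only mimic the shape of AbstractNumericDocValues and of a select() result whose advanceExact always returns true. It is not the actual MultiValueMode code, and like the snippet above it does not handle targets past the last document.

```java
// Hypothetical stand-in for the shape of AbstractNumericDocValues (not the real class).
abstract class SketchNumericDocValues {
    abstract boolean advanceExact(int target);

    abstract long longValue();

    // Mirrors the base behavior seen in the stack trace: unsupported unless overridden.
    int advance(int target) {
        throw new UnsupportedOperationException();
    }
}

// Stand-in for doc values where every requested doc resolves to some value,
// so advanceExact always returns true.
class AlwaysPresentValues extends SketchNumericDocValues {
    private final long[] valuesByDoc;
    private int doc = -1;

    AlwaysPresentValues(long[] valuesByDoc) {
        this.valuesByDoc = valuesByDoc;
    }

    @Override
    boolean advanceExact(int target) {
        doc = target;
        return true;
    }

    @Override
    long longValue() {
        return valuesByDoc[doc];
    }

    // The override proposed in the thread: delegate to advanceExact and return the target.
    @Override
    int advance(int target) {
        if (advanceExact(target)) {
            return target;
        }
        throw new IllegalStateException("advanceExact should always return true");
    }
}

class AdvanceSketch {
    // Simulates a caller that skips ahead with advance() and then reads the value.
    static long valueAfterAdvance(SketchNumericDocValues values, int target) {
        int doc = values.advance(target);
        return doc == target ? values.longValue() : -1L;
    }

    public static void main(String[] args) {
        long[] data = {10L, 20L, 30L};
        System.out.println(valueAfterAdvance(new AlwaysPresentValues(data), 2)); // prints 30
    }
}
```

A subclass that leaves advance() alone still throws UnsupportedOperationException when a caller skips ahead, which matches the failure in the stack trace.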
@reta -- do you remember if you looked into the nested docs case when you worked on https://github.com/opensearch-project/OpenSearch/pull/12089 ? I skimmed through it and didn't see anything, but I might have missed it.
@msfroh I definitely did not look into the nested docs case; that is a miss on my side :( We apparently have no test cases that manifest the problem with nested docs, which partially explains the miss.
Thanks, @msfroh and @reta! In terms of what my own team should do next, is this an issue you'll look into fixing on your end, would you like to collaborate on a fix, something else? Thanks again for looking into this!
Thanks @lizjackson-toast , if your team could submit a fix, that would be just great!
If you need any help to get started on a fix, please let us know!
Thanks @msfroh! I'm hoping to start on a fix for this within the next week or so, but I can't find any information about how to test the change. I see here that there are various test files I can update, but how do people manually test their changes to ensure the change has the effect they want?
For these kinds of intermittent failures, it can be difficult. There might be some specific condition that causes the failure. In this case, it might have to do with the order of the values seen from a numeric field, since I think it's trying to "prune" (i.e. skip values above/below some threshold based on the "best" seen so far), which is usually what NumericComparator does.
You might have the most luck with a Java integration test with some random values. Looking at your example above, I guess you need a mapping where the top-level docs have a keyword field and a nested field, where the nested field has a numeric field that you're going to sort on. (I'm not sure which numeric type you're using, but it would be good to test as many as possible.) I would generate a bunch of random docs in a for-loop and pass them to the indexRandom method, which helpfully introduces a bit of chaos (shuffling them into a random order and flushing at random times in the middle of the indexing). For the random docs, I would use a randomBoolean() to generate the keyword field so they match or don't. For the numeric field, you might not actually need to go random, since the docs are going to be shuffled anyway. Maybe you could just pick values based on the loop index? For an example of a Java integration test, take a look at SimpleSearchIT. A lot of the test methods there have the kind of workflow you need: 1) create an index, 2) index some documents, 3) run some searches, 4) assert something about the search results.
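As a rough illustration of the document-generation step described above, here is a self-contained sketch that only builds JSON sources. The field names (name, foo, foo.bar) come from the query in the original report; the NestedSortTestDocs class, the "other-name" value and the child counts are assumptions made for the example. In a real integration test the sources would be handed to indexRandom against an index whose mapping declares foo as nested, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper: generates parent docs with a keyword-style "name" field
// and a "foo" array of nested children that each carry a numeric "bar".
class NestedSortTestDocs {
    static List<String> generate(int count, Random random) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            // A random boolean decides whether the doc matches the terms filter.
            String name = random.nextBoolean() ? "my-test-name" : "other-name";
            // One to three nested children; values derive from the loop index,
            // since shuffling during indexing already randomizes their order.
            int children = 1 + random.nextInt(3);
            StringBuilder nested = new StringBuilder();
            for (int c = 0; c < children; c++) {
                if (c > 0) {
                    nested.append(',');
                }
                nested.append("{\"bar\":").append(i * 10 + c).append('}');
            }
            docs.add("{\"name\":\"" + name + "\",\"foo\":[" + nested + "]}");
        }
        return docs;
    }

    public static void main(String[] args) {
        for (String doc : generate(5, new Random(42L))) {
            System.out.println(doc);
        }
    }
}
```

Repeating the generation for each numeric type under test (long, integer, double, date) would cover the "test as many as possible" suggestion.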
Feel free to reach out if you run into any roadblocks!
I believe we have also run into this same issue. We managed to get around it by adding missing: "_first" to our sort. The field we were sorting on was of type long and is always present in the document (so the missing arg made no difference to our result). Note that missing: "_last" did not help; we still got the exception.
Another interesting point is that it only affected us for single field sorts (i.e. sort array with only a single value) - sorts on multiple fields were unaffected.
Thanks @md384! We just tried this ourselves and unfortunately the OpenSearch 500s didn't go away. What version of OpenSearch are you using?
Unfortunately our team hasn't been able to prioritize fixing the OpenSearch problem ourselves, primarily because we're not sure how to test the change with confidence. @msfroh sent some thorough info, but it's still tough to confirm with manual testing (in addition to an integration test) that the change makes our errors go away. @msfroh if you're able to tackle this on your end, definitely post here, it sounds like various organizations are running into this bug!
We're using 2.17 via Amazon OpenSearch Service
We added the missing: "_first" argument to our queries and this finally made our 500 errors go away! Thanks for the suggestion @md384.
It would still of course be great if someone somewhere fixes the root issue on the OpenSearch side, but we've at least got our errors sorted out for now. (We are on OpenSearch 2.17.)
We have this bug (we were getting breakages yesterday, it worked this morning, now it's broken again; we may have added some documents but we haven't changed the mapping or rebuilt the index in the interim) and indeed adding "missing": "_first" to the search directive in the console fixes it. Interestingly "missing": "_last" gets us the crash again. Fortunately this field is always populated in our index so that's not an issue.
We're on 3.1 via AWS's managed OpenSearch instance. If anyone trying to fix this would like more information about our setup I'm happy to help.
We encountered the same issue with OpenSearch 2.18. Our sort criterion is a date that may or may not be set. The "missing": "_first" fix works, but it changes the response.
Setting "track_total_hits": true in the search requests also seems to resolve the shard failures with "unsupported_operation_exception" and has prevented further issues.
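For anyone wanting to try the two workarounds reported in this thread, here is a small sketch that builds the request body from the original report with both applied. The SortWorkaroundQuery class is hypothetical, and putting "missing" on the inner_hits sort is an assumption based on the original query; the commenters above did not say exactly where their sort clause lives. These are workarounds only and do not fix the underlying advance() problem.

```java
// Hypothetical helper that assembles the search body as a plain JSON string.
class SortWorkaroundQuery {
    static String searchBody(String missing, boolean trackTotalHits) {
        // Expanded sort form, which has room for the "missing" option.
        String sort = "{\"foo.bar\":{\"order\":\"desc\",\"missing\":\"" + missing + "\"}}";
        String nested = "{\"nested\":{\"path\":\"foo\",\"query\":{\"match_all\":{}},"
            + "\"inner_hits\":{\"size\":256,\"sort\":[" + sort + "]}}}";
        String terms = "{\"terms\":{\"name.keyword\":[\"my-test-name\"]}}";
        String query = "\"query\":{\"bool\":{\"filter\":[" + terms + "," + nested + "]}}";
        // track_total_hits is a top-level search option, next to "query".
        return "{" + (trackTotalHits ? "\"track_total_hits\":true," : "") + query + "}";
    }

    public static void main(String[] args) {
        System.out.println(searchBody("_first", true));
    }
}
```

As noted above, "_first" changes the ordering whenever the sort field can actually be missing, so it is only a transparent workaround for fields that are always populated.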