[BUG] Wildcard sort error when `doc_value` enabled and `size` less than total docs
Describe the bug
Current wildcard field implementation uses SortedSet as the doc value type, which will trigger an dynamic pruning in lucene when the size is less than total docs and query contains sort.
Spefically, lucene expect the term dict also contains the doc values, but this is not true for wildcard, because it uses N-gram to generate terms, but full value in doc value, which will cause an IllegalState Exception
https://github.com/apache/lucene/blob/485141dd34ea866ad9dc59843770969d1b0c8fa2/lucene/core/src/java/org/apache/lucene/search/comparators/TermOrdValComparator.java#L569-L572
Related component
Search
To Reproduce
I add another yaml test case under my repo https://github.com/HUSTERGS/OpenSearch/commit/8c7cb9be881fc01ad5083de76bd052d21c495a4e
Expected behavior
Wildcard sort can work properly
Additional Details
Additional context
I'm thinking about backward compatibility, the simplest way is to change SortedSetDocValue to BinaryDocValue, but already written data might have problem when upgrade the OS, we can add some version check to solve this. Another way is to add the full value to the term dict explicitly by WildcardFieldTokenizer which may help address this problem
hi, this seems like an interesting issue and a great learning opportunity. I could look into it.
Oh... that is very interesting. @ajleong623 -- please feel free to nag me to review your fix once it's ready.
Both of @HUSTERGS's ideas are worth exploring, I think. I have another idea that may be practical, but I don't want to spoil it. 😁
For existing indices, maybe we can get around this issue by explicitly disabling skipping for the wildcard field.
@bugmakerrrrrr That is true. The wildcard field implementation might not be compatible with the dynamic pruning. I wanted to try out the other solutions, first.
@HUSTERGS @msfroh I am still working on the problem. I am just learning about how wildcards are actually parsed and how sort actually works at a deeper level. So, it looks like the comparison is not valid because in the terms are not parsed into the lucene terms dict.
Based on the solution mentioned above, when we’re adding the documents to the index, make sure that the N gram of the value is also added to the terms dict? (Let me know if this was the intended solution @HUSTERGS )
I am still a bit new to a lot of the codebase, so my understanding at the moment is a bit fuzzy, but I’ve been trying to dig through the path of the error to have a better understanding of what went wrong. But I think this is a great learning experience, and I’m glad I have this opportunity. I’ll plan to have the solution implementation by the middle of the week.
Based on the solution mentioned above, when we’re adding the documents to the index, make sure that the N gram of the value is also added to the terms dict? (Let me know if this was the intended solution @HUSTERGS )
@ajleong623 Yes, your understanding is correct! My previous thought is to add orignal full text into the term dict (just like how KeywordFieldMapper does), so we don't need to worry about version, old version index will still get the same error, and new version index can proceed normally, but the downside is that the index storage size might inscrease a little bit, there comes the other options which do not increase index size.
@HUSTERGS @msfroh I made a temporary solution to the issue. It took me a while to understand how the mappers are applied and what actually happens during indexing and dig through the code behind mappers and parsers. But I found that when the document is indexed, there is an Iterable of IndexableField objects to represent the fields in the document. So for WildcardFieldMappers, I added a new field to be applied to the document which would make it so that the value of the field would be available in the index for comparison.
The issue mentioned above is that this would increase the storage, but the temporary solution does work. How else would the comparator linked in the issue work if the full values are not indexed? The draft pull request is linked.
The issue mentioned above is that this would increase the storage, but the temporary solution does work. How else would the comparator linked in the issue work if the full values are not indexed? The draft pull request is linked.
@ajleong623, This is an interesting approach, and it's similar to what I had in mind in my vague comment above.
Specifically, I was thinking that we could just disable sorting for wildcard fields. If you do want to be able to sort (and perform more efficient term or terms queries), you could add a keyword subfield and sort on that. Of course, that means that you end up storing the doc values twice (once for the wildcard field and once for the keyword field). The index on the keyword field would hold the terms that you've added directly. So your implementation is a little more efficient. Interestingly, your implementation mixes the full field values in with the trigrams (since they're getting added to the same field).
@HUSTERGS, it sounds like @ajleong623 implemented what you described here. Does that sound correct?
@msfroh Sorry, I'm working on some other things these days. I looked into the PR (#18568), it looks like correct, but the test didn't pass on my local env (hope I'm not get anything wrong)
Maybe @ajleong623 can help double check the test, you can run the specific YamlTest with
./gradlew clean :rest-api-spec:yamlRestTest --tests "org.opensearch.test.rest.ClientYamlTestSuiteIT" -Dtests.rest.suite="search/270_wildcard_fieldtype_queries" under the project root path
My previous thought is to modify the WildcardFieldTokenizer to produce the tokenized "full value", but the new approach seems to be simpler. Or we should just disabling skipping for the wildcard field like @bugmakerrrrrr suggested above ?
@msfroh @HUSTERGS it has been a little bit of time since I looked at the pr, but what I wanted to do was make sure that the index had access to the terms. However, it seems like some tests are failing, so I have to revisit the pr and see what ended up breaking from the change. But if that does not work, I could disable sorting on wildcard fields altogether. I will finalize the changes before next week.
Also @msfroh I noticed you did a deep dive into indexing in OpenSearch that I watched recently. That really helped me with understanding how documents are processed and how the engine eventually processed the parsed documents. Are there any other resources and deep dives I can look into to understand the architecture of OpenSearch, better?
It seems the test now fails at this line (not the previous line): https://github.com/apache/lucene/blob/d8b52ade0caee2e0505eead83bd4d6be859a6472/lucene/core/src/java/org/apache/lucene/search/comparators/TermOrdValComparator.java#L593 Which means adding the full value to the term dict will not work, it requires all the indexed term have a cooresponding match in the docvalue (If am not get anything wrong).
I've locally tried the two approaches mentioned above, change the doc value type from SortedSetDocValuesField to BinaryDocValuesField or disabling skipping for Wildcard field, both worked.
@HUSTERGS What was the reason BinaryDocValuesField worked over StringField?
I also tried to change SortedSetDocValuesField to BinaryDocValuesField, but for some reason, I got an error at https://github.com/apache/lucene/blob/d8b52ade0caee2e0505eead83bd4d6be859a6472/lucene/core/src/java/org/apache/lucene/search/comparators/TermOrdValComparator.java#L167 when getSortedDocValues is called because it seems like the docvalues type of BINARY is not expected.
@HUSTERGS What was the reason
BinaryDocValuesFieldworked overStringField?
I think these are essentially two different ways to solve the problem, change the doc value type to BinaryDocValuesField will directly disable the skipping optimization used by lucene (which seems only works when the doc value is "SortedXXX"). Actually I previously also thought StringField should work, but it failed on another different assertion : )
I also tried to change
SortedSetDocValuesFieldtoBinaryDocValuesField, but for some reason, I got an error at https://github.com/apache/lucene/blob/d8b52ade0caee2e0505eead83bd4d6be859a6472/lucene/core/src/java/org/apache/lucene/search/comparators/TermOrdValComparator.java#L167 whengetSortedDocValuesis called because it seems like the docvalues type of BINARY is not expected.
Yeah, you also need to modify fielddataBuilder to make the BinaryDocValueField change work
https://github.com/opensearch-project/OpenSearch/blob/548e46791a0aa65067e13a6a8063e32252138f12/server/src/main/java/org/opensearch/index/mapper/WildcardFieldMapper.java#L367-L370
Also the DerivedSource related code should also be changed I think: https://github.com/opensearch-project/OpenSearch/blob/548e46791a0aa65067e13a6a8063e32252138f12/server/src/main/java/org/opensearch/index/mapper/WildcardFieldMapper.java#L932
@HUSTERGS I made the changes, and I believe it does work. I could push the changes to make sure the code also passing the past tests, but I do not feel great about taking credit for your solution. After I push the changes to make sure the Gradle check passes, would you like to link a pr to make sure you get credit?
@HUSTERGS I made the changes, and I believe it does work. I could push the changes to make sure the code also passing the past tests, but I do not feel great about taking credit for your solution. After I push the changes to make sure the Gradle check passes, would you like to link a pr to make sure you get credit?
HHH, that's totally fine, just go ahead : )
@HUSTERGS there is a feature to ensure multiple people can be credited for a commit. Is it okay if I coauthor you? https://docs.github.com/en/pull-requests/committing-changes-to-your-project/creating-and-editing-commits/creating-a-commit-with-multiple-authors
Which email and username should I use?
Sorry, I'm working on some other things these days.
@HUSTERGS I've seen your amazing recent work over on Lucene with making things branchless. It's so cool!
For anyone who's curious: https://github.com/apache/lucene/pull/14896
Which email and username should I use?
HUSTERGS and [email protected] is good to me, Thank you !
I've seen your amazing recent work over on Lucene with making things branchless. It's so cool!
Thanks ! It brings me a lot of fun to play with lucene and opensearch : )
@HUSTERGS I think you should be coauthored, now. Check out the pr.
Issue should be closed due to https://github.com/opensearch-project/OpenSearch/pull/18568.