OpenSearch icon indicating copy to clipboard operation
OpenSearch copied to clipboard

[BUG] Wildcard sort error when `doc_value` enabled and `size` less than total docs

Open HUSTERGS opened this issue 7 months ago • 6 comments

Describe the bug

Current wildcard field implementation uses SortedSet as the doc value type, which will trigger an dynamic pruning in lucene when the size is less than total docs and query contains sort.

Spefically, lucene expect the term dict also contains the doc values, but this is not true for wildcard, because it uses N-gram to generate terms, but full value in doc value, which will cause an IllegalState Exception

Image

https://github.com/apache/lucene/blob/485141dd34ea866ad9dc59843770969d1b0c8fa2/lucene/core/src/java/org/apache/lucene/search/comparators/TermOrdValComparator.java#L569-L572

Related component

Search

To Reproduce

I add another yaml test case under my repo https://github.com/HUSTERGS/OpenSearch/commit/8c7cb9be881fc01ad5083de76bd052d21c495a4e

Expected behavior

Wildcard sort can work properly

Additional Details

Additional context I'm thinking about backward compatibility, the simplest way is to change SortedSetDocValue to BinaryDocValue, but already written data might have problem when upgrade the OS, we can add some version check to solve this. Another way is to add the full value to the term dict explicitly by WildcardFieldTokenizer which may help address this problem

HUSTERGS avatar Jun 07 '25 02:06 HUSTERGS

hi, this seems like an interesting issue and a great learning opportunity. I could look into it.

ajleong623 avatar Jun 11 '25 16:06 ajleong623

Oh... that is very interesting. @ajleong623 -- please feel free to nag me to review your fix once it's ready.

Both of @HUSTERGS's ideas are worth exploring, I think. I have another idea that may be practical, but I don't want to spoil it. 😁

msfroh avatar Jun 11 '25 23:06 msfroh

For existing indices, maybe we can get around this issue by explicitly disabling skipping for the wildcard field.

bugmakerrrrrr avatar Jun 12 '25 07:06 bugmakerrrrrr

@bugmakerrrrrr That is true. The wildcard field implementation might not be compatible with the dynamic pruning. I wanted to try out the other solutions, first.

ajleong623 avatar Jun 12 '25 17:06 ajleong623

@HUSTERGS @msfroh I am still working on the problem. I am just learning about how wildcards are actually parsed and how sort actually works at a deeper level. So, it looks like the comparison is not valid because in the terms are not parsed into the lucene terms dict.

Based on the solution mentioned above, when we’re adding the documents to the index, make sure that the N gram of the value is also added to the terms dict? (Let me know if this was the intended solution @HUSTERGS )

I am still a bit new to a lot of the codebase, so my understanding at the moment is a bit fuzzy, but I’ve been trying to dig through the path of the error to have a better understanding of what went wrong. But I think this is a great learning experience, and I’m glad I have this opportunity. I’ll plan to have the solution implementation by the middle of the week.

ajleong623 avatar Jun 16 '25 05:06 ajleong623

Based on the solution mentioned above, when we’re adding the documents to the index, make sure that the N gram of the value is also added to the terms dict? (Let me know if this was the intended solution @HUSTERGS )

@ajleong623 Yes, your understanding is correct! My previous thought is to add orignal full text into the term dict (just like how KeywordFieldMapper does), so we don't need to worry about version, old version index will still get the same error, and new version index can proceed normally, but the downside is that the index storage size might inscrease a little bit, there comes the other options which do not increase index size.

HUSTERGS avatar Jun 16 '25 06:06 HUSTERGS

@HUSTERGS @msfroh I made a temporary solution to the issue. It took me a while to understand how the mappers are applied and what actually happens during indexing and dig through the code behind mappers and parsers. But I found that when the document is indexed, there is an Iterable of IndexableField objects to represent the fields in the document. So for WildcardFieldMappers, I added a new field to be applied to the document which would make it so that the value of the field would be available in the index for comparison.

The issue mentioned above is that this would increase the storage, but the temporary solution does work. How else would the comparator linked in the issue work if the full values are not indexed? The draft pull request is linked.

ajleong623 avatar Jun 20 '25 08:06 ajleong623

The issue mentioned above is that this would increase the storage, but the temporary solution does work. How else would the comparator linked in the issue work if the full values are not indexed? The draft pull request is linked.

@ajleong623, This is an interesting approach, and it's similar to what I had in mind in my vague comment above.

Specifically, I was thinking that we could just disable sorting for wildcard fields. If you do want to be able to sort (and perform more efficient term or terms queries), you could add a keyword subfield and sort on that. Of course, that means that you end up storing the doc values twice (once for the wildcard field and once for the keyword field). The index on the keyword field would hold the terms that you've added directly. So your implementation is a little more efficient. Interestingly, your implementation mixes the full field values in with the trigrams (since they're getting added to the same field).

@HUSTERGS, it sounds like @ajleong623 implemented what you described here. Does that sound correct?

msfroh avatar Jul 15 '25 21:07 msfroh

@msfroh Sorry, I'm working on some other things these days. I looked into the PR (#18568), it looks like correct, but the test didn't pass on my local env (hope I'm not get anything wrong)

Maybe @ajleong623 can help double check the test, you can run the specific YamlTest with ./gradlew clean :rest-api-spec:yamlRestTest --tests "org.opensearch.test.rest.ClientYamlTestSuiteIT" -Dtests.rest.suite="search/270_wildcard_fieldtype_queries" under the project root path

My previous thought is to modify the WildcardFieldTokenizer to produce the tokenized "full value", but the new approach seems to be simpler. Or we should just disabling skipping for the wildcard field like @bugmakerrrrrr suggested above ?

HUSTERGS avatar Jul 17 '25 03:07 HUSTERGS

@msfroh @HUSTERGS it has been a little bit of time since I looked at the pr, but what I wanted to do was make sure that the index had access to the terms. However, it seems like some tests are failing, so I have to revisit the pr and see what ended up breaking from the change. But if that does not work, I could disable sorting on wildcard fields altogether. I will finalize the changes before next week.

Also @msfroh I noticed you did a deep dive into indexing in OpenSearch that I watched recently. That really helped me with understanding how documents are processed and how the engine eventually processed the parsed documents. Are there any other resources and deep dives I can look into to understand the architecture of OpenSearch, better?

ajleong623 avatar Jul 17 '25 04:07 ajleong623

It seems the test now fails at this line (not the previous line): https://github.com/apache/lucene/blob/d8b52ade0caee2e0505eead83bd4d6be859a6472/lucene/core/src/java/org/apache/lucene/search/comparators/TermOrdValComparator.java#L593 Which means adding the full value to the term dict will not work, it requires all the indexed term have a cooresponding match in the docvalue (If am not get anything wrong).

I've locally tried the two approaches mentioned above, change the doc value type from SortedSetDocValuesField to BinaryDocValuesField or disabling skipping for Wildcard field, both worked.

HUSTERGS avatar Jul 17 '25 06:07 HUSTERGS

@HUSTERGS What was the reason BinaryDocValuesField worked over StringField?

I also tried to change SortedSetDocValuesField to BinaryDocValuesField, but for some reason, I got an error at https://github.com/apache/lucene/blob/d8b52ade0caee2e0505eead83bd4d6be859a6472/lucene/core/src/java/org/apache/lucene/search/comparators/TermOrdValComparator.java#L167 when getSortedDocValues is called because it seems like the docvalues type of BINARY is not expected.

ajleong623 avatar Jul 17 '25 06:07 ajleong623

@HUSTERGS What was the reason BinaryDocValuesField worked over StringField?

I think these are essentially two different ways to solve the problem, change the doc value type to BinaryDocValuesField will directly disable the skipping optimization used by lucene (which seems only works when the doc value is "SortedXXX"). Actually I previously also thought StringField should work, but it failed on another different assertion : )

I also tried to change SortedSetDocValuesField to BinaryDocValuesField, but for some reason, I got an error at https://github.com/apache/lucene/blob/d8b52ade0caee2e0505eead83bd4d6be859a6472/lucene/core/src/java/org/apache/lucene/search/comparators/TermOrdValComparator.java#L167 when getSortedDocValues is called because it seems like the docvalues type of BINARY is not expected.

Yeah, you also need to modify fielddataBuilder to make the BinaryDocValueField change work https://github.com/opensearch-project/OpenSearch/blob/548e46791a0aa65067e13a6a8063e32252138f12/server/src/main/java/org/opensearch/index/mapper/WildcardFieldMapper.java#L367-L370

Also the DerivedSource related code should also be changed I think: https://github.com/opensearch-project/OpenSearch/blob/548e46791a0aa65067e13a6a8063e32252138f12/server/src/main/java/org/opensearch/index/mapper/WildcardFieldMapper.java#L932

HUSTERGS avatar Jul 17 '25 07:07 HUSTERGS

@HUSTERGS I made the changes, and I believe it does work. I could push the changes to make sure the code also passing the past tests, but I do not feel great about taking credit for your solution. After I push the changes to make sure the Gradle check passes, would you like to link a pr to make sure you get credit?

ajleong623 avatar Jul 17 '25 17:07 ajleong623

@HUSTERGS I made the changes, and I believe it does work. I could push the changes to make sure the code also passing the past tests, but I do not feel great about taking credit for your solution. After I push the changes to make sure the Gradle check passes, would you like to link a pr to make sure you get credit?

HHH, that's totally fine, just go ahead : )

HUSTERGS avatar Jul 18 '25 07:07 HUSTERGS

@HUSTERGS there is a feature to ensure multiple people can be credited for a commit. Is it okay if I coauthor you? https://docs.github.com/en/pull-requests/committing-changes-to-your-project/creating-and-editing-commits/creating-a-commit-with-multiple-authors

Which email and username should I use?

ajleong623 avatar Jul 18 '25 14:07 ajleong623

Sorry, I'm working on some other things these days.

@HUSTERGS I've seen your amazing recent work over on Lucene with making things branchless. It's so cool!

For anyone who's curious: https://github.com/apache/lucene/pull/14896

msfroh avatar Jul 18 '25 18:07 msfroh

Which email and username should I use?

HUSTERGS and [email protected] is good to me, Thank you !

I've seen your amazing recent work over on Lucene with making things branchless. It's so cool!

Thanks ! It brings me a lot of fun to play with lucene and opensearch : )

HUSTERGS avatar Jul 19 '25 03:07 HUSTERGS

@HUSTERGS I think you should be coauthored, now. Check out the pr.

ajleong623 avatar Jul 19 '25 05:07 ajleong623

Issue should be closed due to https://github.com/opensearch-project/OpenSearch/pull/18568.

ajleong623 avatar Sep 12 '25 17:09 ajleong623