lucene
lucene copied to clipboard
LUCENE-10357 Ghost fields and postings/points
Description
#11393 Please provide a short description of the changes you're making with this pull request.
Solution
Please provide a short description of the approach taken to implement your solution.
Tests
Please describe the tests you've developed or run to confirm this patch implements the feature or solves the problem.
Checklist
Please review the following and check all that apply:
- [ ] I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
- [ ] I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
- [ ] I have developed this patch against the
main
branch. - [ ] I have run
./gradlew check
. - [ ] I have added tests for my changes.
Hi @jpountz, I have tried to put up a patch from the suggestions you have made in LUCENE-10357. Can you please review and provide feedback that I am going in the right direction ? Thank you.
Thanks for tackling this! Your are going in the right direction. Could we drop most of the if (terms != Terms.EMPTY)
checks and handle them like any non-null Terms
instance?
@jpountz Made one more attempt to fix test failures. Can you please take a look again ? Thank you fir being patient with me.
Thanks @shahrs87, could you now try to remove all instances of if (terms == Terms.EMPTY)
? Hopefully existing logic should work with Terms instances regardless of whether they are empty or not.
could you now try to remove all instances of if (terms == Terms.EMPTY)?
@jpountz, I tried to remove all the instances of if (terms == Terms.EMPTY)?
but couldn't remove the remaining ones in the patch. Otherwise it will cause test failures.
@jpountz There are no null or terms.EMPTY checks in CheckIndex class anymore.
@jpountz Hi Adrian, can you please make one more pass over the PR and provide your feedback ? Thank you.
@shahrs87 Can you look into removing all other instances of terms == Terms.EMPTY
or terms != Terms.EMPTY
as well? To do this while keeping tests passing, I think you'll need to create empty Terms
instances that still honor the options of the FieldInfo
as per my previous suggestion. E.g. you could add a new Terms#empty(FieldInfo)
helper method that does the right thing for hasFreqs()
, hasPositions()
, etc. by looking at the index options of the FieldInfo
.
Thank you @jpountz for being so patient with me. I tried your above suggestion and hit the following problem. Locally I removed the following check from BloomFilteringPostingsFormat.java.
if (result == Terms.EMPTY) {
return Terms.EMPTY;
}
The following test failed: org.apache.lucene.index.TestPostingsOffsets.testCrazyOffsetGap
Can be reproduced by:
gradlew :lucene:core:test --tests "org.apache.lucene.index.TestPostingsOffsets.testCrazyOffsetGap" -Ptests.jvms=8 -Ptests.jvmargs=-XX:TieredStopAtLevel=1 -Ptests.seed=66D42010A32F9625 -Ptests.locale=chr -Ptests.timezone=Asia/Jayapura -Ptests.gui=false -Ptests.file.encoding=UTF-8
The exception stack trace is:
field "foo" should have hasFreqs=true but got false
org.apache.lucene.index.CheckIndex$CheckIndexException: field "foo" should have hasFreqs=true but got false
at __randomizedtesting.SeedInfo.seed([66D42010A32F9625:91A6069AD047326B]:0)
at app//org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:1434)
at app//org.apache.lucene.index.CheckIndex.testPostings(CheckIndex.java:2425)
at app//org.apache.lucene.index.CheckIndex.testSegment(CheckIndex.java:999)
at app//org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:714)
at app//org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:552)
at app//org.apache.lucene.tests.util.TestUtil.checkIndex(TestUtil.java:343)
at app//org.apache.lucene.tests.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:909)
at app//org.apache.lucene.index.TestPostingsOffsets.testCrazyOffsetGap(TestPostingsOffsets.java:462)
The terms object here is of type PerFieldPostingsFormat#FieldsReader The fieldsProducer object within PerFieldPostingsFormat#FieldsReader is of type BloomFilteringPostingsFormat#BloomFilteredFieldsProducer The delegateFieldsProducer within BloomFilteringPostingsFormat#BloomFilteredFieldsProducer is of type Lucene90BlockTreeTermsReader
This is the code snippet which I changed within Lucene90BlockTreeTermsReader#terms
method
@Override
public Terms terms(String field) throws IOException {
assert field != null;
Terms terms = fieldMap.get(field);
return terms == null ? Terms.EMPTY : terms;
}
From your suggestion instead of returning Terms.EMPTY, I thought to return Terms.empty(fieldInfo) with overloaded hasFreqs, hasPositions, etc. methods. But the problem is there is no way to get hold of FieldsInfo
object from field
string. The fieldMap map within Lucene90BlockTreeTermsReader is empty. Is it ok to change the method argument for terms method from field String to fieldInfo object within Lucene90BlockTreeTermsReader ? public Terms terms(String field) throws IOException
--> public Terms terms(FieldInfo fieldInfo) throws IOException
I think NO but just wanted to ask.
This commit is the changes I made following your suggestion.
Please correct me if I am misunderstanding anything. Thank you again.
I was busy with some other security related work at my day job so couldn't update this PR. Apologies for that. @jpountz Can you please review this PR again ?
@jpountz Can you please review this patch again? Thank you
Apologies @shahrs87 , but after looking more at your changes, I'm getting worried that this change is harder than I had anticipated. I was optimistically hoping that never returning null Terms would automatically help save some NullPointerException
s but looking at your change it looks like there are many places where we just need to replace these null checks with other checks if we want to keep the logic correct, so things wouldn't magically work on segments that have ghost fields unless we explicitly test ghost fields. So I would suggest closing this PR, apologies for the time you spent on it.
Thank you for persisting so hard on this one @shahrs87 -- I'm sorry it looks like we should close it at this point, but your efforts / iterations were needed to see that we are mostly exchanging one if
for another.
The whole null
vs EMPTY
is such a challenge (for Lucene anyways).