lucene icon indicating copy to clipboard operation
lucene copied to clipboard

LUCENE-10357 Ghost fields and postings/points

Open shahrs87 opened this issue 2 years ago • 10 comments

Description

#11393 Please provide a short description of the changes you're making with this pull request.

Solution

Please provide a short description of the approach taken to implement your solution.

Tests

Please describe the tests you've developed or run to confirm this patch implements the feature or solves the problem.

Checklist

Please review the following and check all that apply:

  • [ ] I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • [ ] I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
  • [ ] I have developed this patch against the main branch.
  • [ ] I have run ./gradlew check.
  • [ ] I have added tests for my changes.

shahrs87 avatar May 19 '22 20:05 shahrs87

Hi @jpountz, I have tried to put up a patch from the suggestions you have made in LUCENE-10357. Can you please review and provide feedback that I am going in the right direction ? Thank you.

shahrs87 avatar May 19 '22 21:05 shahrs87

Thanks for tackling this! Your are going in the right direction. Could we drop most of the if (terms != Terms.EMPTY) checks and handle them like any non-null Terms instance?

jpountz avatar May 20 '22 08:05 jpountz

@jpountz Made one more attempt to fix test failures. Can you please take a look again ? Thank you fir being patient with me.

shahrs87 avatar Jun 15 '22 19:06 shahrs87

Thanks @shahrs87, could you now try to remove all instances of if (terms == Terms.EMPTY)? Hopefully existing logic should work with Terms instances regardless of whether they are empty or not.

jpountz avatar Jun 17 '22 15:06 jpountz

could you now try to remove all instances of if (terms == Terms.EMPTY)?

@jpountz, I tried to remove all the instances of if (terms == Terms.EMPTY)? but couldn't remove the remaining ones in the patch. Otherwise it will cause test failures.

shahrs87 avatar Jun 30 '22 22:06 shahrs87

@jpountz There are no null or terms.EMPTY checks in CheckIndex class anymore.

shahrs87 avatar Jul 05 '22 19:07 shahrs87

@jpountz Hi Adrian, can you please make one more pass over the PR and provide your feedback ? Thank you.

shahrs87 avatar Jul 11 '22 16:07 shahrs87

@shahrs87 Can you look into removing all other instances of terms == Terms.EMPTY or terms != Terms.EMPTY as well? To do this while keeping tests passing, I think you'll need to create empty Terms instances that still honor the options of the FieldInfo as per my previous suggestion. E.g. you could add a new Terms#empty(FieldInfo) helper method that does the right thing for hasFreqs(), hasPositions(), etc. by looking at the index options of the FieldInfo.

jpountz avatar Jul 12 '22 09:07 jpountz

Thank you @jpountz for being so patient with me. I tried your above suggestion and hit the following problem. Locally I removed the following check from BloomFilteringPostingsFormat.java.

        if (result == Terms.EMPTY) {
          return Terms.EMPTY;
        }

The following test failed: org.apache.lucene.index.TestPostingsOffsets.testCrazyOffsetGap Can be reproduced by: gradlew :lucene:core:test --tests "org.apache.lucene.index.TestPostingsOffsets.testCrazyOffsetGap" -Ptests.jvms=8 -Ptests.jvmargs=-XX:TieredStopAtLevel=1 -Ptests.seed=66D42010A32F9625 -Ptests.locale=chr -Ptests.timezone=Asia/Jayapura -Ptests.gui=false -Ptests.file.encoding=UTF-8

The exception stack trace is:

field "foo" should have hasFreqs=true but got false
org.apache.lucene.index.CheckIndex$CheckIndexException: field "foo" should have hasFreqs=true but got false
	at __randomizedtesting.SeedInfo.seed([66D42010A32F9625:91A6069AD047326B]:0)
	at app//org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:1434)
	at app//org.apache.lucene.index.CheckIndex.testPostings(CheckIndex.java:2425)
	at app//org.apache.lucene.index.CheckIndex.testSegment(CheckIndex.java:999)
	at app//org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:714)
	at app//org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:552)
	at app//org.apache.lucene.tests.util.TestUtil.checkIndex(TestUtil.java:343)
	at app//org.apache.lucene.tests.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:909)
	at app//org.apache.lucene.index.TestPostingsOffsets.testCrazyOffsetGap(TestPostingsOffsets.java:462)

The terms object here is of type PerFieldPostingsFormat#FieldsReader The fieldsProducer object within PerFieldPostingsFormat#FieldsReader is of type BloomFilteringPostingsFormat#BloomFilteredFieldsProducer The delegateFieldsProducer within BloomFilteringPostingsFormat#BloomFilteredFieldsProducer is of type Lucene90BlockTreeTermsReader

This is the code snippet which I changed within Lucene90BlockTreeTermsReader#terms method

  @Override
  public Terms terms(String field) throws IOException {
    assert field != null;
    Terms terms = fieldMap.get(field);
    return terms == null ? Terms.EMPTY : terms;
  }

From your suggestion instead of returning Terms.EMPTY, I thought to return Terms.empty(fieldInfo) with overloaded hasFreqs, hasPositions, etc. methods. But the problem is there is no way to get hold of FieldsInfo object from field string. The fieldMap map within Lucene90BlockTreeTermsReader is empty. Is it ok to change the method argument for terms method from field String to fieldInfo object within Lucene90BlockTreeTermsReader ? public Terms terms(String field) throws IOException --> public Terms terms(FieldInfo fieldInfo) throws IOException I think NO but just wanted to ask.

This commit is the changes I made following your suggestion.

Please correct me if I am misunderstanding anything. Thank you again.

shahrs87 avatar Jul 13 '22 22:07 shahrs87

I was busy with some other security related work at my day job so couldn't update this PR. Apologies for that. @jpountz Can you please review this PR again ?

shahrs87 avatar Sep 23 '22 17:09 shahrs87

@jpountz Can you please review this patch again? Thank you

shahrs87 avatar Oct 18 '22 17:10 shahrs87

Apologies @shahrs87 , but after looking more at your changes, I'm getting worried that this change is harder than I had anticipated. I was optimistically hoping that never returning null Terms would automatically help save some NullPointerExceptions but looking at your change it looks like there are many places where we just need to replace these null checks with other checks if we want to keep the logic correct, so things wouldn't magically work on segments that have ghost fields unless we explicitly test ghost fields. So I would suggest closing this PR, apologies for the time you spent on it.

jpountz avatar Nov 10 '22 14:11 jpountz

Thank you for persisting so hard on this one @shahrs87 -- I'm sorry it looks like we should close it at this point, but your efforts / iterations were needed to see that we are mostly exchanging one if for another.

The whole null vs EMPTY is such a challenge (for Lucene anyways).

mikemccand avatar Nov 02 '23 11:11 mikemccand