lucene icon indicating copy to clipboard operation
lucene copied to clipboard

Support deletions in rearrange (#11814)

Open stefanvodita opened this issue 3 years ago • 4 comments

Description

Changes made to support deletes in IndexRearranger:

  1. Add isDeleted(LeafReader reader, int idx) in document selectors, which returns true if the document found at index idx in the leaf reader was marked for deletion in the original index.
  2. Record all docs in the DocumentSelector (not just live ones).
  3. When rearranging, in addOneSegment, add all docs that belong in the segment and then delete the ones that were originally marked for deletion.

Testing

The new test class is copied from TestIndexRearranger with modifications for deleting documents in the original index and testing that they are still present and deleted in the rearranged index. In a subsequent revision, this file would be merged into TestIndexRearranger.

#11814

stefanvodita avatar Sep 24 '22 14:09 stefanvodita

Thank you for the review @zhaih ! Your feedback is much appreciated. I left a detailed response to your comment on the code. Let me know what you think.

stefanvodita avatar Sep 26 '22 20:09 stefanvodita

I think the overall process in my mind would be:

  1. Identify all deletes and save them
  2. Create a document selector which ignores all deletes in the original index
  3. Rearrange
  4. Apply deletes

We can create a separate tool which do the step 1 and 4, or, which I think probably is more elegant, we can pass in another single DocumentSelector, which basically select which documents should be deleted afterwards. And we apply the deletions to the whole index after rearrange?

zhaih avatar Sep 26 '22 21:09 zhaih

we can pass in another single DocumentSelector, which basically select which documents should be deleted afterwards. And we apply the deletions to the whole index after rearrange?

I like the idea. I'll build a new revision following that line of thinking.

stefanvodita avatar Sep 26 '22 21:09 stefanvodita

The second revision comes with a lot more changes to support selecting deletes in the same fashion as segment content. I’ve reworked the tests to be more thorough, especially about deletes. I think the new tests cover all the cases that the old tests did, so I replaced the old ones.

This change is technically not backwards compatible. Not just because of changes to the rearrange API, but also because now we no longer make the deletes disappear from the rearranged index. They become live instead.

Once this PR goes through, I’ll start working on a change for luceneutil to use the new functionality.

stefanvodita avatar Oct 14 '22 06:10 stefanvodita

This change is technically not backwards compatible. Not just because of changes to the rearrange API, but also because now we no longer make the deletes disappear from the rearranged index. They become live instead.

I think this is OK -- this is a new tool, not heavily used. And keeping the deletes live is the purpose of this change :)

And IndexRearranger is already marked with @lucene.experimental, warning users that we might make non-backwards-compatible changes.

mikemccand avatar Oct 21 '22 10:10 mikemccand

What happens if the delete selector deletes 100% of the documents in a segment?

IndexWriter would normally drop such segments ... does it do so in this case too? Indeed, testDeleteEverything seems to confirm it does, great!

Maybe explain this caveat in the javadocs? I.e. one cannot produce a segment geometry that includes segments with 100% deleted documents.

mikemccand avatar Oct 21 '22 10:10 mikemccand

Thank you for the review @mikemccand, I’ve incorporated your feedback into the new commit.

stefanvodita avatar Oct 28 '22 13:10 stefanvodita

@zhaih @mikemccand - is there anything more you'd like to see in this PR?

stefanvodita avatar Nov 17 '22 09:11 stefanvodita

Thanks for the reminder/ping @stefanvodita! The latest iteration looks great to me -- I'll wait for a few days to see if @zhaih has more feedback and then merge. It's exciting that Lucene can so accurately reproduce an index with precisely the same assignment of documents to segments, and now also with deletions. Next up might be with doc-values updates too! Maybe open a follow-on issue for this?

mikemccand avatar Nov 17 '22 10:11 mikemccand

Thank you for the ideas @zhaih , I’ve incorporated all of them. Your suggestion of using ByteRef in the keyset led to me finding a bug in the tests. The index creation methods didn’t delete any docs because I had stored the "delete" marker in a BinaryDocValuesField. That’s fixed now and the tests are more thorough with their checks.

@mikemccand , I don’t know how doc-values updates work. Why would they not be covered by this solution?

stefanvodita avatar Nov 20 '22 23:11 stefanvodita