lucene
lucene copied to clipboard
Support deletions in rearrange (#11814)
Description
Changes made to support deletes in IndexRearranger:
- Add
isDeleted(LeafReader reader, int idx)in document selectors, which returns true if the document found at indexidxin the leaf reader was marked for deletion in the original index. - Record all docs in the
DocumentSelector(not just live ones). - When rearranging, in
addOneSegment, add all docs that belong in the segment and then delete the ones that were originally marked for deletion.
Testing
The new test class is copied from TestIndexRearranger with modifications for deleting documents in the original index and testing that they are still present and deleted in the rearranged index. In a subsequent revision, this file would be merged into TestIndexRearranger.
#11814
Thank you for the review @zhaih ! Your feedback is much appreciated. I left a detailed response to your comment on the code. Let me know what you think.
I think the overall process in my mind would be:
- Identify all deletes and save them
- Create a document selector which ignores all deletes in the original index
- Rearrange
- Apply deletes
We can create a separate tool which do the step 1 and 4, or, which I think probably is more elegant, we can pass in another single DocumentSelector, which basically select which documents should be deleted afterwards. And we apply the deletions to the whole index after rearrange?
we can pass in another single DocumentSelector, which basically select which documents should be deleted afterwards. And we apply the deletions to the whole index after rearrange?
I like the idea. I'll build a new revision following that line of thinking.
The second revision comes with a lot more changes to support selecting deletes in the same fashion as segment content. I’ve reworked the tests to be more thorough, especially about deletes. I think the new tests cover all the cases that the old tests did, so I replaced the old ones.
This change is technically not backwards compatible. Not just because of changes to the rearrange API, but also because now we no longer make the deletes disappear from the rearranged index. They become live instead.
Once this PR goes through, I’ll start working on a change for luceneutil to use the new functionality.
This change is technically not backwards compatible. Not just because of changes to the rearrange API, but also because now we no longer make the deletes disappear from the rearranged index. They become live instead.
I think this is OK -- this is a new tool, not heavily used. And keeping the deletes live is the purpose of this change :)
And IndexRearranger is already marked with @lucene.experimental, warning users that we might make non-backwards-compatible changes.
What happens if the delete selector deletes 100% of the documents in a segment?
IndexWriter would normally drop such segments ... does it do so in this case too? Indeed, testDeleteEverything seems to confirm it does, great!
Maybe explain this caveat in the javadocs? I.e. one cannot produce a segment geometry that includes segments with 100% deleted documents.
Thank you for the review @mikemccand, I’ve incorporated your feedback into the new commit.
@zhaih @mikemccand - is there anything more you'd like to see in this PR?
Thanks for the reminder/ping @stefanvodita! The latest iteration looks great to me -- I'll wait for a few days to see if @zhaih has more feedback and then merge. It's exciting that Lucene can so accurately reproduce an index with precisely the same assignment of documents to segments, and now also with deletions. Next up might be with doc-values updates too! Maybe open a follow-on issue for this?
Thank you for the ideas @zhaih , I’ve incorporated all of them. Your suggestion of using ByteRef in the keyset led to me finding a bug in the tests. The index creation methods didn’t delete any docs because I had stored the "delete" marker in a BinaryDocValuesField. That’s fixed now and the tests are more thorough with their checks.
@mikemccand , I don’t know how doc-values updates work. Why would they not be covered by this solution?