IndexWriter forceMergeDeletes should return its MergeSpec
IndexWriter provides a forceMergeDeletes API which triggers force merging of all segments that have deleted documents, allowing users to expunge deletes up to a configurable delete percentage (set via setForceMergeDeletesPctAllowed()).
The API provides a blocking variant, which waits until the merges complete, and a non-blocking variant, that starts the merges in background threads and returns. For the non-blocking version, it would be nice to have the ability to monitor if merges have completed. Turns out, all we need for this, is to return the MergeSpecification that defines the merges triggered by the API.
Indeed, the blocking variant of this API itself uses this spec to wait until all merges have completed. This is what would happen if you were using the ConcurrentMergeScheduler which will start merges in background, but invoked the API with doWait=true. However, there are benefits to being able to monitor from outside the API, like waiting only unto a max timeout, or reporting metrics on the progress of these merges.
The change here is to change this API return type from void to MergePolicy.MergeSpecification and return the spec object.
This is a spin-off from #14431 which has more context on the use case.
Based on the detailed requirement, it seems like this is very first-contributor-friendly. Could I take this @vigyasharma ? My approach is:
- Return a
Optional<MergeSpecification>since the spec could be null - Update the method document
- Add test cases cover both blocking (optional object returned is empty) and non-blocking (we expect the spec in this test) cases
The discussion on #14431 converged on an alternative (thanks Adrien) without a need for this work.
However, I'm not against this simple change contribution. It could help applications track and emit monitoring metrics on merge progress. @jpountz / @mikemccand – Do you have any reservations with making this return type change in the API?
The idea makes sense to me, it's interesting to gain additional visibility into what merging decisions the merge policy actually makes (and specifically whether it thinks that some merging is required or not).
I'm not too happy about exposing MergeSpecification and OneMerge in public non-expert APIs, I wonder if we can clean them up (e.g. so that they are no longer mutable) or if we should make IndexWriter#forceMergeDeletes exposed a trimmed down version of these classes that just exposes the list of merged SegmentCommitInfos.
I'd like to work on this and wanted to check my approach first.
Looking at @jpountz's comment about the visibility (and mutability) issue - what if we add a simple MergeObserver wrapper class that holds the MergeSpecification internally but only exposes read-only methods? Things like checking merge count, waiting for completion, etc.
It would be in the same package, allowing it to call the existing package-private methods. forceMergeDeletes() would then wrap and return it (in place of void). The spec reference stays private to prevent any modifications.
Does it make sense?
Hi @vigyasharma,
I've implemented this in PR #15378.
I added MergePolicy.MergeObserver with methods for checking merge status, waiting synchronously or asynchronously, and accessing individual merges. The implementation is thread-safe, handles edge cases, and is fully backward compatible.
I included also tests that have been verified across thousands of iterations. Happy to address any feedback.