
A multi-tenant ConcurrentMergeScheduler

Open · jpountz opened this issue 1 year ago

Description

ConcurrentMergeScheduler computes max thread counts assuming a single IndexWriter in the JVM. But it's common with Solr or Elasticsearch to have tens of active IndexWriters running in the same JVM. Could we make ConcurrentMergeScheduler more multi-tenant, or introduce a new multi-tenant merge scheduler that accounts for the fact that the JVM is shared across multiple indexes?

jpountz avatar Oct 10 '24 13:10 jpountz

My teammates @lukewilner, @atharvkashyap, and @N624-debu and I are students at Carnegie Mellon University, and we'll be working on this issue as part of a mentored summer course focused on collaboration in open-source software. Our mentors are @mikemccand and @vigyasharma. We'll be drafting a plan and submitting PRs over the next few weeks. Looking forward to collaborating!

Our understanding of the problem:

Every IndexWriter within a running JVM instantiates one ConcurrentMergeScheduler object that, based on the selected MergePolicy, uses available resources to merge segments. The problem is that when there are multiple IndexWriter objects, the corresponding ConcurrentMergeScheduler objects all blindly size themselves against the JVM's available compute resources, without regard to each other. This can cause excessive resource usage (RAM, CPU cores, and I/O), well beyond what the user has allocated for merging.

There should be a single MultiTenantConcurrentMergeScheduler object that coordinates how all ConcurrentMergeScheduler objects operate and divides resources wisely across them. It should handle addition and removal of ConcurrentMergeScheduler objects on the fly, ideally without needing to restart all ConcurrentMergeScheduler objects every time their number changes.

Thinking out loud:

Maybe we can call setMaxMergesAndThreads from inside the singleton MultiTenantConcurrentMergeScheduler object while merges are happening across all ConcurrentMergeScheduler objects. This update can happen whenever a ConcurrentMergeScheduler is added or removed. It should wisely divide the allocated resources across all active ConcurrentMergeScheduler objects, giving more merge threads to needy ConcurrentMergeScheduler objects and fewer (or none at all) to idle ones. We have to come up with an efficient way to decide how to distribute threads based on (1) the continuously changing needs of each ConcurrentMergeScheduler object and (2) the number of active ConcurrentMergeScheduler objects.
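
To make this concrete, here is a rough, hypothetical sketch in plain JDK Java (not actual Lucene API). The Tenant interface stands in for ConcurrentMergeScheduler#setMaxMergesAndThreads, the global budget numbers are arbitrary placeholders, and the equal split is the simplest possible policy, just to show the register/rebalance shape:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class MultiTenantCmsManager {
    // Stand-in for ConcurrentMergeScheduler#setMaxMergesAndThreads.
    public interface Tenant {
        void setMaxMergesAndThreads(int maxMergeCount, int maxThreadCount);
    }

    // Singleton with arbitrary placeholder budgets (8 threads, 16 merges).
    private static final MultiTenantCmsManager INSTANCE = new MultiTenantCmsManager(8, 16);

    public static MultiTenantCmsManager getInstance() { return INSTANCE; }

    private final int globalMaxThreads;
    private final int globalMaxMerges;
    private final Map<String, Tenant> tenants = new ConcurrentHashMap<>();

    MultiTenantCmsManager(int globalMaxThreads, int globalMaxMerges) {
        this.globalMaxThreads = globalMaxThreads;
        this.globalMaxMerges = globalMaxMerges;
    }

    public synchronized void register(String id, Tenant cms) {
        tenants.put(id, cms);
        rebalance();
    }

    public synchronized void unregister(String id) {
        tenants.remove(id);
        rebalance();
    }

    // Naive equal split; a real policy would weight by per-writer merge demand.
    private void rebalance() {
        int n = tenants.size();
        if (n == 0) return;
        int threadsEach = Math.max(1, globalMaxThreads / n);
        int mergesEach = Math.max(threadsEach, globalMaxMerges / n);
        for (Tenant t : tenants.values()) {
            t.setMaxMergesAndThreads(mergesEach, threadsEach);
        }
    }
}
```

Every register/unregister re-pushes budgets to all tenants, so no scheduler needs restarting when the tenant count changes.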

yaser-aj avatar Jun 16 '25 22:06 yaser-aj

Thanks @yaser-aj , happy to see progress on this project.

You're on the right track with understanding the problem. We want CMS to be aware of merge demands across IndexWriters, so that it can allocate/throttle merge threads more optimally.

There should be a single MultiTenantConcurrentMergeScheduler object that coordinates how all ConcurrentMergeScheduler objects operate and divides resources wisely across them. It should handle addition and removal of ConcurrentMergeScheduler objects on the fly, ideally without needing to restart all ConcurrentMergeScheduler objects every time their number changes.

This sounds like a CMS "Manager" that manages multiple other merge schedulers. It might be more effective to change the merge scheduler itself to be multi-tenant. This MultiTenantConcurrentMergeScheduler will consider all the active merges across all index writers whenever merges need to be throttled (see ConcurrentMergeScheduler#maybeStall, ConcurrentMergeScheduler#updateIOThrottle, ConcurrentMergeScheduler#updateMergeThreads etc). Internally, it would maintain some mapping of index writers -> merges to cleanly handle close for a single index writer without affecting merges for other writers. This would give you more direct control over scheduling merges from across index writers.
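
The index writers -> merges mapping might be as simple as the following hypothetical sketch (generic over writer and merge types, since the real Lucene types aren't needed to show the shape):

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Tracks which pending/active merges belong to which writer, so that
// closing one writer only drains its own merges and leaves the rest alone.
class MergeRegistry<W, M> {
    private final Map<W, Set<M>> merges = new ConcurrentHashMap<>();

    public void onMergeScheduled(W writer, M merge) {
        merges.computeIfAbsent(writer, w -> ConcurrentHashMap.newKeySet()).add(merge);
    }

    public void onMergeFinished(W writer, M merge) {
        Set<M> s = merges.get(writer);
        if (s != null) s.remove(merge);
    }

    // On a writer's close(): return (and forget) only this writer's merges,
    // leaving other writers' merges untouched.
    public Set<M> drain(W writer) {
        Set<M> s = merges.remove(writer);
        return s != null ? s : Set.of();
    }
}
```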

We could then maintain a singleton for this "multiTenantCMS". Index writers would acquire an instance of that singleton and register themselves with it.

vigyasharma avatar Jun 17 '25 23:06 vigyasharma

We're currently exploring two possible designs for implementing multi-tenant coordination of merge threads across multiple IndexWriter instances in the same JVM. One option is to introduce a singleton MultiTenantCMSManager that centrally tracks all active ConcurrentMergeScheduler instances and dynamically assigns merge thread budgets based on activity. This approach allows each CMS to remain mostly unchanged, delegating coordination to the manager. The second option is to re-architect the ConcurrentMergeScheduler itself to be fully multi-tenant — managing all IndexWriters and merges internally. We'd appreciate guidance on which direction better aligns with Lucene's architecture and long-term maintainability goals. In the meantime, we will be starting on the first option…

lukewilner avatar Jun 29 '25 16:06 lukewilner

I think both of them are viable options with pros and cons.

With the MultiTenantCMSManager approach, you can probably avoid making deep changes to existing CMS logic. This leaves users with the flexibility of keeping a dedicated CMS for their high-activity index writers. CMS already has hooks to update merge threads and IO throttling, which can be used here. One crucial piece is that the manager needs to be aware of active and pending merges across all registered CMS objects. This means you need some mechanism to update the manager whenever pending/active merges change.

A downside with option 1 could be limited control over merge scheduling. The most you can do is reduce merge threads to 1. What happens on machines with few cores, say 4 cores but 8 index writers? Although, this is no worse than the setup today, where the application has implicitly allocated more threads than cores to merging (because each CMS is unaware of other merging activity).

In the meantime, we will be starting on the first option…

+1, I would encourage you to do the same. We can always pivot to option 2 later if needed. Feel free to share an early draft PR and we can brainstorm on code structure. I'd love to see how you're thinking of modeling the manager–CMS interactions.

vigyasharma avatar Jun 29 '25 21:06 vigyasharma

A downside with option 1 could be limited control on merge scheduling.

Maybe with option 1 we could add a package private API that allows the manager to drop threads to 0 in the "8 indexes on 4 cores" starvation case?

+1 to start with option 1 (separate CMS manager), and iterate/pivot later if needed.

Ideally, in the happy path (merge demand is less than the allowed maximum concurrent merges), all writers would work as they do today, where every merge runs. The unhappy path (more merge demand than allowed capacity) is where it gets more interesting how to share resources... CMS today allows N merges to be requested, with up to M (M <= N) running at once, and has the ability to pause a merge thread when we are between M and N merges. Ideally, the manager can just build on top of CMS, e.g. simply dividing the global M and N it was configured with across the per-CMS M and N for each CMS it is managing, changing the split as merge demand comes and goes on the writers.
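
That division could start as simply as a demand-proportional split. A hypothetical sketch (plain JDK; the floor of one thread per active tenant is an arbitrary choice, and note it can make the sum slightly exceed the global budget when there are many small tenants):

```java
import java.util.Arrays;

class BudgetDivider {
    // Divides a global thread budget (the "M" above) across tenants in
    // proportion to each tenant's pending-merge demand. Idle tenants get 0;
    // any tenant with demand gets at least 1 thread.
    public static int[] divide(int globalM, int[] pendingMerges) {
        int total = Arrays.stream(pendingMerges).sum();
        int[] out = new int[pendingMerges.length];
        if (total == 0) return out;  // nobody is merging; allocate nothing
        for (int i = 0; i < pendingMerges.length; i++) {
            if (pendingMerges[i] > 0) {
                out[i] = Math.max(1, globalM * pendingMerges[i] / total);
            }
        }
        return out;
    }
}
```

The manager would re-run this division whenever a CMS reports a change in its pending merges, then push the new per-CMS M (and analogously N) via setMaxMergesAndThreads.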

mikemccand avatar Jun 30 '25 15:06 mikemccand

It looks like both options 1 and 2 try to dynamically update the number of merge threads based on multi-tenancy: option 1 does it from the outside via a manager, and option 2 from inside CMS. These two options sound quite complicated to me.

I'd suggest creating a fresh new MergeScheduler that would schedule merges in a shared thread pool for all IndexWriters? Merging activity would be controlled by the size of the thread pool (which is akin to CMS's maxThreadCount) and the max size of the queue (akin to CMS's maxMergeCount-maxThreadCount).

It's simpler to me because the total merge activity is naturally capped globally, instead of requiring N-N feedback loops across ConcurrentMergeScheduler instances. It would also probably scale better with the number of IndexWriters?
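
As a rough illustration of this shape (hypothetical, plain JDK, with Lucene's merge type reduced to a Runnable): a fixed ThreadPoolExecutor whose pool size plays the role of maxThreadCount and whose bounded queue plays the role of maxMergeCount - maxThreadCount. The caller-runs rejection policy makes a saturated pool execute the merge on the submitting (indexing) thread, which is exactly the natural backpressure described above:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class SharedPoolMergeScheduler {
    private final ThreadPoolExecutor pool;

    public SharedPoolMergeScheduler(int maxThreadCount, int maxMergeCount) {
        pool = new ThreadPoolExecutor(
            maxThreadCount, maxThreadCount,
            0L, TimeUnit.MILLISECONDS,
            // Bounded queue: the global cap on queued-but-not-running merges.
            new LinkedBlockingQueue<>(Math.max(1, maxMergeCount - maxThreadCount)),
            // If the queue is full, run the merge in the submitting (indexing)
            // thread, which naturally pushes back on indexing.
            new ThreadPoolExecutor.CallerRunsPolicy());
    }

    // All writers submit their merges here, so total merge activity is
    // capped globally rather than per writer.
    public void merge(Runnable oneMerge) {
        pool.execute(oneMerge);
    }

    public void close() {
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```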

jpountz avatar Jun 30 '25 16:06 jpountz

I'd suggest creating a fresh new MergeScheduler that would schedule merges in a shared thread pool for all IndexWriters? Merging activity would be controlled by the size of the thread pool (which is akin to CMS's maxThreadCount) and the max size of the queue (akin to CMS's maxMergeCount-maxThreadCount).

So the multi-tenant CMS will schedule/throttle merges across all registered writers, agnostic of which writer they came from? I like the simplicity of this approach. We will still need to maintain a merge -> writer mapping, and cleanly handle index writer close(), but that is not too complex.

Would we run the risk of accidentally starving merges from a writer because smaller merges from other writers got preferred? Maybe it's not an issue if merges are small enough? Also, IIUC, there is back pressure on indexing if there are too many pending merges. In this setup, would we uniformly throttle all the writers?

vigyasharma avatar Jul 01 '25 07:07 vigyasharma

I have similar questions, but I think we can figure them out? We have options like running merges in the indexing thread (a la SerialMergeScheduler) when that would help.

IMO this approach potentially gives other benefits. For example, imagine that the multi-tenant merge scheduler could be provided with the same thread pool that is used for indexing. This would give better control over the overall resources used by indexing, vs. today where applications like Solr/Elasticsearch/OpenSearch use a fixed thread pool for indexing but then CMS can create many threads on top of that. Wouldn't it be much better if one could easily use the same thread pool across indexing, flushing, and merging?

jpountz avatar Jul 01 '25 09:07 jpountz

I have similar questions, but I think we can figure them out?

I see, then this is similar to what I was trying to suggest here. It makes sense to first write something simple that does writer agnostic throttling, and later add support for allocations across writers.

vigyasharma avatar Jul 03 '25 00:07 vigyasharma

Yes, this is the same idea indeed! I had thumbs-up'ed it. :) But then the discussion went back to iterating on CMS, which feels both more complicated (N-N feedback loops between the various CMS instances) and less promising (still cannot use the same thread pool for indexing and merging) to me.

FWIW I think we'll need to evaluate whether we actually need throttling to be writer-aware; I anticipate writer-agnostic throttling to go a long way, only doing simple things such as:

  • Using the same thread pool for indexing and merging. This way if the thread pool gets full of merges, this will naturally push back on indexing.
  • If the thread pool cannot accept a new merge, run the smallest queued merge in the current (indexing) thread like SerialMergeScheduler (since merge schedulers should not ignore merges).
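
The second bullet could look something like the following hypothetical sketch, where merge "size" is just a long and everything is synchronous for clarity. Merges queue up ordered by size; once the queue is over capacity, the submitting (indexing) thread runs the smallest queued merge itself, like SerialMergeScheduler would, so no merge is ever ignored:

```java
import java.util.PriorityQueue;

class SmallestFirstFallback {
    static final class PendingMerge implements Comparable<PendingMerge> {
        final long sizeBytes;
        final Runnable work;
        PendingMerge(long sizeBytes, Runnable work) {
            this.sizeBytes = sizeBytes;
            this.work = work;
        }
        public int compareTo(PendingMerge o) {
            return Long.compare(sizeBytes, o.sizeBytes);  // smallest first
        }
    }

    private final int maxQueued;
    private final PriorityQueue<PendingMerge> queue = new PriorityQueue<>();

    public SmallestFirstFallback(int maxQueued) { this.maxQueued = maxQueued; }

    // Returns true if the merge was queued for background execution; false
    // means the caller had to run the smallest queued merge inline,
    // back-pressuring indexing.
    public synchronized boolean submit(long sizeBytes, Runnable work) {
        queue.add(new PendingMerge(sizeBytes, work));
        if (queue.size() <= maxQueued) return true;
        queue.poll().work.run();  // smallest merge runs in the current thread
        return false;
    }
}
```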

jpountz avatar Jul 03 '25 11:07 jpountz

Using the same thread pool for indexing and merging. This way if the thread pool gets full of merges, this will naturally push back on indexing.

+1 to this - we have a problem today where force-merge-deletes runs way too long, blocking additional merges, but indexing continues, and we see deletions and overall index size continually growing; it's unhealthy, and maybe back-pressure on indexing would help. OTOH we may just be asking force-merge to do too much ...

msokolov avatar Jul 03 '25 12:07 msokolov

OK I like these tradeoffs -- +1 to a new MergeScheduler with a fixed thread pool, and starting simple (no intelligence about being "fair" when writers are asking for too much merging overall). And applying backpressure to indexing threads (that keep producing more work for merging (the newly flushed segments)) when merging cannot keep up.

There is also the thread pool for concurrent HNSW merging (within a single merge) -- apps could also use the same thread pool (as indexing and merging) for that, maybe.

mikemccand avatar Jul 05 '25 18:07 mikemccand

Using the same thread pool for indexing and merging. This way if the thread pool gets full of merges, this will naturally push back on indexing.

+1 to this - we have a problem today where force-merge-deletes runs way too long, blocking additional merges, but indexing continues, and we see deletions and overall index size continually growing; it's unhealthy, and maybe back-pressure on indexing would help. OTOH we may just be asking force-merge to do too much ...

Well this is sort of a self-inflicted wound (at Amazon product search) :)

CMS will already apply backpressure (stall indexing threads that keep writing new segments) when there is a backlog of merges. It has setMaxMergesAndThreads for this, with two settings. The first setting (maxThreadCount) says how many merge threads can run at once; the second setting (maxMergeCount) limits the max count of merges that need execution (maxMergeCount >= maxThreadCount). As @jpountz describes, it's like a fixed-size queue of size (maxMergeCount - maxThreadCount), and any attempted merge beyond that will block ongoing indexing until merges catch up.
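
A toy model of those two settings (hypothetical, not Lucene code) makes the stall behavior concrete: up to maxThreadCount merges run concurrently, up to maxMergeCount may be outstanding overall, and any request beyond that must stall the indexing thread until merges catch up:

```java
class MergeBackpressureModel {
    private final int maxMergeCount;
    private final int maxThreadCount;
    private int running;  // merges currently executing
    private int queued;   // merges waiting for a free thread

    public MergeBackpressureModel(int maxMergeCount, int maxThreadCount) {
        if (maxMergeCount < maxThreadCount) throw new IllegalArgumentException();
        this.maxMergeCount = maxMergeCount;
        this.maxThreadCount = maxThreadCount;
    }

    /** Returns true if the new merge was admitted; false means "stall indexing". */
    public synchronized boolean tryAdmit() {
        if (running + queued >= maxMergeCount) return false;  // backlog full
        if (running < maxThreadCount) running++; else queued++;
        return true;
    }

    public synchronized void onMergeFinished() {
        running--;
        if (queued > 0) { queued--; running++; }  // promote a waiting merge
    }
}
```

With maxMergeCount=5 and maxThreadCount=2, the sixth concurrent merge request is refused (indexing would stall) until some merge finishes; setting both to 100 effectively disables this backpressure, which is the trap described above.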

"We" (Amazon product search) set maxMergeCount and maxThreadCount way too high (100 I think!), allowing basically unbounded merge backlog and no indexing backpressure. So let this be a warning to Lucene users! Don't blindly undo CMS's limits ...

mikemccand avatar Jul 05 '25 18:07 mikemccand