
Decrease TieredMergePolicy's default number of segments per tier to 8.

Open jpountz opened this issue 6 months ago • 11 comments

TieredMergePolicy currently allows 10 segments per tier. With Lucene being increasingly deployed with separate indexing and search tiers that get updated via segment-based replication, I believe that it would make sense for Lucene to have more aggressive merging defaults, a price that is only paid once on the indexing tier, but that benefits all search nodes that serve queries for this index.

Note that this is still a somewhat conservative default, applications with low latency requirements and low update rates will likely want to go even further, with 4 segments per tier, or even 2.
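For reference, segmentsPerTier is an ordinary knob on TieredMergePolicy, so applications that want to merge more aggressively than the default already can. A minimal sketch (the classes and setters are real Lucene APIs; the value 4.0 is illustrative, per the paragraph above):

```java
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

public class AggressiveMergeConfig {
    // Illustrative: opt into more aggressive merging than the default.
    static IndexWriterConfig newConfig() {
        TieredMergePolicy mergePolicy = new TieredMergePolicy();
        // Fewer segments per tier => fewer, larger segments to search,
        // at the cost of more merging work at index time.
        mergePolicy.setSegmentsPerTier(4.0);
        return new IndexWriterConfig().setMergePolicy(mergePolicy);
    }
}
```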

BaseMergePolicyTestCase#testSimulateAppendOnly reports a write amplification increase from 3.4 to 3.8, while BaseMergePolicyTestCase#testSimulateUpdates reports a write amplification increase from 4.5 to 4.9. In exchange, the number of segments between the floor and max segment sizes decreases by about 20%.
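A crude geometric model shows why both numbers move in these directions. This is a sketch, not the actual BaseMergePolicyTestCase simulation: it assumes each merge combines segmentsPerTier segments, so a document is rewritten once per tier level it climbs through, and the 5 GB / 16 MB max/floor size ratio is an assumed example:

```java
// Toy model of tiered merging (a sketch, not Lucene's actual policy):
// tier levels ~= log_base(sizeRatio) where base = segmentsPerTier,
// and the steady-state segment count ~= segmentsPerTier * tierLevels.
public class TieredModel {
    static double tierLevels(double segmentsPerTier, double sizeRatio) {
        return Math.log(sizeRatio) / Math.log(segmentsPerTier);
    }

    static double segmentCount(double segmentsPerTier, double sizeRatio) {
        return segmentsPerTier * tierLevels(segmentsPerTier, sizeRatio);
    }

    public static void main(String[] args) {
        double sizeRatio = (5 * 1024.0) / 16.0; // assumed max/floor segment size ratio
        // Lowering segmentsPerTier adds merge rounds (more write amplification)...
        System.out.printf("merge rounds: %.2f -> %.2f%n",
                tierLevels(10, sizeRatio), tierLevels(8, sizeRatio));
        // ...but reduces the steady-state number of segments.
        System.out.printf("segments: %.1f -> %.1f%n",
                segmentCount(10, sizeRatio), segmentCount(8, sizeRatio));
    }
}
```

Under these assumptions the directions match the measured results above (write amplification up roughly 10%, segment count down), though the real simulation reports a larger ~20% segment-count reduction that this toy model does not capture.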

This should especially help queries that have a high per-segment overhead: PK lookups, point queries, multi-term queries and vector searches.

jpountz avatar Jun 20 '25 12:06 jpountz

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

github-actions[bot] avatar Jun 20 '25 12:06 github-actions[bot]

+1, this is a great idea -- more aggressive merging by default makes sense.

mikemccand avatar Jun 20 '25 12:06 mikemccand

Thanks @mikemccand ! I'll wait a few days before merging to give others a chance to take a look.

jpountz avatar Jun 20 '25 21:06 jpountz

My only concern is that it will have a really serious impact on the HNSW indexing/merging.

I know indexing performance isn't as fun/sexy as speeding up queries, but not everyone has thousands of QPS: for many users, queries could afford to be slower. Recall could probably be 1% worse in exchange for 10x faster indexing. I'm concerned the wrong tradeoffs are being made, and that a lot of users instead suffer from inefficient indexing. This change will make that pain worse.

I don't think it should block the change but we should keep an eye on this.

rmuir avatar Jun 24 '25 16:06 rmuir

Here's the trend of HNSW indexing performance in the nightly bench:

[Screenshot: nightly benchmark chart of HNSW indexing throughput over time]

You can see why I think wrong tradeoffs are being made. It is trending in the wrong direction and everyone is focused instead on making 1000QPS into 1200QPS or whatever.

rmuir avatar Jun 24 '25 17:06 rmuir

To be fair, this chart suggests a quite dramatic degradation over time, but these big drops are mostly due to the benchmark becoming harder by increasing the number of dimensions of vectors (annotation FK) or changing the model (annotation GT), and there are some speedups hidden among these big drops such as a merging speedup (annotation IE).

That said, I agree with your point, we want Lucene to work well on near-realtime use-cases that have high update rates as well, so we can't afford to make indexing and merging absurdly slow just for the sake of having slightly faster search.

jpountz avatar Jun 25 '25 12:06 jpountz

> but these big drops are mostly due to the benchmark becoming harder by increasing the number of dimensions of vectors (annotation FK) or changing the model (annotation GT), and there are some speedups hidden among these big drops such as a merging speedup (annotation IE).

I don't think that statement is accurate. The first big drop is annotation DW, LUCENE-10109: "Increase default beam width from 16 to 100."

This one explicitly makes indexing nearly 3x slower (from 409 GB/hr -> 140.7 GB/hr). The commit message states: "This new default tries to balance good search performance with indexing speed."

Is the balance a good one?

rmuir avatar Jun 25 '25 14:06 rmuir

I'll need @benwtrent or @msokolov to provide an educated answer to this question, but the issue mentions that the previous value also had worse recall.

jpountz avatar Jun 25 '25 20:06 jpountz

The right default value for beam width is hard to pin down; values reported in public vary wildly. Try asking Gemini what the right default is and it refuses to be pinned down! Unfortunately it depends: on the application setup (is there a fancy second-stage ranker, or do we rely on the vector scores directly for ranking?), and on the vector data set (its dimensionality, the size of the index). We're finding that "high quality" vectors may require lower values for these exploration parameters, and applications that gather lots of semantic hits and then re-rank them may be less sensitive to lower recall, so they can tolerate cheaper indexing settings. These sources recommend values like 16, 64, 128, and 256:

  • https://opensearch.org/blog/a-practical-guide-to-selecting-hnsw-hyperparameters/
  • https://www.pinecone.io/learn/series/faiss/hnsw/
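In Lucene, these construction parameters live on the vectors format. A hedged sketch of choosing a beam width explicitly rather than taking the default (the class and constructor are real Lucene APIs; the value 64 is illustrative, echoing the suggestion below):

```java
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat;

public class DetunedHnsw {
    // Lucene99HnswVectorsFormat(maxConn, beamWidth); the current defaults are 16 and 100.
    static KnnVectorsFormat detunedFormat() {
        // Lower beamWidth: cheaper graph construction, at some cost in recall.
        return new Lucene99HnswVectorsFormat(16, 64);
    }
}
```

Wiring this into an index requires a custom codec (e.g. via PerFieldKnnVectorsFormat), omitted here for brevity.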

Maybe a good rule of thumb would be to pick the place where the slope of these latency/recall curves is 45 degrees? But of course that will depend on the units of the chart.

My view is that this is something we could detune a bit, at least as a default. Maybe we could retreat to 64? But this issue is about merge policy segment counts, so we should probably open a new issue if we want to change the beam width default.

msokolov avatar Jun 25 '25 21:06 msokolov

Yeah I'm just commenting on overall indexing-vs-search speed perf. As I mentioned above, it isn't intended to block this issue at all: I just hope we are working to have good defaults that take indexing speed into consideration, as it is very slow for HNSW: I imagine this issue will make that even worse.

rmuir avatar Jun 25 '25 22:06 rmuir

> Increase default beam width from 16 to 100.

16 is a tiny ef_construction/beamWidth. I would expect graph quality to be pretty poor once the index reaches a non-trivial number of vectors.

I would argue the previous numbers (tiny beam width, tiny vectors) don't reflect a real use case.

> there are some speedups hidden among these big drops such as a merging speedup (annotation IE).

While we have improved merge times, there is likely more ground to be recovered in making overall HNSW indexing speed better. My gut still tells me we are doing something silly in our diversity checking, doing too much unnecessary work in building the HNSW graph...

benwtrent avatar Jun 26 '25 13:06 benwtrent