
Integrate a JVector codec for KNN searches

Open RKSPD opened this issue 7 months ago • 17 comments

JVector (https://github.com/datastax/jvector) is "a graph-based index that builds on the HNSW and DiskANN designs with composable extensions."

JVector features a DiskANN implementation and allows for multi-phase vector matching and disk-based rerank. The need for JVector as a search engine is neatly summarized in this OpenSearch issue.

I'm proposing a new codec in Lucene: a standalone KnnVectorsFormat based on OpenSearch's implementation of JVector. This implementation would integrate with Lucene's existing vector APIs and codec SPI.

Opening this issue to get feedback on the need, implementation, and long-term considerations for this codec.

RKSPD avatar May 16 '25 22:05 RKSPD

+1, it'd be awesome to refactor OpenSearch's jvector integration down to Lucene as an alternative Codec (KnnVectorsFormat) component in sandbox.

https://github.com/apache/lucene/issues/12615 has lots of good discussion about jvector/DiskANN as well.

https://github.com/apache/lucene/pull/14009 might also be important since it adds multi-phased KNN query support.

mikemccand avatar May 20 '25 12:05 mikemccand

sandbox

Nit: it's fine for any codec to live in lucene/codecs in my opinion, the bar isn't much higher than sandbox, and this allows us to put them into codec randomization for instance.

jpountz avatar May 21 '25 12:05 jpountz

This seems worth exploring. One question I have, after reading the linked discussion thread about integrating JVector as an OpenSearch plugin, is about this claim ("JVector is a threadsafe index that supports concurrent modification and inserts with near perfect scalability as you add cores, Lucene is not threadsafe"). Could you elaborate why you think Lucene is not threadsafe? Will this mismatch present some obstacle to integrating JVector?

msokolov avatar May 22 '25 19:05 msokolov

The biggest roadblock to integrating properly with Lucene is that jVector relies throughout on a RandomWriter that can seek backwards. This is not compatible with Lucene's append-only interfaces. As a result, we are now adding support for an append-only writer within jVector that is compatible with Lucene. Once it's there, I think the integration with Lucene will be much cleaner and we won't have to carry a lot of the complexity that is currently in the code of the OpenSearch plugin.
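To make the mismatch concrete, here is a minimal, hypothetical sketch (these are not the actual Lucene IndexOutput or jVector writer APIs) contrasting the two write models, plus the usual append-only workaround: buffer a section first, then write a length that would otherwise have to be patched in with a backwards seek.

```java
// Illustrative only: hypothetical minimal interfaces, not real Lucene/jVector APIs.
import java.io.ByteArrayOutputStream;

public class WriterModels {

    /** Append-only, in the spirit of Lucene's output abstractions: you can ask
     *  where you are, but never seek backwards to overwrite. */
    interface AppendOnlyWriter {
        void writeByte(byte b);
        long getFilePointer(); // monotonically increasing
    }

    /** Random-access: offsets can be revisited, e.g. to patch a header after
     *  the body's size is known. This is what an append-only format forbids. */
    interface RandomAccessWriter extends AppendOnlyWriter {
        void seek(long position);
    }

    /** A trivial append-only implementation backed by a growable buffer. */
    static class BufferedAppendWriter implements AppendOnlyWriter {
        private final ByteArrayOutputStream out = new ByteArrayOutputStream();
        public void writeByte(byte b) { out.write(b); }
        public long getFilePointer() { return out.size(); }
        public byte[] toBytes() { return out.toByteArray(); }
    }

    /** Append-only workaround for "write header, patch it later": buffer the
     *  body first, then emit header followed by body in one forward pass. */
    public static byte[] writeLengthPrefixed(byte[] body) {
        BufferedAppendWriter w = new BufferedAppendWriter();
        w.writeByte((byte) body.length); // header written up front, never revisited
        for (byte b : body) w.writeByte(b);
        return w.toBytes();
    }

    public static void main(String[] args) {
        byte[] framed = writeLengthPrefixed(new byte[] {1, 2, 3});
        System.out.println("frame length = " + framed.length); // 1 header byte + 3 body bytes
    }
}
```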

For reference: https://github.com/datastax/jvector/pull/475

Could you elaborate why you think Lucene is not threadsafe? Will this mismatch present some obstacle to integrating JVector?

I'm not sure about the context in which the comment was made, but I think it's referring to jVector's reliance on various ForkJoinPools to build a single segment of an index (not just during merge, but all the time). I'm not sure what assumptions it was making about Lucene, perhaps about the per-thread nature of writing new Lucene segments. I noticed that the Lucene99HnswVectorsWriter implementation takes a TaskExecutor mergeExec to facilitate faster merges, but I haven't seen anything similar to speed up the building of a single segment when reading from a flat vectors format source.

EDIT: Small update, the PR is merged so now the road to smoother integration is open :) https://github.com/datastax/jvector/pull/475

sam-herman avatar Jun 12 '25 17:06 sam-herman

I have a working JVector implementation for Lucene, ported over from the OpenSearch implementation, and I have benchmarks for that version. There are issues, like a pre-cached exact vector mismatch between runs, that have been addressed in a newer update. I'm working on incorporating the new changes, but the codebase is moving very quickly. What features are ready to be incorporated into Lucene JVector on opensearch-jvector's side, and, aside from the feature requests on OpenSearch, is it safe to consider this latest version a checkpoint for implementing the JVector codec in Lucene?

UPDATE:

The Lucene JVector codec is now up to date with the June 13 build of the OpenSearch-JVector codec. Currently benchmarking.

RKSPD avatar Jun 16 '25 20:06 RKSPD

Thanks for opening this and for all the work going into expanding vector search in Lucene.

That said, I’m a bit concerned about the growing number of options being proposed that seem to have similar trade-offs, just implemented differently. We’ve added Faiss recently, and now there’s talk of bringing in another graph-based method that doesn’t really change the fundamental limitations we already have with the HNSW codec.

Things like Vamana or product quantization can definitely be useful, but from what I can tell, they offer similar properties to what we already have with HNSW + BBQ. It feels like we’re piling on more alternatives without really addressing the pain points some users face with graph-based indexing in general.

Also, calling this a “disk-based” solution seems a bit misleading if the graph still has to be built fully in memory. That’s often the core problem people are trying to get around.

Instead, I think it might be more valuable to dig into the specific improvements JVector brings and see if any of those could make our current HNSW implementation better. I thought that was already explored in #12615, and as I recall, there wasn’t a strong enough differentiator to justify pulling in the full JVector approach. Revisiting that might be a better path forward.

I totally get that OpenSearch wants to contribute their work upstream, and that’s great to see. It just feels important that we avoid overwhelming users with too many similar graph-based options, especially when they don’t clearly solve new problems.

Appreciate the ongoing discussion and all the effort here!

jimczi avatar Jun 19 '25 11:06 jimczi

I have a working JVector implementation for Lucene, ported over from the OpenSearch implementation, and I have benchmarks for that version. There are issues, like a pre-cached exact vector mismatch between runs, that have been addressed in a newer update. I'm working on incorporating the new changes, but the codebase is moving very quickly. What features are ready to be incorporated into Lucene JVector on opensearch-jvector's side, and, aside from the feature requests on OpenSearch, is it safe to consider this latest version a checkpoint for implementing the JVector codec in Lucene?

UPDATE:

The Lucene JVector codec is now up to date with the June 13 build of the OpenSearch-JVector codec. Currently benchmarking.

@RKSPD thank you very much for taking the initiative on this one! I believe it should be a lot more stable now, especially since the latest plugin release fixed some of the core integration issues mentioned earlier (sequential vs. random writes). With that being said, it is a fair question to ask about the roadmap and timeline of changes. There is a collection of issues right now here, but I agree that there is room for improvement around the predictability of the roadmap. I'll try to aggregate those under a meta issue as a first step to help with that. Hope that answers your question and helps unblock you on this initiative as much as possible.

sam-herman avatar Jun 21 '25 19:06 sam-herman

Things like Vamana or product quantization can definitely be useful, but from what I can tell, they offer similar properties to what we already have with HNSW + BBQ.

jVector is actually a combination of a Vamana-style graph and an HNSW-style hierarchy in the same graph, with the option of using PQ or BQ for construction/search. In terms of properties, there are some unique functionalities, to name a few:

  1. NVQ (Non-Linear Vector Quantization) - an on-disk vector format which we will add to the plugin pretty soon as well. It exists in jVector, but we haven't integrated it into the plugin just yet.
  2. Inline vectors - support for both separate (NVQ or FP) and inline storage formats for vectors, which improve IO patterns.
  3. Concurrency - jVector allows for concurrent builds of the graph index.

Note: there are upcoming additional differentiators that will be released soon as well; I can update once those are released.

Also, calling this a “disk-based” solution seems a bit misleading if the graph still has to be built fully in memory. That’s often the core problem people are trying to get around.

If the FP vectors are not stored in memory, we have noticed that the graph structure is overall pretty lean and can fit on the JVM heap pretty easily, even with low heaps. Moreover, another important factor is efficient IO access for scoring FP vectors, even for the non-quantized use cases. A few months back I noticed significantly more IO access in the Lucene HNSW format than in the jVector version at the time.

Instead, I think it might be more valuable to dig into the specific improvements JVector brings and see if any of those could make our current HNSW implementation better. I thought that was already explored in https://github.com/apache/lucene/issues/12615, and as I recall, there wasn’t a strong enough differentiator to justify pulling in the full JVector approach. Revisiting that might be a better path forward.

I think this is not much different from the FAISS integration, in the sense that it is possible to try to copy all the code over to the Lucene HNSW implementation. That, however, is not always possible, and even when it is, it can create maintenance difficulties. Consider the scenario of DataStax, which uses jVector in a number of projects such as C* and OpenSearch. Ideally we would not want to maintain copies of the same code in various projects, but rather have an easy path to leverage the innovation across projects. Integrating the jVector codec into Lucene helps with portability: it doesn't require the overhead of maintaining an OpenSearch plugin and allows for a more seamless integration. At the same time, it allows shifting more resources to innovation without blocking or hindering other approaches in other codecs.

sam-herman avatar Jun 21 '25 20:06 sam-herman

If the FP vectors are not stored in memory, we have noticed that the graph structure is overall pretty lean and can fit on the JVM heap pretty easily, even with low heaps.

This is true for quantized HNSW as well.

Moreover, another important factor is efficient IO access for scoring FP vectors, even for the non-quantized use cases.

What are the improvements there? Is it adjusting the IO patterns for scoring or is it because rescoring utilizes LVQ? (e.g. scalar quantization centered on a centroid...which is a simpler version of the OptimizedScalarQuantizer in Lucene right now)

Concurrency - jVector allows for concurrent builds of the graph index.

So does Lucene.

A few months back I noticed significantly more IO access in the Lucene HNSW format than in the jVector version at the time.

I would be happy to see those numbers.

I do think there are things that Lucene can learn from JVector. It would be way better for the Lucene community as a whole to "do the hard thing" and improve Lucene directly instead of adding yet another plugin that does yet another graph based vector search impl.

If we see that JVector is just better than quantized HNSW in every way, why not simply sunset HNSW and switch to Vamana?

benwtrent avatar Jun 23 '25 12:06 benwtrent

I do think there are things that Lucene can learn from JVector. It would be way better for the Lucene community as a whole to "do the hard thing" and improve Lucene directly instead of adding yet another plugin that does yet another graph based vector search impl.

It's not an either-or; it's both.

dsmiley avatar Jun 23 '25 13:06 dsmiley

It's not an either-or; it's both.

The integration may better highlight how Lucene core can be improved as it allows a more "apples to apples" comparison for performance, etc.

I am just trying to express the direction I think Lucene should go in. Which is making core better (and thus helping typical users).

Three different graph-based indices that users can grab (yes, two will be in sandbox) is too much.

We don't want Apache Lucene to turn into FAISS, where you have an extremely complicated index customization system with many hundreds of index types, etc.

benwtrent avatar Jun 23 '25 14:06 benwtrent

Non-default codecs don't have the high support burden that the default codec has, in terms of backwards compatibility and general documentation expectations. Few users will choose it. If we have a graph index that seems to offer no compelling value, it should be removed. I doubt Lucene will have 3 different ones long term. Today, I think we should be encouraging more flowers to bloom in this new/exciting field.

dsmiley avatar Jun 23 '25 19:06 dsmiley

Concurrency - jVector allows for concurrent builds of the graph index.

So does Lucene.

@benwtrent I may have missed it, but does Lucene build a single graph in a multi-threaded manner? I have only seen multi-threading in the context of merge for Lucene99HnswVectorsWriter, but couldn't see multi-threaded construction of a single graph. If this is indeed the case, do you mind pointing me to the right place to examine the implementation?

sam-herman avatar Jun 25 '25 18:06 sam-herman

@sam-herman flush is indeed single-threaded.

I was mentioning this in the context of merge, which is the only case in which it's necessary, as the initial size of segments is tiny compared with merged segments.

benwtrent avatar Jun 25 '25 18:06 benwtrent

@sam-herman flush is indeed single-threaded.

I was mentioning this in the context of merge, which is the only case in which it's necessary, as the initial size of segments is tiny compared with merged segments.

@benwtrent Thank you for confirming my understanding! Whether we do this through the jVector codec or add it to the default Lucene HNSW codec, I think multi-threaded construction of a graph is quite useful for a number of use cases:

  1. Agility - Testing and experimentation is much faster when we want to isolate learnings about a graph codec over 1B+ vectors in a single graph.
  2. Offline build - We can't always assume graph buildup is done using pre-existing smaller graphs; there are scenarios in which customers would like to build large (1B+ vector) graphs offline, directly from the data, and later re-import them into the datastore.
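As a toy illustration of what a concurrent single-segment graph build looks like (this is not jVector's actual algorithm; real graph indexes do careful neighbor selection and fine-grained locking), multiple threads can insert nodes into a shared adjacency structure:

```java
// Toy sketch only: concurrent insertion into a shared adjacency map,
// the basic pattern behind a multi-threaded graph build.
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrentGraphBuild {

    /** Thread-safe adjacency map: node id -> neighbor ids. */
    public static final ConcurrentMap<Integer, Set<Integer>> graph = new ConcurrentHashMap<>();

    /** Insert a node and link it to a few prior nodes (back-links are
     *  best-effort, since neighbors may not have been inserted yet). */
    static void insert(int node) {
        Set<Integer> neighbors = ConcurrentHashMap.newKeySet();
        graph.put(node, neighbors);
        // Simplified neighbor choice: link to up to 2 lower-numbered nodes.
        for (int nbr = Math.max(0, node - 2); nbr < node; nbr++) {
            neighbors.add(nbr);
            Set<Integer> back = graph.get(nbr);
            if (back != null) back.add(node); // bidirectional edge if present
        }
    }

    public static void buildConcurrently(int numNodes, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < numNodes; i++) {
            final int node = i;
            pool.submit(() -> insert(node));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        buildConcurrently(1000, 8);
        System.out.println("nodes built: " + graph.size());
    }
}
```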

sam-herman avatar Jun 25 '25 18:06 sam-herman

Also, calling this a “disk-based” solution seems a bit misleading if the graph still has to be built fully in memory. That’s often the core problem people are trying to get around.

JVector does not store the graph in memory; it stores the vast majority of the graph on disk. It uses a mixed approach between HNSW and DiskANN, where the top layers of the HNSW-style hierarchy are stored in memory and the base layer is a DiskANN-style graph stored on disk. Note that in HNSW, the upper layers are usually a very small fraction of the overall footprint (~3% for a max-degree of 32).
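As a rough sanity check of the ~3% figure (assuming the standard HNSW level assignment with mL = 1/ln(M), under which a node reaches level l with probability M^-l):

```java
// Back-of-the-envelope check: with level probability M^-l, the expected number
// of upper-layer node entries per base-layer node is
//   sum_{l>=1} M^-l = 1/(M - 1),
// i.e. about 1/31 ~ 3.2% for max-degree M = 32.
public class UpperLayerFraction {

    public static double upperLayerFraction(int maxDegree) {
        return 1.0 / (maxDegree - 1); // closed form of the geometric series
    }

    public static void main(String[] args) {
        int m = 32;
        System.out.printf("upper layers / base layer for M=%d: %.4f (~%.1f%%)%n",
                m, upperLayerFraction(m), 100 * upperLayerFraction(m));
    }
}
```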

As you say, in the current implementation, the graph is built in memory and then written to disk. We are currently working towards removing this constraint.

marianotepper avatar Jun 25 '25 18:06 marianotepper

What are the improvements there? Is it adjusting the IO patterns for scoring or is it because rescoring utilizes LVQ? (e.g. scalar quantization centered on a centroid...which is a simpler version of the OptimizedScalarQuantizer in Lucene right now)

We are using NVQ, a non-uniform vector quantizer that introduces nonlinear quantization and uses subvectors (dividing each vector into chunks). It is somewhat similar in spirit to the OptimizedScalarQuantizer, but instead of using a piecewise-linear scheme to achieve a nonlinearity, it uses a nonlinearity directly. A paper describing the approach is coming soon.
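Since the NVQ paper is not yet published, the sketch below is not NVQ itself; it only illustrates the general idea of nonlinear scalar quantization: warp values through a compressive nonlinearity (tanh here, chosen arbitrarily), then quantize uniformly on the warped axis, so more resolution is spent near the bulk of the data.

```java
// Illustrative only: a generic nonlinear scalar quantizer, NOT the actual NVQ scheme.
public class NonlinearQuantSketch {

    // Compressive nonlinearity and its inverse (tanh / atanh, as an example).
    static double forward(double x) { return Math.tanh(x); }
    static double inverse(double y) { return 0.5 * Math.log((1 + y) / (1 - y)); }

    /** Quantize to a byte code in [0, 255] on a uniform grid over the warped range (-1, 1). */
    public static int quantize(double x) {
        double y = forward(x);
        return (int) Math.round((y + 1) * 127.5);
    }

    /** Map a code back to the original axis (clamped to keep atanh finite). */
    public static double dequantize(int code) {
        double y = code / 127.5 - 1;
        return inverse(Math.max(-0.999999, Math.min(0.999999, y)));
    }

    public static void main(String[] args) {
        double x = 0.3;
        int code = quantize(x);
        System.out.printf("x=%.3f -> code=%d -> xhat=%.4f%n", x, code, dequantize(code));
    }
}
```

The same companding idea applies per subvector chunk, with the nonlinearity's parameters fit to the data in that chunk.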

marianotepper avatar Jun 25 '25 18:06 marianotepper

Hi everyone! I just opened the Lucene JVector PR: #14892. Thanks everyone for all the comments! I'd really appreciate people checking it out, along with any advice or suggestions for improvement.

RKSPD avatar Jul 02 '25 18:07 RKSPD