
[WIP] Multi-Vector support for HNSW search

Open vigyasharma opened this issue 1 year ago • 55 comments

Adds support for multi-valued vectors to Lucene.

In addition to max-similarity aggregations like parent-block joins, this change supports ColBERT style distance functions that compute interaction across all query and document vector values. Documents can have a variable number of vector values, but to support distance function computations, we require all values to have the same dimension.

This is a big change, and I still need to work on tests (existing and new), backward compatibility, benchmarks, and some code refactoring/cleanup. I'm raising this early version to get feedback on the overall approach; I have marked the PR with nocommit tags.

Addresses #12313 .


Approach

We define a new "Tensor" field that comprises multiple vector values, and a new TensorSimilarityFunction to compute distance across multiple vectors (uses SumMax() currently). Node ordinal is assigned to the tensor value, giving us one ordinal per document. All vector values of a tensor field are processed together during writing, reading and scoring. They are passed around as a packed float[] or byte[] array with all vector values concatenated. Consumers (like the TensorSimilarityFunction) slice this array by dimension to get individual vector values.

Tensors are stored using a new FlatVectorStorage that supports writing/reading variable-length values per field (allowing us to have a different number of vectors per tensor). We reuse the existing HNSW readers and writers. Each graph node is a tensor and maps to a single document. I also added a new codec tensor format, to allow both tensors and vectors to coexist. I'm not yet sure how to integrate with the quantization changes (that can be a separate, later change) and didn't want to force everything into a single format. Tensors continue to work with KnnVectorWriter/Reader and extend the FlatVectorWriter/Reader classes.

Finally, I named the field and format "Tensors", though technically these are only rank-2 tensors. The thought was that we might extend this field and format if we ever add support for higher-rank tensors. I'm open to renaming based on community feedback.


Major Changes

The PR touches a lot of files, which is not practical to review. Here are the files with the key changes. If we align on the approach, I'm happy to split this into separate, smaller PRs.

  1. New fields and similarity function for tensors.
    1. lucene/core/src/java/org/apache/lucene/document/FieldType.java
    2. lucene/core/src/java/org/apache/lucene/document/KnnByteTensorField.java
    3. lucene/core/src/java/org/apache/lucene/util/ByteTensorValue.java
    4. lucene/core/src/java/org/apache/lucene/document/KnnFloatTensorField.java
    5. lucene/core/src/java/org/apache/lucene/util/FloatTensorValue.java
    6. lucene/core/src/java/org/apache/lucene/index/TensorSimilarityFunction.java
    7. lucene/core/src/java/org/apache/lucene/index/FieldInfo.java
    8. lucene/core/src/java/org/apache/lucene/index/FieldInfos.java
  2. Indexing chain changes
    1. lucene/core/src/java/org/apache/lucene/index/IndexingChain.java
    2. lucene/core/src/java/org/apache/lucene/index/VectorValuesConsumer.java
  3. Reader side changes to return a tensor reader for tensor fields
    1. lucene/core/src/java/org/apache/lucene/index/SegmentCoreReaders.java
    2. lucene/core/src/java/org/apache/lucene/index/SegmentReader.java
  4. A new tensor format in the codec
    1. lucene/core/src/java/org/apache/lucene/codecs/KnnTensorsFormat.java
    2. lucene/core/src/java/org/apache/lucene/index/CodecReader.java
  5. A new tensor scorer to work with multiple vector values
    1. lucene/core/src/java/org/apache/lucene/codecs/hnsw/FlatTensorsScorer.java
    2. lucene/core/src/java/org/apache/lucene/codecs/hnsw/DefaultFlatTensorScorer.java
  6. A Lucene99FlatTensorsWriter for writing in the new flat tensor format - lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatTensorsWriter.java
  7. A Lucene99FlatTensorsReader for reading the flat tensor format - lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatTensorsReader.java
  8. An HnswTensorFormat that uses FlatTensorFormat to initialize the flat storage readers/writers underlying the HNSW reader/writer.
    1. lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswTensorsFormat.java
  9. Hnsw reader and writer changes to support tensor fields and similarity function
    1. lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsReader.java
    2. lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsWriter.java
  10. Off Heap Byte and FloatTensorValues for use by scorers
    1. lucene/core/src/java/org/apache/lucene/codecs/lucene99/OffHeapByteTensorValues.java
    2. lucene/core/src/java/org/apache/lucene/codecs/lucene99/OffHeapFloatTensorValues.java
  11. Setup to read and write tensor data value offsets to support variable vector count per tensor. This uses a DirectMonotonicReader/Writer.
    1. lucene/core/src/java/org/apache/lucene/codecs/lucene99/TensorDataOffsetsReaderConfiguration.java
  12. Syntax sugar for tensor queries
    1. lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java
    2. lucene/core/src/java/org/apache/lucene/search/KnnByteTensorQuery.java
    3. lucene/core/src/java/org/apache/lucene/search/KnnFloatTensorQuery.java


Open Questions

  1. I like the type safety of a separate field and similarity fn. It avoids user traps around passing a single-value VectorSimilarityFunction for tensors. But it does add a bunch of extra code, and ctor changes across a lot of files. Some options to simplify could be:
    • Reuse vectorEncoding and vectorDimension attributes in FieldInfo instead of a separate tensor encoding and dimension
    • Also reuse VectorSimilarityFunction, but create a separate "tensor aggregator" that corresponds to SumMax.

vigyasharma avatar Jun 26 '24 19:06 vigyasharma

Thank you @vigyasharma! I'll set aside API design for now (I definitely have thoughts there, but will shelve them).

I am wondering about the performance, and if recall significantly suffers in the "passage vector" use case.

Cohere's wikipedia embeddings all indicate their parent page. So, I wonder how this would work for finding the nearest page given maxsim(passage) vs. using the Lucene join logic and putting the passages directly in the graph.

If we can recover recall by simply gathering more docs or by increasing graph connections, it makes this idea very attractive.

benwtrent avatar Jun 27 '24 14:06 benwtrent

Thanks for looking into this @benwtrent! I did some cleanup on the FieldType and similarity function interfaces; hope it doesn't cause too much churn to what you've gone through so far.

We now reuse the existing vector encoding, dimension, and similarity, and add to FieldType only the new info needed: the aggregation fn, and a flag to differentiate vector vs. tensor fields. I like this abstraction better.

Benchmarking and writing tests are next up on my list. The Cohere "passage vector" use case seems like a valuable exploration, thanks for pointing that out. The baseline comparison with Lucene joins for maxSim is a good first test, although it's also worth noting that this change offers additional functionality over joins. It enables ColBERT-style late interactions, where both queries and documents can have multiple vector values.

vigyasharma avatar Jun 28 '24 00:06 vigyasharma

The baseline comparison with Lucene joins for maxSim is a good first test, although it's also worth noting that this change offers additional functionality over joins. It enables ColBERT-style late interactions, where both queries and documents can have multiple vector values.

I recognize that. I am just thinking of the flexibility Lucene provides.

As an aside, at first blush, this seems way too complicated. I honestly would have expected:

  • some format changes to the EXISTING format to allow for more than one vector, and an option for how to handle more than one
  • a change to [Float|Byte]VectorValues to allow iterating individual vectors (maybe a "multi" vector values interface)

I am against adding an entirely new format type. It seems like we should extend the existing one.

benwtrent avatar Jun 28 '24 11:06 benwtrent

at first blush, this seems way too complicated.

Happy to simplify it; that's indeed the idea behind this early draft. For my understanding, what are your reservations about a new format type? I have a sense, but would like to hear your perspective.

By extending the existing format, I assume you mean not having a separate KnnTensorsFormat in the codec. It gave me the convenience of initializing FlatVectorsFormat with a TensorScorer. To extend KnnVectorsFormat, I could probably modify the FlatVectorScorer to include tensor fns (they need the aggregation fn). On the storage side, tensors have variable-length vector arrays and need to read/write their start offsets. I could update the existing FlatVectorsFormat and write these data offsets only when the field is a tensor.
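
Roughly, I'm picturing something like the sketch below for those offsets, using DirectMonotonicWriter from Lucene's packed utils. The method and variable names here are illustrative, not the actual change:

```java
// Sketch only: record the start offset of each document's packed vector data so
// readers can slice variable-length multi-vectors. One monotonically increasing
// offset is written per document.
import java.io.IOException;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.util.packed.DirectMonotonicWriter;

static void writeVectorDataOffsets(IndexOutput meta, IndexOutput data, long[] offsets)
    throws IOException {
  final int blockShift = 16; // 2^16 values per DirectMonotonic block
  meta.writeInt(blockShift);
  meta.writeLong(offsets.length);
  DirectMonotonicWriter writer =
      DirectMonotonicWriter.getInstance(meta, data, offsets.length, blockShift);
  for (long offset : offsets) {
    writer.add(offset);
  }
  writer.finish();
}
```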

How does that sound? I am thinking of keeping it similar to the ScalarQuantization change with separate HnswReader|Writer and FlatVectorReader|Writer classes that modify the existing format but are not yet the codec defaults. (As an aside, I haven't gone through how Scalar Quantization works yet, so I'll make the change without it, and we can subsequently iterate to include it).

While we're at it, I feel that the name "tensor" is less fitting without the separate format. I'm considering renaming to "multi-vectors", thoughts?

vigyasharma avatar Jun 30 '24 06:06 vigyasharma

a change to [Float|Byte]VectorValues to allow iterating individual vectors (maybe a "multi" vector values interface)

The work to handle multiple vector values gets taken care of in the similarity function. So we don't really have to change the [Float|Byte]VectorValues and RandomAccessVectorValues interfaces. We still pass around a T[] vector value; it's just that it can be of variable length (depending on the number of vectors). The similarity fn slices it by dimension before comparing.

I feel this fits conveniently with a lot of our existing interfaces. Do you see a specific need for [Float|Byte]VectorValues to iterate on individual vector values instead?

vigyasharma avatar Jun 30 '24 06:06 vigyasharma

I could update the existing FlatVectorsFormat and write these data offsets only for when the field is a tensor.

I was thinking something like this. We should dynamically handle if more than one vector is provided. Having to configure up front "Hey, more than one vector is incoming" is weird. Why would anybody ever configure the "single vector" case, when the multi-vector case would also handle a single vector? It seems like we don't need specialized formats, but should instead update the current flat vector formats.

I feel this fits conveniently with a lot of our existing interfaces. Do you see a specific need for [Float|Byte]VectorValues to iterate on individual vector values instead?

Yes, we need to be able to iterate vectors via doc Ids and gather each individual vector for a given document.

Three concerns immediately come to mind:

  • rescoring docs that are gathered via some quantized methodology
  • Determining the true nearest vector for a given document.
  • Ability to iterate vectors when quantizing them. We need to randomly sample across all vectors. We don't want to sample via doc ids; this will likely add bias and hurt the quantization quality.

As for adding new information to the FieldInfo, another valid option is making it configurable directly on the format and not updating FieldInfo. I am not sure it's valuable to have it in FieldInfo. I wouldn't expect the usages for how to resolve multi-vector scoring to be as broad as our similarity functions or vector dimensions.

benwtrent avatar Jul 01 '24 13:07 benwtrent

Merged in latest origin/main to resolve the merge conflict that arose from #13529 - hope that's okay.

cpoerschke avatar Jul 01 '24 17:07 cpoerschke

we need to be able to iterate vectors via doc Ids and gather each individual vector for a given document.

Could the consumers of T[] vectorValue() unpack the multiple vector values for a doc when they need to? We could add an iterator method for this in Float|ByteVectorValues for convenience. Do you have an interface in mind?

I imagine the DISI would still advance to the next document in nextDoc(), and this would be a separate method, like nextValue() which iterates on the packed vector value, or optionally calls nextDoc() when needed?
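
Something along the lines of the sketch below is what I'm picturing. The class and method names are placeholders for discussion, not actual PR code:

```java
// Hypothetical interface sketch: iterate per-document via the usual DISI methods,
// then per-vector within the current document's packed value via nextValue().
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

public abstract class MultiVectorValues extends DocIdSetIterator {
  /** Dimension of each individual vector. */
  public abstract int dimension();

  /** Number of vectors packed into the current document's value. */
  public abstract int vectorCount();

  /**
   * Returns the next vector of the current document, or null once the current
   * document's vectors are exhausted; callers then advance via nextDoc()/advance().
   */
  public abstract float[] nextValue() throws IOException;
}
```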

As for adding new information to the FieldInfo, another valid option is making it configurable directly on the format and not update fieldinfo.

I like the flexibility of switching from single to multi-values. I was thinking of adding it in a subsequent iteration. Re: not keeping state in FieldInfo, my worry is with the aggregation function. Will we be able to ensure users don't change the aggregation fn midway through indexing their corpus, if this state is only within the format and FieldReader/Writers... I need to think about it in more detail.

vigyasharma avatar Jul 01 '24 17:07 vigyasharma

Merged in latest origin/main to resolve the merge conflict that arose from #13529 - hope that's okay.

That's perfect @cpoerschke , and thank you for looking at this change!

vigyasharma avatar Jul 01 '24 17:07 vigyasharma

I imagine the DISI would still advance to the next document in nextDoc(), and this would be a separate method, like nextValue() which iterates on the packed vector value, or optionally calls nextDoc() when needed?

Yes, something like that exactly.

Will we be able to ensure users don't change the aggregation fn midway through indexing their corpus, if this state is only within the format and FieldReader/Writers... I need to think about it in more detail.

I am not sure of a fool-proof check other than ensuring everything is the same on merge. But even then, I am not 100% sure that is even necessary. If the score aggregation changes on merge, we just change it. I am not sure it's an issue.

benwtrent avatar Jul 01 '24 18:07 benwtrent

Did some refactoring to reuse the same format. Also renamed "tensor" to "multi-vector".

vigyasharma avatar Jul 04 '24 02:07 vigyasharma

Diff is down to 45 files now (from 75, yikes!), and most touch points are still due to FieldInfo changes. These are the files with key changes:

lucene/core/src/java/org/apache/lucene/codecs/hnsw/DefaultFlatMultiVectorScorer.java
lucene/core/src/java/org/apache/lucene/codecs/hnsw/FlatVectorsScorer.java

lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatMultiVectorsFormat.java
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatMultiVectorsReader.java
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatMultiVectorsWriter.java
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswMultiVectorsFormat.java
lucene/core/src/java/org/apache/lucene/codecs/lucene95/OrdToDocDISIReaderConfiguration.java
lucene/core/src/java/org/apache/lucene/codecs/lucene99/MultiVectorDataOffsetsReaderConfiguration.java

lucene/core/src/java/org/apache/lucene/codecs/lucene99/OffHeapByteMultiVectorValues.java
lucene/core/src/java/org/apache/lucene/codecs/lucene99/OffHeapFloatMultiVectorValues.java

lucene/core/src/java/org/apache/lucene/index/FieldInfo.java
lucene/core/src/java/org/apache/lucene/index/FieldInfos.java
lucene/core/src/java/org/apache/lucene/index/IndexableFieldType.java
lucene/core/src/java/org/apache/lucene/document/FieldType.java
lucene/core/src/java/org/apache/lucene/util/ByteMultiVectorValue.java
lucene/core/src/java/org/apache/lucene/util/FloatMultiVectorValue.java
lucene/core/src/java/org/apache/lucene/document/KnnByteMultiVectorField.java
lucene/core/src/java/org/apache/lucene/document/KnnFloatMultiVectorField.java

lucene/core/src/java/org/apache/lucene/index/IndexingChain.java
lucene/core/src/java/org/apache/lucene/index/MultiVectorSimilarity.java
lucene/core/src/java/org/apache/lucene/index/MultiVectorSimilarityFunction.java

vigyasharma avatar Jul 04 '24 02:07 vigyasharma

As for adding new information to the FieldInfo, another valid option is making it configurable directly on the format and not updating FieldInfo.

We create multi-vector FieldWriters differently for the flat format as well as HNSW graph builds. In its current form, we'll have to change these objects midway during indexing if users switch from single to multi-vectors. Maybe we can defer multi-vector-specific functionality to later stages, like right before flush or merge... but I need to think on this more deeply before we can move away from keeping multi-vector attributes in FieldInfo.

For now, I've left it as is while I shift focus to benchmarking and testing. We can come back to this later.

vigyasharma avatar Jul 06 '24 02:07 vigyasharma

@vigyasharma is there a reason for adding multi-vector field support rather than using the parent-child relationship of documents to fulfill this use case?

navneet1v avatar Jul 16 '24 23:07 navneet1v

@navneet1v

The pattern doesn't work well with ColBERT-esque models.

benwtrent avatar Jul 17 '24 00:07 benwtrent

The pattern doesn't work well with ColBERT-esque models.

+1. Good question, @navneet1v. I had the same doubts before starting this effort. There is some discussion in #12313.

Basically, multi-vectors allow for ColBERT-style late interaction models, where queries and documents can both be represented by multiple vector values, and you compute a similarity measure using all those values (e.g. sum(max(similarity))). They capture contextual interaction across different parts of a query and a document, without incurring the query-time computational overhead of cross-encoder models.
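
To make the sum(max(similarity)) concrete: for a query with vectors q_1..q_m and a document with vectors d_1..d_n, the late-interaction score is

```latex
\mathrm{score}(Q, D) = \sum_{i=1}^{m} \max_{1 \le j \le n} \mathrm{sim}(q_i, d_j)
```

where sim is the per-vector similarity (e.g. dot product or cosine).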

Nested block joins, on the other hand, compute distance with a single vector value at a time and then aggregate. That doesn't work well here, especially when both the query and the document have multiple vectors.

vigyasharma avatar Jul 18 '24 18:07 vigyasharma

Cohere's wikipedia embeddings all indicate their parent page. So, I wonder how this would work for finding the nearest page given maxsim(passage) vs. using the Lucene join logic and putting the passages directly in the graph.

@benwtrent: do we have any existing benchmarks for ParentJoin queries in knn?

I didn't find any in KnnGraphTester or the benchmarking code I've looked at so far. I can add support for ParentJoin benchmarks, but I see you ran something similar here, so I thought I'd check if we already have this available.

vigyasharma avatar Jul 23 '24 00:07 vigyasharma

@vigyasharma

do we have any existing benchmarks for ParentJoin queries in knn?

No, we do not. I ended up writing a bunch of throw-away code to benchmark latency and recall :(.

Those particular benchmarks I ran with elastic/rally.

benwtrent avatar Jul 23 '24 11:07 benwtrent

I started adding support for ParentJoin benchmarks (issue). I will raise it in multiple small PRs; here's the first one.

vigyasharma avatar Jul 24 '24 18:07 vigyasharma

I started adding support for ParentJoin benchmarks (issue). I will raise it in multiple small PRs; here's the first one.

Thank you for improving our benchy tooling @vigyasharma!

mikemccand avatar Jul 25 '24 12:07 mikemccand

I think it's awesome to invest in our benchmarking tooling to be able to test different approaches for multi-valued vectors, but I don't think that should be a blocker to merging this? It adds functionality over the block-join approach, and it's nice to have options for how to model multi-valued things.

Block joins are powerful if you also have other attributes you'd like to query on at the child level, e.g. "KNN(query-vector) AND price < $50" joined up to parent documents ... but are maybe overkill if you really just want to have multiple vectors without such added per-vector constraints.
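
For example, here's a rough sketch of that kind of child-level constraint with today's join APIs; the field names and the exact query shape are illustrative:

```java
// Sketch: kNN over child (passage) vectors, pre-filtered by a child-level price
// constraint, joined up to parent documents. Field names are made up for the example.
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.join.BitSetProducer;
import org.apache.lucene.search.join.QueryBitSetProducer;
import org.apache.lucene.search.join.ScoreMode;
import org.apache.lucene.search.join.ToParentBlockJoinQuery;

static Query passageKnnJoinedToParents(float[] queryVector) {
  // Child-level filter: price < 50 (IntPoint range bounds are inclusive).
  Query priceFilter = IntPoint.newRangeQuery("price", 0, 49);
  // Top-k nearest passages among children that pass the filter.
  Query childKnn = new KnnFloatVectorQuery("passage_vector", queryVector, 10, priceFilter);
  // Parents are marked with a docType=parent field on the parent documents.
  BitSetProducer parentsFilter =
      new QueryBitSetProducer(new TermQuery(new Term("docType", "parent")));
  // Join child hits to their parent, keeping the max child score per parent.
  return new ToParentBlockJoinQuery(childKnn, parentsFilter, ScoreMode.Max);
}
```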

mikemccand avatar Jul 25 '24 12:07 mikemccand

I think it's awesome to invest in our benchmarking tooling to be able to test different approaches for multi-valued vectors, but I don't think that should be a blocker to merging this?

I don't think it's a blocker for code to exist in luceneutil. But we should be aware of how this actually behaves from a recall standpoint. Right now, we don't have a clue.

At a minimum, if the expressed requirement for this is to add ColBERT-style things, we need to know how this behaves with that.

But, it would also be good to see how this would work with passage search and how recall behaves.

benwtrent avatar Jul 25 '24 13:07 benwtrent

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

github-actions[bot] avatar Aug 09 '24 00:08 github-actions[bot]

This PR has not had activity in the past 2 weeks, labeling it as stale...

Just to update on some activity here: I'm working on parent block join benchmarks in luceneutil, which I'll use to compare against the passage search use-case for multi-vectors. I had to put it on hold due to other priorities, but I should be able to get back to it next week.

vigyasharma avatar Aug 19 '24 20:08 vigyasharma

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

github-actions[bot] avatar Sep 04 '24 00:09 github-actions[bot]

Changes to support parentJoin benchmarks in luceneutil -- https://github.com/mikemccand/luceneutil/pull/296

Benchmarking results for the passage search use-case using Cohere embeddings for wikipedia (this is for the default parentJoin implementation, without using multi-vectors):

# parent join with quantization
recall  latency (ms)    nDoc    fanout  maxConn beamWidth       quantized       visited index ms        selectivity     filterType
0.158    3.96   253804  20      32      50      4 bits  100     25213   1.00    post-filter
0.162    4.00   253804  20      32      50      7 bits  100     24277   1.00    post-filter
0.015   13.42   253804  20      32      50      8 bits  100     167694  1.00    post-filter

# parentJoin without quantization
recall  latency (ms)    nDoc    fanout  maxConn beamWidth       quantized       visited index ms        selectivity     filterType
0.167    6.22   253804  20      32      50      no      100     27412   1.00    post-filter

# default run (no parentJoin)
recall  latency (ms)    nDoc    fanout  maxConn beamWidth       quantized       visited index ms        selectivity     filterType
0.565    1.89   250000  20      32      50      4 bits  4610    43426   1.00    post-filter
0.820    1.62   250000  20      32      50      7 bits  4198    43593   1.00    post-filter
0.004    3.49   250000  20      32      50      8 bits  9531    86058   1.00    post-filter

vigyasharma avatar Sep 06 '24 22:09 vigyasharma

Is "default run" from this PR?

benwtrent avatar Sep 09 '24 23:09 benwtrent

@msokolov I recently saw you were working on a major refactor where we make every vector access random access. I think this might make the changes in this PR simpler, as we won't have two access methodologies but just a single one.

Though I read that this might not be done until sometime after the Lucene 10 release.

benwtrent avatar Sep 11 '24 17:09 benwtrent

Is "default run" from this PR?

No. "default run" is knn search where each embedding is a separate document with no relationship between them. I'm still wiring things up to see benchmark results for this PR.

vigyasharma avatar Sep 12 '24 18:09 vigyasharma

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

github-actions[bot] avatar Sep 27 '24 00:09 github-actions[bot]