
Deprecate `COSINE` before Lucene 10 release

Open benwtrent opened this issue 1 year ago • 7 comments

Description

Across a couple of disparate discussions on various PRs/issues, we have tossed around the idea of deprecating COSINE.

To me, this makes sense; we shouldn't have COSINE at all. Either users should normalize before indexing, or use max-inner product.

The question then becomes "What about byte vectors?" It seems to me that users should still use DOT_PRODUCT and MAX_INNER_PRODUCT.
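As an illustration of the float-vector workaround, here is a minimal sketch (assuming `VectorUtil.l2normalize` and `KnnFloatVectorField` as found in recent Lucene 9.x): normalize once at index time and index with DOT_PRODUCT, since dot product on unit vectors ranks results exactly like cosine.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.util.VectorUtil;

class NormalizeThenIndex {
  // Normalize the raw embedding once at index time, then use DOT_PRODUCT;
  // for unit-length vectors, dot product orders results identically to cosine.
  static Document toDoc(float[] rawEmbedding) {
    float[] unit = VectorUtil.l2normalize(rawEmbedding); // normalizes in place
    Document doc = new Document();
    doc.add(
        new KnnFloatVectorField(
            "embedding", unit, VectorSimilarityFunction.DOT_PRODUCT));
    return doc;
  }
}
```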

benwtrent avatar Apr 08 '24 12:04 benwtrent

@benwtrent Is the main reason to deprecate to stop enabling users to set up non-optimal configurations? Or are there limitations cosine similarity imposes on the implementation/optimization of other distance metrics?

jmazanec15 avatar Apr 16 '24 15:04 jmazanec15

@jmazanec15

Mainly because cosine has no benefit over normalizing & using dot_product, and maintaining optimized cosine similarity functions is an unnecessary burden for Lucene.
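For reference, the identity underlying this point: cosine divides the dot product by both magnitudes, so it degenerates to a plain dot product once the vectors are unit-length.

```latex
\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert\,\lVert b \rVert}
  \;=\; a \cdot b \quad \text{when } \lVert a \rVert = \lVert b \rVert = 1
```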

benwtrent avatar Apr 16 '24 15:04 benwtrent

Thanks @benwtrent, that makes sense

jmazanec15 avatar Apr 16 '24 17:04 jmazanec15

One question: float vectors have a straightforward definition of "normalized to 1.0", i.e. ||a|| = 1; e.g. (-1.0, 0.0) and (0.0, 1.0) are both normalized. In comparison, do we require byte vectors to be normalized to 127, such that (-127, 0) and (0, 127) count as normalized?

wurui90 avatar Jun 28 '24 05:06 wurui90

do we require byte vectors to be normalized to 127, such that (-127, 0) and (0, 127) count as normalized?

No, we do not. For regular dot-product, though, the vectors' magnitudes should all be the same.

benwtrent avatar Jun 28 '24 11:06 benwtrent

@wurui90 I need to think a bit more about your question. Admittedly, I was thinking it would be rather simple for users who quantize their vectors to ensure they are quantizing from normalized vectors; thus, magnitude wouldn't matter.

Maybe we cannot deprecate cosine :(. The deprecation currently exists in 10 & 9.12, but can be easily reverted.

I am not sure what to do for users who quantize their own vectors & rely on cosine.

@msokolov what do you think?

I am wondering if we should, at a minimum, deprecate the cosine functions and instead store the magnitude. That way, the comparisons for the "COSINE" similarity metric are actually the dot-product ones, and the stored magnitude adjusts the resulting dot product (thus all that cosine code can go away).

My concern, and the reason for deprecation, is that cosine is woefully slow compared to dotProduct, and we have a fair bit of code that tries its best to make it faster. It would be really nice to eventually remove all that cosine comparison code and simply rely on dot-product.
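A hypothetical sketch of the store-the-magnitude idea (the class and method names below are illustrative, not an existing Lucene API): compute each vector's magnitude once at write time, and cosine becomes the optimized dot-product kernel plus one scalar division.

```java
import org.apache.lucene.util.VectorUtil;

class StoredMagnitudeCosine {
  // Hypothetical sketch: if magnitudes are computed once at index time and
  // stored alongside the vectors, cosine reduces to the optimized dot-product
  // kernel plus a scalar division, and the dedicated cosine SIMD code can go.
  static float magnitude(float[] v) {
    return (float) Math.sqrt(VectorUtil.dotProduct(v, v));
  }

  static float cosine(float[] a, float aMag, float[] b, float bMag) {
    return VectorUtil.dotProduct(a, b) / (aMag * bMag);
  }
}
```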

benwtrent avatar Jun 29 '24 17:06 benwtrent

Now the picture is clearer to me:

  • Float vectors can be trivially normalized to the same magnitude, so COSINE is of little use and may be deprecated easily.
  • Byte vectors cannot easily be normalized to the same magnitude. People usually get byte vectors by scalar-quantizing float vectors. Scalar quantization applies an (a * x + b) transform, after which the normalized float vectors are not normalized anymore. Storing the byte vector magnitude is a nice way out (see the illustration below).
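A tiny made-up numeric illustration of the second point: two unit-length float vectors, pushed through the same affine quantization, come out with different byte magnitudes.

```java
class QuantizationBreaksNorms {
  // Illustrative only: scalar-quantize with an affine (a * x + b) transform
  // and compare the resulting byte magnitudes of two unit-length float vectors.
  static byte[] quantize(float[] v, float a, float b) {
    byte[] q = new byte[v.length];
    for (int i = 0; i < v.length; i++) {
      q[i] = (byte) Math.round(a * v[i] + b);
    }
    return q;
  }

  static double magnitude(byte[] v) {
    long sum = 0;
    for (byte x : v) sum += (long) x * x;
    return Math.sqrt(sum);
  }

  public static void main(String[] args) {
    float[] u = {1f, 0f};               // ||u|| = 1
    float[] w = {0.6f, -0.8f};          // ||w|| = 1
    byte[] qu = quantize(u, 100f, 10f); // (110, 10)
    byte[] qw = quantize(w, 100f, 10f); // (70, -70)
    System.out.println(magnitude(qu));  // ~110.45
    System.out.println(magnitude(qw));  // ~98.99 -> norms diverge
  }
}
```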

wurui90 avatar Jul 01 '24 19:07 wurui90

I cannot think of an adequate workaround at all for byte folks. The linear transformation of bytes will indeed produce potentially non-uniform magnitudes and could break scoring without some linear correction (we add this when quantizing ourselves).

I am going to revert the deprecation of cosine. Though I still think we should try to get rid of the SIMD & vector comparators and just use dot-product along with storing the magnitude of the vectors.

@msokolov @jmazanec15 y'all might have differing opinions.

benwtrent avatar Jul 18 '24 12:07 benwtrent

It would be interesting to know how many actual users of COSINE there are. I agree there may be no workaround, but that does not mean we need to continue to support it, either. One question I have is: if I supply normalized floating-point vectors and then use quantization, does this imply that the dot-product distance is somehow broken when it is calculated in quantized space? I don't think so; we account for these issues with the correction factors. Given that, I think we can say to COSINE users: instead of COSINE, use DOT_PRODUCT and supply your vectors as normalized floats. Or ... perhaps we could even perform the normalization during indexing? There would be some loss of precision, but too bad.

msokolov avatar Jul 18 '24 13:07 msokolov

I am not sure what to do for users who quantize their own vectors & rely on cosine.

I think I am on the same page as @msokolov. Users could go "float_vector -> norm_float_vector -> byte_vector" and then apply dot product on the byte_vectors. If float_vector -> byte_vector yields an approximation of the dot_product ordering for float_vector, then why wouldn't the same logic hold for norm_float_vector, and thus give an approximation of cosine?

The case to worry about, I think, is when they have a dataset of byte vectors and need cosine (i.e. higher-precision vectors are not available). I don't think there is a workaround because of the inherent ~difference in precision used for data type and distance value (i.e. byte vs double)~ edit: I think I meant to say the order of operations here will result in loss of precision. That being said, to support this, I think norms would need to be stored. A sketch of the pipeline follows.
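A sketch of that "float_vector -> norm_float_vector -> byte_vector" pipeline, with a deliberately simplified symmetric quantizer standing in for Lucene's real scalar quantization (which also stores per-vector correction terms):

```java
import org.apache.lucene.util.VectorUtil;

class NormalizeThenQuantize {
  // Sketch: normalize first, then scalar-quantize, then compare byte vectors
  // with dot product. The quantization step is a stand-in; Lucene's own
  // quantizer also keeps correction factors, omitted here for brevity.
  static byte[] toBytes(float[] raw) {
    float[] unit = VectorUtil.l2normalize(raw); // cosine == dot product now
    byte[] q = new byte[unit.length];
    for (int i = 0; i < unit.length; i++) {
      q[i] = (byte) Math.round(127f * unit[i]); // map [-1, 1] -> [-127, 127]
    }
    return q;
  }
}
```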

jmazanec15 avatar Jul 18 '24 19:07 jmazanec15

@msokolov @jmazanec15

I don't know of many int8 models/datasets out there that require cosine. But I did run a benchmark with Cohere's int8 embeddings here: https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3-int8-binary, which states that cosine is the correct similarity metric for these vectors.

I took 1M English embeddings and calculated the true nearest neighbors with cosine. Using the same HNSW settings, here are my recalls:

  • For cosine: 0.957
  • For dot-product: 0.941
  • For MIP: 0.941

So, it's obviously not a 1-to-1 match, even for these embeddings.

I am not sure we can get rid of cosine for byte vectors without storing the magnitude in the dataset to account for the loss.

Other byte-sized datasets I could find use Euclidean distance (e.g. https://github.com/microsoft/SPTAG/tree/main/datasets/SPACEV1B).

benwtrent avatar Jul 23 '24 18:07 benwtrent

Ah, I see. It seems that some models do have functionality to specify the format of the data returned, but I cannot seem to find others that explicitly say "for int8 data the best space is cosine".

As an alternative, could we just deprecate cosine in VectorUtil/VectorUtilSupport and not deprecate it in VectorSimilarityFunction? Callers could then handle the complexity of supporting cosine themselves. ~It seems the main burden of maintaining optimized cosine lies in those classes~ edit: I see there is a lot of branching elsewhere as well.

jmazanec15 avatar Jul 24 '24 15:07 jmazanec15

@benwtrent Shall we close this issue as "won't fix"?

jpountz avatar Aug 08 '24 16:08 jpountz

Agreed, I think we will miss the window.

benwtrent avatar Aug 08 '24 16:08 benwtrent