lucene
Avoid recalculating the norm of the target vector when using cosine metric
Description
Currently, during KNN retrieval, we use VectorSimilarityFunction#compare to score the target (query) vector against each candidate vector. With the cosine metric, this recomputes the norm of the target vector on every call, even though the target does not change during a search. To avoid the repeated work, we can compute the squared norm of the target vector once and pass it to the scoring method. I suggest modifying the relevant interface as follows:
```
@FunctionalInterface
public interface ByteVectorScorer {
  float score(byte[] vector);
}

@FunctionalInterface
public interface FloatVectorScorer {
  float score(float[] vector);
}

public enum VectorSimilarityFunction {
  EUCLIDEAN {
    @Override
    public ByteVectorScorer getVectorScorer(byte[] target) {
      return vector -> 1 / (1f + squareDistance(target, vector));
    }

    @Override
    public FloatVectorScorer getVectorScorer(float[] target) {
      return vector -> 1 / (1 + squareDistance(target, vector));
    }
  },
  COSINE {
    @Override
    public ByteVectorScorer getVectorScorer(byte[] target) {
      // Computed once per target instead of once per comparison.
      int squareNorm = dotProduct(target, target);
      return vector -> (1 + cosine(target, vector, squareNorm)) / 2;
    }

    @Override
    public FloatVectorScorer getVectorScorer(float[] target) {
      // Computed once per target instead of once per comparison.
      double squareNorm = dotProduct(target, target);
      return vector -> Math.max((1 + cosine(target, vector, squareNorm)) / 2, 0);
    }
  };

  public abstract ByteVectorScorer getVectorScorer(byte[] target);

  public abstract FloatVectorScorer getVectorScorer(float[] target);
}
```
Any thoughts?
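For reference, here is a minimal standalone sketch of what a `cosine(target, vector, squareNorm)` helper could look like. The class name `CosineWithCachedNorm` and both helper methods are hypothetical and written only for illustration; they are not existing Lucene APIs. The point is that the target's squared norm is an argument, so only the candidate vector's norm is accumulated inside the loop.

```java
final class CosineWithCachedNorm {

  /** Squared L2 norm of a vector; meant to be computed once per target. */
  static double squareNorm(float[] v) {
    double sum = 0;
    for (float x : v) {
      sum += x * x;
    }
    return sum;
  }

  /**
   * Cosine similarity between target and vector, where the caller supplies
   * the target's squared norm instead of having it recomputed here.
   * Assumes both vectors have the same length and neither is all zeros.
   */
  static float cosine(float[] target, float[] vector, double targetSquareNorm) {
    double dot = 0;
    double vectorSquareNorm = 0;
    for (int i = 0; i < target.length; i++) {
      dot += target[i] * vector[i];
      vectorSquareNorm += vector[i] * vector[i];
    }
    return (float) (dot / Math.sqrt(targetSquareNorm * vectorSquareNorm));
  }

  public static void main(String[] args) {
    float[] target = {3f, 4f};
    double cachedNorm = squareNorm(target); // computed once, reused below

    System.out.println(cosine(target, new float[] {3f, 4f}, cachedNorm));  // prints 1.0
    System.out.println(cosine(target, new float[] {-4f, 3f}, cachedNorm)); // prints 0.0
  }
}
```

In this sketch the saving per comparison is one multiply-add per dimension, so the benefit would need to be confirmed with a benchmark against the vectorized implementation.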
I would rather not change anything related to this enumeration until we figure out: https://github.com/apache/lucene/issues/13182
As an aside, I think cosine as a metric is fairly useless. Folks using Lucene should just normalize everything and use dot product, or use max-inner-product.
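To illustrate the "normalize everything and use dot product" suggestion, here is a small self-contained sketch. The class `NormalizeThenDot` and its helpers are hypothetical stand-ins, not Lucene code. It shows that once both vectors are scaled to unit length, a plain dot product gives the same value as cosine similarity, so no norm needs to be computed at query time.

```java
final class NormalizeThenDot {

  /** Returns a unit-length copy of v; rejects the zero vector. */
  static float[] l2Normalize(float[] v) {
    double sq = 0;
    for (float x : v) {
      sq += x * x;
    }
    if (sq == 0) {
      throw new IllegalArgumentException("cannot normalize a zero vector");
    }
    double norm = Math.sqrt(sq);
    float[] out = new float[v.length];
    for (int i = 0; i < v.length; i++) {
      out[i] = (float) (v[i] / norm);
    }
    return out;
  }

  /** Plain dot product, accumulated in double for stability. */
  static float dotProduct(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return (float) sum;
  }

  /** Reference cosine similarity that computes both norms on every call. */
  static float cosine(float[] a, float[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return (float) (dot / Math.sqrt(na * nb));
  }

  public static void main(String[] args) {
    float[] a = {1f, 2f, 3f};
    float[] b = {4f, -5f, 6f};

    // Normalize once (for documents, at index time), then score with dot product.
    float viaDot = dotProduct(l2Normalize(a), l2Normalize(b));
    float viaCosine = cosine(a, b);

    System.out.println(Math.abs(viaDot - viaCosine) < 1e-6f); // prints true
  }
}
```

The trade-off is that normalization happens once per vector up front rather than on every comparison, which is why cosine adds little once vectors are stored normalized.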
> I would rather not change anything related to this enumeration until we figure out: #13182
got it.
> As an aside, I think cosine as a metric is fairly useless. Folks using Lucene should just normalize everything and use dot product, or use max-inner-product.
IMO, if we think that cosine is useless, we should remove it and guide users to use the metric we believe is correct. Otherwise, we should try to optimize it.
Closing in favour of #13281.