
Avoid recalculating the norm of the target vector when using cosine metric

Open bugmakerrrrrr opened this issue 1 year ago • 1 comments

Description

Currently, in the KNN retrieval process, we use `VectorSimilarityFunction#compare` to calculate the score between the target vector and the current vector. With the cosine metric, this method recalculates the norm of the target vector on every call, even though the target stays the same for the whole search. To avoid this repetition, we can compute the squared norm of the target vector once and pass it to the score method. I suggest modifying the relevant interface as follows:

```
@FunctionalInterface
public interface ByteVectorScorer {
    float score(byte[] vector);
}

@FunctionalInterface
public interface FloatVectorScorer {
    float score(float[] vector);
}

public enum VectorSimilarityFunction {

    EUCLIDEAN {
        @Override
        public ByteVectorScorer getVectorScorer(byte[] target) {
            return vector -> 1 / (1f + squareDistance(target, vector));
        }

        @Override
        public FloatVectorScorer getVectorScorer(float[] target) {
            return vector -> 1 / (1 + squareDistance(target, vector));
        }
    },

    COSINE {
        @Override
        public ByteVectorScorer getVectorScorer(byte[] target) {
            // Computed once per target and captured by the returned scorer.
            int squareNorm = dotProduct(target, target);
            return vector -> (1 + cosine(target, vector, squareNorm)) / 2;
        }

        @Override
        public FloatVectorScorer getVectorScorer(float[] target) {
            // Computed once per target and captured by the returned scorer.
            double squareNorm = dotProduct(target, target);
            return vector -> Math.max((1 + cosine(target, vector, squareNorm)) / 2, 0);
        }
    };

    public abstract ByteVectorScorer getVectorScorer(byte[] target);

    public abstract FloatVectorScorer getVectorScorer(float[] target);

}
```

Any thoughts?
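
To make the idea concrete outside of Lucene, here is a minimal, self-contained sketch of the same pattern for float vectors. Every name in it (`CosineScorerSketch`, `FloatScorer`, `cosineScorer`, and the local `dotProduct`) is hypothetical and chosen for illustration; none of it is Lucene's actual API. The cosine helper with a precomputed squared norm is also an assumption about how such a method could look.

```java
// Hypothetical sketch: these names are illustrative, not Lucene's API.
final class CosineScorerSketch {

    /** Scores a candidate vector against a target fixed at creation time. */
    @FunctionalInterface
    interface FloatScorer {
        float score(float[] vector);
    }

    /** Plain scalar dot product; both arrays must have the same length. */
    static float dotProduct(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "dimension mismatch: " + a.length + " != " + b.length);
        }
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /**
     * Returns a scorer that maps cosine similarity from [-1, 1] into [0, 1].
     * Zero-length vectors are not handled and would produce NaN.
     */
    static FloatScorer cosineScorer(float[] target) {
        // Computed once per target, then captured by the lambda below.
        final double targetSquareNorm = dotProduct(target, target);
        return vector -> {
            double dot = dotProduct(target, vector);
            double vectorSquareNorm = dotProduct(vector, vector);
            double cosine = dot / Math.sqrt(targetSquareNorm * vectorSquareNorm);
            return (float) Math.max((1 + cosine) / 2, 0);
        };
    }

    public static void main(String[] args) {
        FloatScorer scorer = cosineScorer(new float[] {1f, 0f});
        System.out.println(scorer.score(new float[] {2f, 0f}));  // same direction
        System.out.println(scorer.score(new float[] {0f, 3f}));  // orthogonal
        System.out.println(scorer.score(new float[] {-1f, 0f})); // opposite direction
    }
}
```

In this sketch each call to `score` still computes the dot product and the candidate's squared norm, so the saving is one of the three passes over the data per comparison.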

bugmakerrrrrr, Mar 14 '24 17:03

I would rather not change anything related to this enumeration until we figure out: https://github.com/apache/lucene/issues/13182

As an aside, I think cosine as a metric is fairly useless. Folks using Lucene should just normalize everything and use dot product, or use max-inner-product.
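
A rough sketch of why normalizing up front makes cosine redundant: once both vectors have unit length, their dot product equals the cosine of the originals, so the norms only need to be paid for once per vector rather than once per comparison. The class and method names below (`NormalizeThenDotSketch`, `normalize`, `cosine`, `dotProduct`) are hypothetical helpers written for this illustration, not Lucene methods.

```java
// Hypothetical sketch: these helpers are illustrative, not Lucene's API.
final class NormalizeThenDotSketch {

    /** Plain scalar dot product; both arrays must have the same length. */
    static float dotProduct(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "dimension mismatch: " + a.length + " != " + b.length);
        }
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Returns a unit-length copy of v; rejects the zero vector. */
    static float[] normalize(float[] v) {
        double norm = Math.sqrt(dotProduct(v, v));
        if (norm == 0) {
            throw new IllegalArgumentException("cannot normalize a zero vector");
        }
        float[] out = new float[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = (float) (v[i] / norm);
        }
        return out;
    }

    /** Reference cosine that recomputes both norms on every call. */
    static float cosine(float[] a, float[] b) {
        double normProduct = Math.sqrt((double) dotProduct(a, a) * dotProduct(b, b));
        return (float) (dotProduct(a, b) / normProduct);
    }

    public static void main(String[] args) {
        float[] a = {3f, 4f};
        float[] b = {4f, 3f};
        // Both lines should print approximately the same value.
        System.out.println(cosine(a, b));
        System.out.println(dotProduct(normalize(a), normalize(b)));
    }
}
```

Normalization here happens once per vector (for example, before indexing and once for the query), after which each comparison is a single dot product.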

benwtrent, Mar 14 '24 17:03

> I would rather not change anything related to this enumeration until we figure out: #13182

Got it.

> As an aside, I think cosine as a metric is fairly useless. Folks using Lucene should just normalize everything and use dot product, or use max-inner-product.

IMO, if we think that cosine is useless, we should remove it and guide users to use the metric we believe is correct. Otherwise, we should try to optimize it.

bugmakerrrrrr, Mar 17 '24 14:03

Closing in favour of #13281.

bugmakerrrrrr, Apr 24 '24 12:04