jvector icon indicating copy to clipboard operation
jvector copied to clipboard

RandomAccessScoreProvider with MapRandomAccessVectorValues + non-sequential IDs produces wrong centroid

Open vbekiaris opened this issue 6 months ago • 3 comments

In 97e523c306ae42c3e963484e320fa1c7432b5250 approximateCentroid() implementation for the BuildScoreProvider returned from BuildScoreProvider.randomAccessScoreProvider() was updated to allow for non-sequential node IDs.

However the iteration only takes into account nodes with ID < ravv.size(). This means that if there are actually "holes" in the ID sequence (e.g. add 100 nodes in a MapRandomAccessVectorValues, then remove 10 starting from 0), then some nodes (those with ID >= 90 in the example) will not be taken into account while calculating the centroid.

A fix would probably require changing RandomAccessVectorValues API to expose an iterator or the highest nodeId that is set (or something similar).

Test that demonstrates the issue:

    @Test
    void testRandomAccessScoreProvider() {
        Map<Integer, VectorFloat<?>> chm = new ConcurrentHashMap<>();
        // nodeId's 0..49 have vector [0, 1], 49..99 are [1, 0], so centroid is [0.5, 0.5]
        for (int i = 0; i < 100; i++) {
            chm.put(i, createVector(i < 50));
        }
        RandomAccessVectorValues ravv = new MapRandomAccessVectorValues(chm, 2);
        var bsp = BuildScoreProvider.randomAccessScoreProvider(ravv, VectorSimilarityFunction.COSINE);
        float[] centroid = (float[]) bsp.approximateCentroid().get();
        Assertions.assertArrayEquals(new float[] {0.5f, 0.5f}, centroid);
        // remove even nodeId's. So we have just 50 nodeIds left now and they are no longer sequential
        // centroid however should stay the same at [0.5, 0.5], since we have 25 [0, 1] and 25 [1, 0] vectors
        for (int i = 0; i < 100; i = i+2) {
            chm.remove(i);
        }
        centroid = (float[]) bsp.approximateCentroid().get();
        // fails - calculated centroid is [0, 0.5] because centroid calculation only took into account node IDs < 50
        Assertions.assertArrayEquals(new float[] {0.5f, 0.5f}, centroid);
    }

    VectorFloat<?> createVector(boolean vertical) {
        VectorFloat<?> vectorFloat = VectorizationProvider.getInstance().getVectorTypeSupport().createFloatVector(2);
        if (vertical) {
            vectorFloat.set(0, 0);
            vectorFloat.set(1, 1);
        } else {
            vectorFloat.set(0, 1);
            vectorFloat.set(1, 0);
        }
        return vectorFloat;
    }

vbekiaris avatar Aug 14 '24 09:08 vbekiaris