jvector
jvector copied to clipboard
RandomAccessScoreProvider with MapRandomAccessVectorValues + non-sequential IDs produces wrong centroid
In 97e523c306ae42c3e963484e320fa1c7432b5250 approximateCentroid()
implementation for the BuildScoreProvider
returned from BuildScoreProvider.randomAccessScoreProvider()
was updated to allow for non-sequential node IDs.
However the iteration only takes into account nodes with ID < ravv.size()
. This means that if there are actually "holes" in the ID sequence (e.g. add 100 nodes in a MapRandomAccessVectorValues
, then remove 10 starting from 0), then some nodes (those with ID >= 90 in the example) will not be taken into account while calculating the centroid.
A fix would probably require changing RandomAccessVectorValues
API to expose an iterator or the highest nodeId
that is set (or something similar).
Test that demonstrates the issue:
@Test
void testRandomAccessScoreProvider() {
Map<Integer, VectorFloat<?>> chm = new ConcurrentHashMap<>();
// nodeId's 0..49 have vector [0, 1], 49..99 are [1, 0], so centroid is [0.5, 0.5]
for (int i = 0; i < 100; i++) {
chm.put(i, createVector(i < 50));
}
RandomAccessVectorValues ravv = new MapRandomAccessVectorValues(chm, 2);
var bsp = BuildScoreProvider.randomAccessScoreProvider(ravv, VectorSimilarityFunction.COSINE);
float[] centroid = (float[]) bsp.approximateCentroid().get();
Assertions.assertArrayEquals(new float[] {0.5f, 0.5f}, centroid);
// remove even nodeId's. So we have just 50 nodeIds left now and they are no longer sequential
// centroid however should stay the same at [0.5, 0.5], since we have 25 [0, 1] and 25 [1, 0] vectors
for (int i = 0; i < 100; i = i+2) {
chm.remove(i);
}
centroid = (float[]) bsp.approximateCentroid().get();
// fails - calculated centroid is [0, 0.5] because centroid calculation only took into account node IDs < 50
Assertions.assertArrayEquals(new float[] {0.5f, 0.5f}, centroid);
}
VectorFloat<?> createVector(boolean vertical) {
VectorFloat<?> vectorFloat = VectorizationProvider.getInstance().getVectorTypeSupport().createFloatVector(2);
if (vertical) {
vectorFloat.set(0, 0);
vectorFloat.set(1, 1);
} else {
vectorFloat.set(0, 1);
vectorFloat.set(1, 0);
}
return vectorFloat;
}