
Add Search Centroids Utility Method

Open makosten opened this issue 2 years ago • 1 comments

This addresses a small gap where it is difficult to extract the nearest k centroid labels and distances from an IVF index that has preceding transforms, such as PCA or normalization.

Currently, you can use search_centroid, but it returns only a single centroid per embedding and no distances. You can also search ivf->quantizer directly, but this skips the transform, so you have to apply the transform manually. The apply_chain method of IndexPreTransform is available, but because it returns a pointer to a float array, you get a memory leak.

The new search_centroids method is an expanded version of the ivflib search_centroid method: it accepts a k value and returns distances in addition to centroid labels. The old C++ search_centroid method now wraps the expanded version.

The search_centroids method is replaced in the Python implementation to make it easier to call. I chose not to replace search_centroid, for backward compatibility, because a user may already be calling the SWIG interface directly.

Syntax is: def search_centroids(index, x, k=1, distances=None, labels=None)

If labels and/or distances are omitted, they are allocated based on the number of embeddings and k.

E.g., D, I = faiss.search_centroids(index, x, 15)
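
To make the intended semantics concrete, here is a minimal NumPy sketch of what such a call computes. It is not the code in this PR and not part of the faiss API: search_centroids_reference, its transform argument (a stand-in for the index's pre-transform chain), and the toy centroids are all hypothetical names chosen for illustration, and the sketch assumes squared L2 distance with a brute-force scan over the centroids.

```python
import numpy as np


def search_centroids_reference(centroids, x, k=1, transform=None,
                               distances=None, labels=None):
    """Hypothetical brute-force stand-in, NOT the faiss implementation.

    Returns the k nearest centroids (squared L2) for each row of x,
    after applying an optional transform to x first.
    """
    x = np.asarray(x, dtype=np.float32)
    if transform is not None:
        # Stand-in for the pre-transform chain (e.g. PCA, normalization).
        x = transform(x)
    n = x.shape[0]
    # Allocate the outputs from the number of embeddings and k if omitted.
    if distances is None:
        distances = np.empty((n, k), dtype=np.float32)
    if labels is None:
        labels = np.empty((n, k), dtype=np.int64)
    # Squared L2 distance between every query and every centroid.
    diff = x[:, None, :] - centroids[None, :, :]
    d2 = np.einsum("ncd,ncd->nc", diff, diff)
    # Keep the k closest centroids per query, nearest first.
    order = np.argsort(d2, axis=1, kind="stable")[:, :k]
    labels[:] = order
    distances[:] = np.take_along_axis(d2, order, axis=1)
    return distances, labels


# Toy data: three centroids in 2-D, two query embeddings.
centroids = np.array([[0, 0], [10, 0], [0, 10]], dtype=np.float32)
x = np.array([[1, 2], [9, 1]], dtype=np.float32)

D, I = search_centroids_reference(centroids, x, k=2)
print(D)
print(I)
```

Passing preallocated distances and labels arrays fills them in place, which mirrors the optional arguments in the signature above.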

makosten avatar Oct 05 '22 23:10 makosten

I believe these are the steps for running the clang lint locally (macOS):

brew install clang-format@11
git ls-files | grep -E  '\.(cpp|h|cu|cuh)$' | xargs clang-format-11 -i

makosten avatar Oct 06 '22 02:10 makosten

@makosten would you mind rebasing this PR if it is possible? Thanks!

junjieqi avatar May 30 '24 16:05 junjieqi

Thanks for the contribution but I think it is a trivial addition, so I don't think it brings sufficient value to the library given the number of LOCs of the PR.

mdouze avatar Jul 08 '24 16:07 mdouze