autofaiss
Consider using distributed k-means in distributed mode for better training
https://github.com/facebookresearch/faiss/blob/b8fe92dfee9ea6f9c8cae27e4fc3ffeb12b5c4d2/benchs/distributed_ondisk/README.md#distributed-k-means
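For context, here is a minimal NumPy sketch of the general idea behind distributed k-means: each worker assigns its own shard of vectors to the current centroids and returns only per-centroid sums and counts, and a coordinator aggregates them to update the centroids. This is not the faiss implementation from the linked README; the names `shard_stats` and `distributed_kmeans` are hypothetical, and the workers are simulated here with a plain loop.

```python
import numpy as np


def shard_stats(shard, centroids):
    """Worker step: assign local vectors to the nearest centroid and
    return per-centroid vector sums and counts."""
    # Squared L2 distances, shape (n_local, k)
    d2 = ((shard[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)
    k, dim = centroids.shape
    sums = np.zeros((k, dim), dtype=np.float64)
    np.add.at(sums, assign, shard)  # accumulate vectors per assigned centroid
    counts = np.bincount(assign, minlength=k)
    return sums, counts


def distributed_kmeans(shards, k, n_iter=10, seed=0):
    """Coordinator loop. In a real deployment each shard_stats call would
    run on a different machine; here it is a local loop (assumption)."""
    rng = np.random.default_rng(seed)
    first = shards[0]
    # Initialise centroids from k distinct vectors of the first shard
    picks = rng.choice(len(first), size=k, replace=False)
    centroids = first[picks].astype(np.float64)
    for _ in range(n_iter):
        stats = [shard_stats(s, centroids) for s in shards]
        sums = sum(s for s, _ in stats)
        counts = sum(c for _, c in stats)
        nonempty = counts > 0  # leave empty clusters where they are
        centroids[nonempty] = sums[nonempty] / counts[nonempty, None]
    return centroids
```

Only the `(k, dim)` sums and the `k` counts travel between the workers and the coordinator, so the training set never has to fit on a single machine.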
The https://github.com/facebookresearch/faiss/tree/main/benchs/distributed_ondisk guide is very nice in general. In particular, their concept of vertical slices (what we do with subindices in our merging strategy) vs. hslices (they split the IVF into slices of inverted lists in order to distribute the index) is really interesting for sharding the index across multiple machines. They used that for a 1T-item index POC.
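To make the two slicing directions concrete, here is a toy sketch as I understand them, using plain dictionaries rather than faiss indices. A vertical slice indexes a subset of the vectors against a shared coarse quantizer. A horizontal slice owns a contiguous range of inverted-list ids and gathers those lists from every vertical slice. The function names `build_vslice` and `merge_to_hslices` are hypothetical and do not come from faiss or autofaiss.

```python
import numpy as np


def build_vslice(ids, vectors, centroids):
    """Vertical slice: bucket a subset of the vectors into the inverted
    lists of a coarse quantizer shared by all slices."""
    d2 = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)
    lists = {c: [] for c in range(len(centroids))}
    for vid, c in zip(ids, assign):
        lists[int(c)].append(int(vid))
    return lists


def merge_to_hslices(vslices, nlist, n_hslices):
    """Horizontal slices: each one owns a contiguous range of inverted-list
    ids and collects those lists from every vertical slice."""
    bounds = np.linspace(0, nlist, n_hslices + 1).astype(int)
    hslices = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        merged = {
            c: [vid for vs in vslices for vid in vs[c]]
            for c in range(int(lo), int(hi))
        }
        hslices.append(merged)
    return hslices
```

With this layout, a machine serving one hslice only needs the inverted lists in its own range, so a query probing a given list is routed to the machine that owns it.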