autofaiss
decrease memory used by merging
Currently, merging in distributed mode requires storing the whole index in memory. Possible strategies:
- improve faiss `merge_into` to avoid putting everything in memory
- produce N indices instead of one and let the user search across all of them at search time
https://gist.github.com/mdouze/7331e6fc1da2334f30706b9b9962068b is an example of sharding
https://github.com/criteo/autofaiss/issues/55 may be the same implementation as the sharding one; we would just need to skip the last merge
#29 see the comment there
This is mostly done. It could be improved a bit more by using the `merge_ondisk` function:
```python
import faiss
import numpy as np
from faiss.contrib.ondisk import merge_ondisk

# build an empty index with the same parameters as the shards
empty_index = faiss.read_index("PQ128_index_000")
empty_index.remove_ids(np.arange(0, empty_index.ntotal))
faiss.write_index(empty_index, "PQ128_empty")
empty_index = faiss.read_index("PQ128_empty")

block_fnames = [
    "PQ128_index_000",
    "PQ128_index_001",
]

# merge the shards' inverted lists into merged_index.ivfdata
merge_ondisk(empty_index, block_fnames, "merged_index.ivfdata")
faiss.write_index(empty_index, "populated.index")

pop = faiss.read_index("populated.index")
```
That's how to use merge on disk: once the populated index is created, the merged_index.ivfdata filename is saved into populated.index, so when loading populated.index only the IVF part is loaded and not the codes, and the memory usage is low.
However, the call to the merge_ondisk function itself results in a lot of memory use, so I don't understand the benefit of using merge_ondisk rather than merge_into. I am investigating this in the hope of finding a way to merge an index that is larger than memory, but this doesn't seem to fulfil that objective.
> however, the call to merge_ondisk function results in a lot of memory use
Actually, that is not true! merge_ondisk uses a lot of virtual memory, but almost no resident memory.
So merge_ondisk can indeed be used to merge many IVF indices without using much resident memory.
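This matches how the on-disk inverted lists work: the merged .ivfdata file is memory-mapped, so it consumes address space (virtual memory), but pages only become resident when they are actually read. The effect can be reproduced without faiss at all; a minimal sketch (assuming Linux, where `ru_maxrss` is reported in KiB):

```python
import mmap
import os
import resource
import tempfile

def max_rss_kib():
    # peak resident set size of this process (KiB on Linux)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# create a sparse 256 MiB file: plenty of file size, nothing read yet
fd, path = tempfile.mkstemp()
os.ftruncate(fd, 256 * 1024 * 1024)

before = max_rss_kib()
mm = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)  # maps 256 MiB of address space
after = max_rss_kib()

# resident memory barely moves: no page of the mapping was touched
print(after - before, "KiB of extra resident memory")

mm.close()
os.close(fd)
os.unlink(path)
```

So tools that report virtual size (VSZ) make merge_ondisk look expensive, while resident size (RSS) stays small.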
The only "issue" left is that it then uses 2 files: the (small) populated.index file and the merged_index.ivfdata file. Distributing these files is not great, because of the filename stored inside populated.index.
... is there a way to make merged_index.ivfdata go back inside the index file...
https://github.com/facebookresearch/faiss/blob/b8fe92dfee9ea6f9c8cae27e4fc3ffeb12b5c4d2/benchs/distributed_ondisk/merge_to_ondisk.py looks interesting; basically it might be possible to implement a better merge on disk that doesn't use the merged_index.ivfdata file
Using CombinedIndex https://github.com/facebookresearch/faiss/blob/b8fe92dfee9ea6f9c8cae27e4fc3ffeb12b5c4d2/benchs/distributed_ondisk/combined_index.py#L13 might be a good option as well.
https://github.com/facebookresearch/faiss/wiki/Indexes-that-do-not-fit-in-RAM#on-disk-storage
> https://github.com/facebookresearch/faiss/blob/b8fe92dfee9ea6f9c8cae27e4fc3ffeb12b5c4d2/benchs/distributed_ondisk/merge_to_ondisk.py looks interesting basically might be possible to implement a better merge on disk that doesn't use the merged_index.ivfdata file

No, it is the same as merge_ondisk.
I see no way to put the .ivfdata and .index files back together again. However, `pop = faiss.read_index("a/populated.index", faiss.IO_FLAG_ONDISK_SAME_DIR)` allows reading the 2 files, assuming they are in the same folder.
A) So there are 2 possible solutions to the problem of "how do I use N indices":
- do the merge_ondisk once and distribute both the .index and .ivfdata files
- distribute N .index files, then do the merge_ondisk in the service, which allows using a single index for searching
we could adapt https://github.com/criteo/autofaiss/blob/master/examples/distributed_autofaiss_n_indices.py#L28 to use 2
B) For the problem of "how to do the autofaiss merging without using RAM", merge_ondisk could only be used if a (breaking) change is made to always produce these 2 files instead of 1.
Asked here: https://github.com/facebookresearch/faiss/issues/2244