autofaiss

decrease memory used by merging

Open rom1504 opened this issue 2 years ago • 11 comments

Currently, merging in distributed mode requires storing the whole index in memory. Possible strategies:

  • improve faiss merge_into to avoid putting everything in memory
  • produce N indices instead of one and let the user search all of them at search time
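A minimal sketch of what the second strategy implies at search time: each of the N indices returns its own top-k, and the caller merges them into a global top-k. The (distance, id) pairs and the `merge_shard_results` helper below are hypothetical, and the sketch assumes smaller distances are better and ids are unique across indices.

```python
import heapq


def merge_shard_results(shard_results, k):
    """Merge per-shard top-k lists of (distance, id) into a global top-k.

    Assumes a smaller distance is a better match and that ids are
    globally unique across shards.
    """
    all_hits = (hit for hits in shard_results for hit in hits)
    # nsmallest returns the k best hits sorted by ascending distance.
    return heapq.nsmallest(k, all_hits)


# Hypothetical results for one query, one list per shard index.
shard_a = [(0.10, 3), (0.40, 7), (0.90, 1)]
shard_b = [(0.05, 12), (0.50, 15), (0.70, 11)]

top3 = merge_shard_results([shard_a, shard_b], k=3)
print(top3)
```

With real faiss indices, the per-shard lists would come from calling search on each index with the same query and k.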

rom1504 avatar Feb 11 '22 14:02 rom1504

https://gist.github.com/mdouze/7331e6fc1da2334f30706b9b9962068b example of sharding

rom1504 avatar Feb 11 '22 16:02 rom1504

https://github.com/criteo/autofaiss/issues/55 may use the same implementation as the sharding one; we would just need to skip the last merge

rom1504 avatar Feb 11 '22 16:02 rom1504

#29 see the comment there

This is mostly done. It could be improved a bit more by using the merge_ondisk function

rom1504 avatar Feb 25 '22 08:02 rom1504

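```python
import faiss
import numpy as np
from faiss.contrib.ondisk import merge_ondisk

# Build an empty (but trained) index by removing every vector from the first shard.
# This assumes the ids in that shard are 0..ntotal-1.
empty_index = faiss.read_index("PQ128_index_000")
empty_index.remove_ids(np.arange(0, empty_index.ntotal))
faiss.write_index(empty_index, "PQ128_empty")
empty_index = faiss.read_index("PQ128_empty")

# Shard indices to merge.
block_fnames = [
    "PQ128_index_000",
    "PQ128_index_001",
]

# Merge the shards' inverted lists into merged_index.ivfdata on disk.
merge_ondisk(empty_index, block_fnames, "merged_index.ivfdata")

# Save the small index file that points to merged_index.ivfdata.
faiss.write_index(empty_index, "populated.index")

# Reload the merged index.
pop = faiss.read_index("populated.index")
```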

that's how to use merge on disk

once the populated index is created, the merged_index.ivfdata filename is saved inside populated.index, so when loading populated.index only the ivf part is loaded and not the codes, and the memory usage is low.

however, the call to merge_ondisk function results in a lot of memory use so I don't understand the benefit of using merge_ondisk rather than using merge_into

I am investigating this in the hope of finding a way to merge an index that is larger than memory, but this doesn't seem to fulfil this objective.

rom1504 avatar Mar 07 '22 01:03 rom1504

> however, the call to merge_ondisk function results in a lot of memory use

actually that is not true!! merge_ondisk uses a lot of virtual memory, but almost no resident memory

so it means merge_ondisk can indeed be used to merge many ivf indices while using almost no resident memory
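A small Linux-only sketch of the virtual vs. resident distinction, independent of faiss. It assumes the observation above comes from memory-mapping the on-disk data: mapping a large file grows the process's virtual size right away, while resident memory grows only for the pages actually touched. The `memory_kb` helper is hypothetical and reads /proc/self/status.

```python
import mmap
import tempfile


def memory_kb(field):
    """Return a field such as 'VmSize' or 'VmRSS' from /proc/self/status, in kB."""
    with open("/proc/self/status") as status:
        for line in status:
            if line.startswith(field + ":"):
                return int(line.split()[1])
    raise KeyError(field)


size = 512 * 1024 * 1024  # 512 MiB sparse file standing in for a large .ivfdata

with tempfile.NamedTemporaryFile() as tmp:
    tmp.truncate(size)
    tmp.flush()
    virtual_before = memory_kb("VmSize")
    resident_before = memory_kb("VmRSS")
    with mmap.mmap(tmp.fileno(), size, access=mmap.ACCESS_READ) as mapped:
        first_byte = mapped[0]  # touch a single page only
        virtual_growth = memory_kb("VmSize") - virtual_before
        resident_growth = memory_kb("VmRSS") - resident_before

print(f"virtual grew by {virtual_growth} kB, resident grew by {resident_growth} kB")
```

Tools that report only virtual size would make such a process look memory-hungry even though little of the file is resident.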

the only "issue" left is that it then uses 2 files: the (small) populated.index file and the merged_index.ivfdata file. Distributing these files is not great, because of the filename stored inside populated.index

... is there a way to make merged_index.ivfdata go back inside the index file...

rom1504 avatar Mar 07 '22 01:03 rom1504

https://github.com/facebookresearch/faiss/blob/b8fe92dfee9ea6f9c8cae27e4fc3ffeb12b5c4d2/benchs/distributed_ondisk/merge_to_ondisk.py looks interesting. Basically, it might be possible to implement a better merge on disk that doesn't use the merged_index.ivfdata file

rom1504 avatar Mar 07 '22 01:03 rom1504

using CombinedIndex https://github.com/facebookresearch/faiss/blob/b8fe92dfee9ea6f9c8cae27e4fc3ffeb12b5c4d2/benchs/distributed_ondisk/combined_index.py#L13 might be a good option as well

rom1504 avatar Mar 07 '22 01:03 rom1504

https://github.com/facebookresearch/faiss/wiki/Indexes-that-do-not-fit-in-RAM#on-disk-storage

rom1504 avatar Mar 07 '22 01:03 rom1504

> https://github.com/facebookresearch/faiss/blob/b8fe92dfee9ea6f9c8cae27e4fc3ffeb12b5c4d2/benchs/distributed_ondisk/merge_to_ondisk.py looks interesting. Basically, it might be possible to implement a better merge on disk that doesn't use the merged_index.ivfdata file

no, same as merge_ondisk

rom1504 avatar Mar 07 '22 01:03 rom1504

I see no way to put ivfdata and .index together again. However, pop = faiss.read_index("a/populated.index", faiss.IO_FLAG_ONDISK_SAME_DIR) allows reading the 2 files, assuming they are in the same folder

A) so there are 2 possible solutions to the problem of "how do I use N indices"

  1. doing the merge_ondisk once and distributing both .index and .ivfdata
  2. distributing N .index files, then doing the merge_ondisk in the service, which allows using a single index for searching

we could adapt https://github.com/criteo/autofaiss/blob/master/examples/distributed_autofaiss_n_indices.py#L28 to use option 2
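For comparison, option 1 could look roughly like the sketch below: copy both files into one folder without renaming them (populated.index stores the .ivfdata filename), then read the index with faiss.IO_FLAG_ONDISK_SAME_DIR. The helper names and the placeholder files are hypothetical; only the read_index call with that flag comes from the comment above.

```python
import shutil
import tempfile
from pathlib import Path


def stage_index_files(index_file, ivfdata_file, target_dir):
    """Copy the .index and .ivfdata files into one folder, keeping their names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for source in (index_file, ivfdata_file):
        shutil.copy2(source, target / Path(source).name)
    return target / Path(index_file).name


def load_staged_index(staged_index_file):
    """Read the index, looking for the .ivfdata file in the index's own folder."""
    import faiss  # imported lazily so the staging step works without faiss

    return faiss.read_index(str(staged_index_file), faiss.IO_FLAG_ONDISK_SAME_DIR)


# Demo with placeholder files; real ones would be produced by merge_ondisk.
with tempfile.TemporaryDirectory() as workdir:
    build_dir = Path(workdir) / "build"
    build_dir.mkdir()
    (build_dir / "populated.index").write_bytes(b"placeholder")
    (build_dir / "merged_index.ivfdata").write_bytes(b"placeholder")

    staged = stage_index_files(
        build_dir / "populated.index",
        build_dir / "merged_index.ivfdata",
        Path(workdir) / "a",
    )
    staged_names = sorted(p.name for p in staged.parent.iterdir())
    # With real faiss files: pop = load_staged_index(staged)

print(staged_names)
```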

B) for the problem of "how to do the autofaiss merging without using ram", merge_ondisk could only be used if a (breaking) change is made to always produce these 2 files instead of 1

rom1504 avatar Mar 07 '22 02:03 rom1504

asked there https://github.com/facebookresearch/faiss/issues/2244

rom1504 avatar Mar 08 '22 01:03 rom1504