
multi index ideas

rom1504 opened this issue 2 years ago · 4 comments

  • building one index or a thousand indices from one embedding set costs the same if the training is done once and the grouping happens at read time (this allows one index per strict category)
  • building N index parts and then merging them may make it easier to parallelize reading and building. It could also postpone the memory cost to merge time, which might be beneficial (for example, it unlocks building in many memory-constrained executors and then merging on one big machine afterwards, or maybe even merging with memory mapping so that the merge itself uses no memory)

some info at https://github.com/facebookresearch/faiss/tree/main/benchs/distributed_ondisk and https://github.com/facebookresearch/faiss/wiki/Indexing-1T-vectors and https://github.com/facebookresearch/faiss/blob/151e3d7be54aec844b6328dc3e7dd0b83fcfa5bc/faiss/invlists/OnDiskInvertedLists.cpp

rom1504 avatar Nov 21 '21 18:11 rom1504

the second idea indeed seems like it will work, see https://github.com/facebookresearch/faiss/blob/main/benchs/distributed_ondisk/merge_to_ondisk.py

it seems possible to merge an IVF index without loading it into memory

that means it should be possible to create N index parts in parallel (even on a single machine, to parallelize reading), which will take N * (size of one part + loaded embeddings) in memory (for example 4 * 1GB), and then to merge them at the end. That would reduce the overall memory requirement from the current size of the index + add overhead (e.g. 20GB) to only the in-progress parts (e.g. 4GB), allowing very large indices to be created at constant memory usage

combining with #17 it should allow creating such large indices at low memory usage.

edit: that script doesn't do exactly that; it doesn't merge the codes. Something slightly different would be needed for this use case

rom1504 avatar Nov 21 '21 19:11 rom1504

fyi @hitchhicker (related to what we discussed)

rom1504 avatar Feb 02 '22 11:02 rom1504

The production of an index in distributed mode is now done (see https://github.com/criteo/autofaiss/pull/71)

Follow-ups here could be:

  • Using the merge-on-disk function (https://github.com/criteo/autofaiss/pull/71#issuecomment-1050325064) to decrease memory usage even when producing a single index. With that, the memory needed would be only max(memory_for_training, memory_for_adding_one_index_shard), i.e. around 1GB
  • Considering whether it would be easy/convenient to add an option to sort the embeddings by a given key and compute the split values for that key, implementing a group-by-key that produces one index per key value (the sorted collection and the split values then make it efficient to produce multiple indices). For example, a "category" key with 100 values would result in 100 indices. Alternatively, implement single-stage knn + filtering in faiss.

rom1504 avatar Feb 25 '22 08:02 rom1504

I'm pretty sure that SliceInvertedLists (https://github.com/facebookresearch/faiss/blob/151e3d7be54aec844b6328dc3e7dd0b83fcfa5bc/faiss/invlists/InvertedLists.h#L272 https://github.com/facebookresearch/faiss/blob/151e3d7be54aec844b6328dc3e7dd0b83fcfa5bc/faiss/invlists/InvertedLists.cpp#L321) can be used to create partitioned indices from a big index at no additional memory/disk cost (i.e. the inverted lists are loaded only once and shared between the big index and the partitioned indices)

rom1504 avatar Mar 12 '22 19:03 rom1504