faiss
Merge multiple index files into one [not on_disk]
To improve efficiency, I use multiple processes to build several index files (block_0.index, block_1.index, block_2.index, ...), but I want to merge them into one index that can be loaded into GPU memory later. How could I implement this?
BTW, each index was built with "OPQ64_256,IVF256,PQ32".
See this example code: https://gist.github.com/mdouze/7331e6fc1da2334f30706b9b9962068b
Thanks! Will give you feedback after a try.
The script works like a charm. Thanks a lot. How can I find more details about the Python API? Should I go through the C++ source code or the paper to work out this kind of problem?
I seem to have a funky issue with the merging procedure described in the gist.
This is the code I'm currently using to merge indices:
"""populate an index"""
from tempfile import NamedTemporaryFile

import faiss


def merge_invlists(il_src, il_dest):
    """
    merge inverted lists from two ArrayInvertedLists
    may be added to main Faiss at some point
    From: https://gist.github.com/mdouze/7331e6fc1da2334f30706b9b9962068b
    """
    assert il_src.nlist == il_dest.nlist
    assert il_src.code_size == il_dest.code_size
    for list_no in range(il_src.nlist):
        il_dest.add_entries(
            list_no,
            il_src.list_size(list_no),
            il_src.get_ids(list_no),
            il_src.get_codes(list_no),
        )


def merge_indices(indices, merged_index_name):
    """merge multiple indices into 1"""
    tmp_empty = NamedTemporaryFile()
    # start from an emptied copy of the first index (reset keeps its training)
    empty_index = faiss.read_index(indices[0])
    empty_index.reset()
    faiss.write_index(empty_index, tmp_empty.name)
    empty_index = faiss.read_index(tmp_empty.name)
    ntotal = empty_index.ntotal  # = 0
    # read every index file from disk
    indices_read = []
    for path in indices:
        index = faiss.read_index(path)
        indices_read.append(index)
    # append each index's inverted lists to the merged index
    for i in indices_read:
        merge_invlists(
            faiss.extract_index_ivf(i).invlists,
            faiss.extract_index_ivf(empty_index).invlists,
        )
        ntotal += i.ntotal
    # keep ntotal consistent on both the wrapper and the inner IVF index
    empty_index.ntotal = faiss.extract_index_ivf(empty_index).ntotal = ntotal
    faiss.write_index(empty_index, merged_index_name)
Here, indices is a list of paths to index files.
The issue I'm encountering: given index_1, index_2, and index_3, if I serve them individually, the results are spread across all three. After running the merging procedure I would expect the results to be the same. However, searches on the merged index tend to return only items from index_1 (not from index_2 or index_3).
@mdouze do you have any insights on why this could occur? Do I have to retrain the merged index in order to return the correct result?
Any help would be very much appreciated! Thank you
@rodrigoalmeida94 I am also facing the same issue, i.e. not able to search across multiple indexes. Did you find a solution for this?
Hi, folks, I have the same request: how can I stack/combine/merge several indexes...
I tried faiss.merge_into, but got Error: 'ivf' failed with IndexFlatIP. I also found this PR, which says "Make merge_into support all types of Index", but I still have the same issue after updating faiss to version 1.7.3.
btw, I see [not on_disk] in the header, maybe there is an alternative solution "on disk"?
merge_into is not specific to ondisk, so it should work. Would you mind opening an issue and posting the code that you are using?
@mdouze thanks for your answer, sure, created a new issue here