
How to change nlist value dynamically with incremental update of index?

Open PankajJ08 opened this issue 2 years ago • 6 comments

We load our data in chunks and periodically update the faiss index, because we can't load all the data at once to train. For each chunk we pass the nlist value dynamically, train the index, and add the records. However, I suspect that after all the chunks are processed, nlist depends only on the size of the last chunk, not on the total number of records (ntotal). Is this the correct way to do it?

Running on: CPU

labels: help wanted

Interface: Python

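import math
import faiss

def build_index(chunks, model_size):
    # chunks: iterable of float32 arrays of shape (n, model_size)
    faiss_index = None
    for i, emb in enumerate(chunks):
        n = emb.shape[0]
        nlist = int(4 * math.sqrt(n))  # nlist must be an integer
        if not i:
            # first chunk: create the index
            quantizer = faiss.IndexFlatL2(model_size)
            faiss_index = faiss.IndexIVFFlat(quantizer, model_size, nlist, faiss.METRIC_L2)
            faiss_index.cp.min_points_per_centroid = 5
            faiss_index.nprobe = 4
        else:
            # later chunks: overwrite nlist from the current chunk size
            faiss_index.nlist = nlist
        faiss_index.train(emb)  # train on the current chunk (repeated for every chunk)
        print(faiss_index.ntotal)
        faiss_index.add(emb)  # add the chunk's vectors to the index
        print(faiss_index.ntotal)
    return faiss_index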

PankajJ08, Oct 11 '22 18:10

cc Vikasdubey0551, mdouze: could you please help me with this issue? Thanks.

PankajJ08, Oct 11 '22 19:10

I don't understand.

mdouze, Oct 11 '22 21:10

Do I need to train the index after each incremental update? We update our indexes from time to time. If so, how can I set nlist for each update of the index?

PankajJ08, Oct 12 '22 04:10

You can't retrain an index after adding vectors to it.
https://github.com/facebookresearch/faiss/wiki/FAQ#is-re-training-an-index-supported

mdouze, Oct 12 '22 15:10

So you mean I need to train an index on a sample and save it, then add new data? I can't train on the whole dataset because it's huge and regularly updated. Can I create multiple indexes, train them, and merge them into a single index?

PankajJ08, Oct 13 '22 06:10

Retraining is only useful if there is a shift in the data distribution. Otherwise you can just add to the same trained index. Note that you cannot merge two indexes that were trained differently.

mdouze, Oct 13 '22 09:10