nmslib icon indicating copy to clipboard operation
nmslib copied to clipboard

Possible to continue setting data after building index?

Open FangLinHe opened this issue 9 years ago • 48 comments

Hello,

I really appreciate your great work, which is really efficient and high quality!

However, I am encountering a problem: since my dataset is big (more than 1 million features with 4096 dimension), I would like to build the index only one time, and when the new dataset comes in, I could simply call nmslib.setData(self.index, i, feature) where i would be greater than n specified during initialization index = nmslib.initIndex(n, ...). Is it possible to do that in some way? Thank you very much!

FangLinHe avatar Dec 28 '15 07:12 FangLinHe

Hi, thank you for the kind words.

Are you asking whether it is possible

  1. save/restore index
  2. add more data after the index is created (i.e., incremental updates).

searchivarius avatar Dec 28 '15 15:12 searchivarius

Hello searchivarius,

Thank you for your prompt reply! By that I meant whether it is possible to 2) add more data after the index is created (i.e., incremental updates).

As what I have read in the manual, it seems that some methods have to be re-indexed for new coming data, e.g. small-world graph. But for projection-based methods, I believe it is possible to achieve the goal of incremental updates.

FangLinHe avatar Dec 29 '15 02:12 FangLinHe

@FangLinHe, interestingly sw-graph index IS created incrementally. This is also true for some other methods. Yet, the generic API doesn't allow you to do so. Do you plan to use the code from Python? BTW, we have a branch where two most important indices can be read/written. But, it is not well tested yet.

searchivarius avatar Dec 29 '15 03:12 searchivarius

@searchivarius Oh, that is my fault to misunderstood. I am actually a newbie at this area. Yes, I am using Python API; so you may mean that, in C++ it allows me to incrementally update the index? It would be great if you could give me some hint how to achieve it by modifying the Python API, or even better if you also plan to do so in the future release. :) And great to know that you have a branch being able to read/write indices. I would love to give it a try in the future.

FangLinHe avatar Dec 29 '15 07:12 FangLinHe

@FangLinHe , eehh, actually it wouldn't be possible without rewriting the code quite a bit. I think it's a shame we can't support it yet, given that many methods, in principle, can do it easily. I have added the issue to the release 2.0.

searchivarius avatar Dec 29 '15 08:12 searchivarius

@searchivarius So sad to hear that, but you really did a good job. I will still use it and wait for the release 2.0 to support incremental update! Thank you so much for the discussion and answering my doubts! :)

FangLinHe avatar Dec 31 '15 07:12 FangLinHe

Please, don't close the issue though :-)

searchivarius avatar Dec 31 '15 15:12 searchivarius

Oops, my fault. Sorry!

FangLinHe avatar Dec 31 '15 15:12 FangLinHe

Reopen

FangLinHe avatar Dec 31 '15 15:12 FangLinHe

Reopen

FangLinHe avatar Dec 31 '15 15:12 FangLinHe

@searchivarius I am sorry, it was a bit unclear, could you please clarify, whether appending data to an existing index isn't easily achievable only in the Python API or with C++ directly as well?

funtusov avatar Aug 18 '16 12:08 funtusov

@funtusov Currently, we support only fully static indices, sorry. At the high level, it isn't hard to make extendible data sets, but this would require change quite a few things all the way down.

searchivarius avatar Aug 18 '16 15:08 searchivarius

Looking forward to this feature.

ajkl avatar Sep 07 '16 04:09 ajkl

I have recently added support for incremental addition and/or deletion (all should be done in batches), but it requires using C++ directly.

searchivarius avatar May 10 '17 14:05 searchivarius

I have recently added support for incremental addition and/or deletion (all should be done in batches), but it requires using C++ directly.

searchivarius avatar May 10 '17 14:05 searchivarius

@searchivarius , Great. is there any example to add new data to exist index by c++?

leafjungle avatar Feb 22 '18 06:02 leafjungle

Hi @leafjungle here it is: https://github.com/searchivarius/nmslib/blob/master/test_batch_app/test_batch_mod.cc HNSW, unfortunately, doesn't support batch deletion and addition yet, only SW-graph.

searchivarius avatar Feb 22 '18 14:02 searchivarius

@searchivarius , that means HNSW support deletion and addition one by one now? that is quite useful for online service to add/del data immediately.

leafjungle avatar Feb 24 '18 07:02 leafjungle

@leafjungle not HNSW, SW-graph only. Making HNSW dynamic isn't impossible, but I haven't done this yet.

searchivarius avatar Feb 24 '18 16:02 searchivarius

@searchivarius but even for sw-graph, it seems to be impossible to add new point batch from python binding without calling createIndex. Any idea how to do this without rebuilding the whole index? (again from python binding)

mgwizdala avatar Feb 24 '18 21:02 mgwizdala

@mgwizdala this is correct, Python bindings don't support this.

searchivarius avatar Feb 24 '18 21:02 searchivarius

Hello,

I'm currently trying to implement dynamic insertion in HNSW. After reading the code of class Hnsw<dist_t> a little, I have following question:

What is the purpose of data_rearranged_? Does data_rearranged[i] differ from ElList_[i]->getData()?

I would like to get rid of data_rearranged_ as it prevents me from relocating data_level0_memory (data_rearranged_ becomes invalid after relocation of data_level0_memory).

Sorry if this is not the correct place to ask this question.

baklazan avatar Mar 29 '18 17:03 baklazan

@baklazan there are two types of indices. One that uses data_ only. It is actually dynamic already. You can add elements to data_ and just call the add() function. It's just not visible to outside API. Buuut, there is also an optimized fully-static version of the index, which uses on the data_rerranged. If you set skip_optimized_index flag to 1, this static version of the index is never created.

searchivarius avatar Mar 29 '18 17:03 searchivarius

You can add elements to data_ and just call the add() function.

Yeah, I already did that (I also resized VisitedLists in visitedlistpool). But I'd like to make the optimized index dynamic as well, if possible.

baklazan avatar Mar 29 '18 18:03 baklazan

@baklazan it's ether static and fast or dynamic and somewhat slower. To update the static index incrementally, you have to implement some sort of an efficient merge. I wouldn't bother about this. I think people who need truly and fully dynamic indices can bite the bullet and use a 20-30% slower version of HNSW.

If the retrieval speed is paramount, but you still need to update the data, you would just use a combination of the fast but large static index and a small dynamic one. Periodically, the static index would need to be rebuilt to incorporate updates.

searchivarius avatar Mar 29 '18 18:03 searchivarius

@searchivarius Ok, thank you.

baklazan avatar Mar 29 '18 18:03 baklazan

Are there plans to add this to the Python bindings? Real-time index additions are super valuable. And I'm not sure if I understand correctly: currently only SW-graph supports dynamic indexing, and HNSW might support this later on?

Actually, I also just came across this: https://github.com/nmslib/hnsw/issues/2 - is this the same in nmslib, or is the dynamic updating only supported by the implementation in that repository?

phdowling avatar May 23 '18 09:05 phdowling

@phdowling

Actually, I also just came across this: nmslib/hnsw#2 - is this the same in nmslib, or is the dynamic updating only supported by the implementation in that repository?

It is a reimplementation, optimized for less memory usage, faster construction and incremental indexing. Nmslib's c++ interface also supports incremental insertions for sw-graph and hnsw (only regular indexes). If you only need regular dense spaces with incremental updates I would advice using hnswlib.

yurymalkov avatar May 23 '18 12:05 yurymalkov

Hello @searchivarius and @yurymalkov and thanks for the great library! We tried and tested your library with hnsw and sw-graph methods on large scale text classification sparse datasets (# objects ~ 2 millions, # features ~ 1 million) are were impressed with the results! We would now like to use your method to build efficient SVM multi-class classification iterative method.

Inside this method, on each iteration, we will perform a series of updates (i.e. deletion-insertions) to few weight vectors (i.e. vectors in hnsw index). There are few questions to this approach:

  1. Do you think the performance of the index will degrade significantly if we do a large series of incremental batch updates (like, 100k or 1m updates for index with ~1m vectors)?
  2. Do I understand correctly that in order to experiment with batch insertions and deletions in Python, we will have to implement Python bindings ourselves? Is it at all possible and how much time/effort do you think it would take?

Thanks a lot in advance and thanks for the library again!

AVBelyy avatar Jun 23 '18 21:06 AVBelyy

Hi @AVBelyy https://github.com/nmslib/hnsw supports deletions and insertions, I think. For SW-graph I experimented with deletions and insertions. Performance doesn't seem to deteriorate quickly: I think I have done something like 100K updates for 1M dataset and it was fine. However, not sure about 1M. Depends on updates.

searchivarius avatar Jun 23 '18 22:06 searchivarius