
Adding elements to an index loaded in memory iteratively

Open mehrrsaa opened this issue 6 years ago • 12 comments

Currently, from what I understand of the documentation, we would need to load an index into memory to be able to update the number of max elements and add new elements to it.

My question is: what if the index is already loaded in memory, some process is running iteratively, and we want to append new elements yielded by that process to the already loaded index and update the object? At the moment it seems we need to follow this routine: build the index, save it, load it back (increasing max elements in the load argument), add elements, save, and so on...

This adds load time to the process. I was wondering if there is a way to do this without saving and loading the index, and instead just append to an already in-memory index over and over (and save whenever we want).

Thank you in advance for any help on this!

mehrrsaa avatar Dec 06 '18 19:12 mehrrsaa

Hi @mehrrsaa, I'm not sure what you mean. There is no easy way to merge two indexes. What can be done is an automatic extension of the number of max elements as the index grows.

yurymalkov avatar Dec 07 '18 11:12 yurymalkov

Hi @yurymalkov and thank you for the fast response! Sorry, maybe I can make it clearer with an example:

Considering the example in the hnswlib docs, this is what is done now: we init p, load an already built index into it, and add new elements to it (which is an awesome capability to have, thank you!):

p = hnswlib.Index(space='l2', dim=dim)
p.load_index("first_half.bin", max_elements=num_elements)
p.add_items(data2)

What I am wondering is whether there is a way to grow the index without saving and loading it into memory again. If we keep the index in memory indefinitely, then when a new batch of data comes in and exceeds the previously set "max elements" limit, we would want to do something like:

p.add_items(data3, new_max_element = num_elements + len(data3))

I hope that is clearer this time. My guess is this can't be done, but I want to make sure.

mehrrsaa avatar Dec 07 '18 14:12 mehrrsaa

@mehrrsaa Yes, p.add_items(data3, new_max_element = num_elements + len(data3)) is not available at the moment. But implementing similar functionality is on the TODO list. It will probably be done within a few weeks.

yurymalkov avatar Dec 08 '18 06:12 yurymalkov

Thank you, that would be awesome!

mehrrsaa avatar Dec 10 '18 20:12 mehrrsaa

Hello,

I was wondering if there is still a plan in place to implement this functionality?

Thank you

mehrrsaa avatar Mar 12 '19 19:03 mehrrsaa

Hi @mehrrsaa Yes, it is still in the plans. I am too busy right now, sorry... I will start on it in two weeks.

yurymalkov avatar Mar 13 '19 14:03 yurymalkov

Thank you for getting back to me @yurymalkov, I appreciate it!

mehrrsaa avatar Mar 13 '19 14:03 mehrrsaa

@mehrrsaa Finally done. It is implemented as a manual index resize (resize_index) and is now in the develop branch. Sorry it took that long.
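
For reference, a minimal sketch of how resize_index can be used to grow an in-memory index past its original capacity (untested; the dimension and capacity numbers below are illustrative):

import hnswlib
import numpy as np

dim = 16
p = hnswlib.Index(space='l2', dim=dim)
p.init_index(max_elements=1000, ef_construction=200, M=16)
p.add_items(np.random.rand(1000, dim))

# Grow the capacity in place; no save/load round trip is needed
p.resize_index(2000)
p.add_items(np.random.rand(1000, dim))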

yurymalkov avatar Jun 08 '19 06:06 yurymalkov

@yurymalkov hi, I have one question following on from the previous discussion:

I want to build an index from 2 million samples, and to avoid memory problems I'm reading the data in chunks and adding each chunk to the index in turn. I set max_elements to 2 million from the start. Currently I'm following the example code, saving and loading on every chunk, and it has been working fine:

import hnswlib
import pandas as pd

init = 1
for samples in pd.read_csv(path, chunksize=CHUNK_SIZE):
    index_vemb = hnswlib.Index(space='cosine', dim=args.dim)
    if init == 1:  # first chunk: create a fresh index
        index_vemb.init_index(max_elements=args.vid_cnt, ef_construction=200, M=16)
        init = 0
    else:  # later chunks: load the saved index and append the new data
        index_vemb.load_index(args.model_path)
    index_vemb.add_items(samples['emb'].tolist(), samples['vid'].tolist())
    index_vemb.save_index(args.model_path)
    del index_vemb

I would like to check whether I can skip the saving and loading part, with something like this:

init = 1
for samples in pd.read_csv(path, chunksize=CHUNK_SIZE):
    index_vemb = hnswlib.Index(space='cosine', dim=args.dim)
    if init == 1:  # init
        index_vemb.init_index(max_elements=args.vid_cnt, ef_construction=200, M=16)  # M=16
        init = 0
    index_vemb.add_items(samples['emb'].tolist(), samples['vid'].tolist())
index_vemb.save_index(args.model_path)

Thank you!

Allenlaobai7 avatar Oct 26 '20 09:10 Allenlaobai7

@Allenlaobai7 I am not sure I fully understand. You do not need to reload the index to add elements. I think something like this should work (though I have not tested the code):

index_vemb = hnswlib.Index(space='cosine', dim=args.dim)
index_vemb.init_index(max_elements=args.vid_cnt, ef_construction=200, M=16)  # M=16
for samples in pd.read_csv(path, chunksize=CHUNK_SIZE):
    index_vemb.add_items(samples['emb'].tolist(), samples['vid'].tolist())
index_vemb.save_index(args.model_path)
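
(The key difference from the save/load version above is that the index is constructed and initialized once, outside the loop, so each chunk is appended to the same in-memory index rather than to a fresh, empty one.)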

yurymalkov avatar Oct 27 '20 05:10 yurymalkov

@yurymalkov Thank you for the quick reply. I followed the code from the README and therefore implemented the save-and-load part. I think it makes sense to keep adding items as long as the sample size does not exceed max_elements. Let me test it later to make sure it works.

Allenlaobai7 avatar Oct 27 '20 06:10 Allenlaobai7

Ok. Thanks for the feedback! Didn't think about it... The code was there to demonstrate that you can add elements after loading the index (i.e. the index is fully dynamic). Yes, you can safely add elements until the capacity is reached. And when the capacity is reached, you can use resize_index to increase it (though a more user-friendly way is probably needed).
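
Putting the two together, a rough sketch of chunked loading that grows the index on demand (untested; INITIAL_CAPACITY is a placeholder for whatever starting size you choose, and get_current_count/get_max_elements are the Python binding's introspection helpers):

index_vemb = hnswlib.Index(space='cosine', dim=args.dim)
index_vemb.init_index(max_elements=INITIAL_CAPACITY, ef_construction=200, M=16)
for samples in pd.read_csv(path, chunksize=CHUNK_SIZE):
    needed = index_vemb.get_current_count() + len(samples)
    if needed > index_vemb.get_max_elements():
        # Grow the capacity in place before this batch would overflow it
        index_vemb.resize_index(needed)
    index_vemb.add_items(samples['emb'].tolist(), samples['vid'].tolist())
index_vemb.save_index(args.model_path)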

yurymalkov avatar Oct 27 '20 06:10 yurymalkov