How realistic is serializing the internal C++ AnnoyIndex state?

Open ApproximateIdentity opened this issue 2 years ago • 0 comments

Basically what I would want is to run something like this:

import os
import random

from annoy import AnnoyIndex

num_rows = 10000
num_trees = 10
num_dims = 512

try:
    os.remove("annoy_idx.ann")
except FileNotFoundError:
    pass

annoy_idx = AnnoyIndex(num_dims, "angular")
annoy_idx.on_disk_build("annoy_idx.ann")
for idx in range(num_rows):
    vector = [random.gauss(0, 1) for _ in range(num_dims)]
    annoy_idx.add_item(idx, vector)

annoy_idx.serialize("annoy_idx.state") # XXX - This is the magic I'm looking for

and then, after that program has exited, I would like to continue appending data with something like this (it adds 10,000 new rows with indices 10,000, ..., 19,999):

import os
import random

from annoy import AnnoyIndex

num_rows = 10000
num_trees = 10
num_dims = 512

# Do not delete annoy_idx.ann here: the on-disk build file from the first
# run holds the rows added so far.

annoy_idx = AnnoyIndex(num_dims, "angular")
annoy_idx.deserialize("annoy_idx.state") # XXX - This is the magic I'm looking for
for idx in range(num_rows):
    vector = [random.gauss(0, 1) for _ in range(num_dims)]
    annoy_idx.add_item(idx + num_rows, vector) # XXX - Note the offset added to idx

So basically what I want is for there to be a serialize/deserialize ability so that I can continue the flow. It seems to me like the protected data here would need to be serialized:

https://github.com/spotify/annoy/blob/master/src/annoylib.h#L847-L885

In my case it seems to basically come down to serializing the node here:

https://github.com/spotify/annoy/blob/master/src/annoylib.h#L442-L463

So my question is the following:

How realistic is this? More specifically, assuming that I am able to successfully serialize/deserialize the state, does it seem like this would play well with the mmap in the on_disk_build() step? This is maybe too general a question, but basically my point is: is this totally crazy? Are there obvious flaws with my thinking if I decided to go this route?
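
For comparison, a fallback that avoids the internal C++ state entirely could be sketched as follows: keep the raw vectors in an append-only side file and rebuild the index from it whenever new rows arrive. This is only a sketch under stated assumptions. The helper names (`append_vectors`, `count_vectors`, `iter_vectors`, `rebuild_index`), the file names, and the choice to store rows as 32-bit floats are all hypothetical; the only annoy calls used are `add_item`, `build`, and `save`.

```python
import os
from array import array


def append_vectors(path, vectors, num_dims):
    """Append vectors as 32-bit floats to a flat side file; return rows written."""
    count = 0
    with open(path, "ab") as fh:
        for vector in vectors:
            if len(vector) != num_dims:
                raise ValueError("expected %d dims, got %d" % (num_dims, len(vector)))
            array("f", vector).tofile(fh)
            count += 1
    return count


def count_vectors(path, num_dims):
    """Return the number of complete rows stored, i.e. the next free item index."""
    if not os.path.exists(path):
        return 0
    row_bytes = array("f").itemsize * num_dims
    return os.path.getsize(path) // row_bytes


def iter_vectors(path, num_dims):
    """Yield (index, vector) pairs in the order they were appended."""
    with open(path, "rb") as fh:
        idx = 0
        while True:
            row = array("f")
            try:
                row.fromfile(fh, num_dims)
            except EOFError:
                break  # end of file (a trailing partial row is ignored)
            yield idx, row.tolist()
            idx += 1


def rebuild_index(path, num_dims, num_trees, index_path):
    """Rebuild an index from the side file (hypothetical helper, needs annoy)."""
    from annoy import AnnoyIndex  # imported lazily so the helpers above work alone

    index = AnnoyIndex(num_dims, "angular")
    for idx, vector in iter_vectors(path, num_dims):
        index.add_item(idx, vector)
    index.build(num_trees)
    index.save(index_path)
    return index
```

A second process would call `count_vectors` to find the next free index, append its rows with `append_vectors`, and then call `rebuild_index`. The cost is a full rebuild on every run, which is exactly what a real serialize/deserialize of the internal state would avoid.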

Thanks for any help!

ApproximateIdentity, Nov 13 '22 00:11