How realistic is serializing the internal C++ AnnoyIndex state?
Basically what I would want is to run something like this:
import os
import random
from annoy import AnnoyIndex
num_rows = 10000
num_trees = 10
num_dims = 512
try:
    os.remove("annoy_idx.ann")
except FileNotFoundError:
    pass
annoy_idx = AnnoyIndex(num_dims, "angular")
annoy_idx.on_disk_build("annoy_idx.ann")
for idx in range(num_rows):
    vector = [random.gauss(0, 1) for _ in range(num_dims)]
    annoy_idx.add_item(idx, vector)
annoy_idx.serialize("annoy_idx.state") # XXX - This is the magic I'm looking for
Then, after that program has finished and exited, I would like to continue appending data with something like this (which adds 10,000 new rows with indices 10,000, ..., 19,999):
import os
import random
from annoy import AnnoyIndex
num_rows = 10000
num_trees = 10
num_dims = 512
try:
    os.remove("annoy_idx.ann")
except FileNotFoundError:
    pass
annoy_idx = AnnoyIndex(num_dims, "angular")
annoy_idx.deserialize("annoy_idx.state") # XXX - This is the magic I'm looking for
for idx in range(num_rows):
    vector = [random.gauss(0, 1) for _ in range(num_dims)]
    annoy_idx.add_item(idx + num_rows, vector)  # XXX - Note the increase in idx variable
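For comparison, the only fallback I can see from pure Python is to keep the raw vectors in a sidecar file and replay them into a fresh index in the next process. Here is a rough sketch of that idea. The file format and the helper names `append_vectors` and `replay_vectors` are my own assumptions and not part of annoy; the only thing required of `index` is an `add_item(i, vector)` method like the one used above. Vectors are stored as float32, so values may lose some precision relative to Python floats.

```python
import os
import struct


def append_vectors(path, vectors, num_dims):
    """Append vectors as little-endian float32 rows to a flat binary file.

    Hypothetical sidecar format: one row of num_dims floats per item,
    with the row position in the file acting as the item id.
    """
    row_format = f"<{num_dims}f"
    with open(path, "ab") as fh:
        for vector in vectors:
            if len(vector) != num_dims:
                raise ValueError("vector has wrong dimensionality")
            fh.write(struct.pack(row_format, *vector))


def replay_vectors(path, index, num_dims):
    """Re-add every stored vector to `index` and return the next free item id."""
    row_format = f"<{num_dims}f"
    row_size = struct.calcsize(row_format)
    next_id = 0
    if not os.path.exists(path):
        return next_id
    with open(path, "rb") as fh:
        while True:
            row = fh.read(row_size)
            if len(row) < row_size:
                # End of file (or a truncated trailing row, which is skipped).
                break
            index.add_item(next_id, list(struct.unpack(row_format, row)))
            next_id += 1
    return next_id
```

In the second program this would be used as `next_id = replay_vectors("annoy_idx.vectors", annoy_idx, num_dims)`, followed by `add_item(next_id + idx, vector)` for the new rows. The obvious drawback is that every run re-adds all earlier vectors, which is exactly the cost a real serialize/deserialize would avoid.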
So basically what I want is a serialize/deserialize ability so that a later process can pick up where the previous one left off. It seems to me that the protected data here would need to be serialized:
https://github.com/spotify/annoy/blob/master/src/annoylib.h#L847-L885
In my case, that seems to basically come down to serializing the node defined here:
https://github.com/spotify/annoy/blob/master/src/annoylib.h#L442-L463
So my question is the following:
How realistic is this? More specifically, assuming that I am able to successfully serialize/deserialize the state, does it seem like this would play well with the mmap in the on_disk_build() step? This is maybe too general a question, but basically my point is: is this totally crazy? Are there obvious flaws in my thinking if I decide to go this route?
Thanks for any help!