hnswlib

Loading an index from a stream: is it possible?

Open vinnitu opened this issue 1 year ago • 5 comments

I want to load an index from a network resource, but it fails:

import requests
import io
import pickle
import hnswlib

def get_stream(url):
    response = requests.get(url)
    stream_data = response.content
    return io.BytesIO(stream_data)

model = pickle.load(get_stream('http://example.com/model')) # it works

index = hnswlib.Index(space='cosine', dim=128)
index.load_index(get_stream('http://example.com/index.hnsw')) # doesn't work

I get this error:

TypeError: load_index(): incompatible function arguments. The following argument types are supported:
    1. (self: hnswlib.Index, path_to_index: str, max_elements: int = 0, allow_replace_deleted: bool = False) -> None

Invoked with: <hnswlib.Index(space='cosine', dim=128)>, <_io.BytesIO object at 0x7fd364e557c0>

Is this a reasonable idea?
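The error above shows that `load_index` accepts only a `str` path, so one possible workaround is to copy the stream to a temporary file and pass that file's path instead. The following is a minimal sketch under that assumption; `load_from_stream` and its parameters are hypothetical names, not part of hnswlib.

```python
import os
import tempfile


def load_from_stream(stream, load_fn, suffix=".bin"):
    """Copy a binary stream to a temporary file, then call load_fn(path).

    load_fn is any callable that accepts a filesystem path (str),
    for example the bound method index.load_index.
    """
    # delete=False so the file can be reopened by name on every platform.
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        with tmp:
            # Copy in 1 MiB chunks so a large index is not held in memory twice.
            for chunk in iter(lambda: stream.read(1 << 20), b""):
                tmp.write(chunk)
        # The file is closed here, so load_fn can open it by path.
        return load_fn(tmp.name)
    finally:
        # Always clean up the temporary file, even if loading fails.
        os.remove(tmp.name)


# Usage with the names from the snippet above (needs hnswlib and network access):
#   index = hnswlib.Index(space='cosine', dim=128)
#   load_from_stream(get_stream('http://example.com/index.hnsw'), index.load_index)
```

This costs one extra copy of the index on disk, but it needs no change to the C++ loader.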

vinnitu, Jun 12 '24 12:06

I am not sure, but can we pass io.BytesIO as std::ifstream?

https://github.com/nmslib/hnswlib/blob/3f3429661187e4c24a490a0f148fc6bc89042b3d/hnswlib/bruteforce.h#L152

    void loadIndex(std::ifstream &input, SpaceInterface<dist_t> *s) {  // non-const: read() and close() modify the stream
        std::streampos position;

        readBinaryPOD(input, maxelements_);
        readBinaryPOD(input, size_per_element_);
        readBinaryPOD(input, cur_element_count);

        data_size_ = s->get_data_size();
        fstdistfunc_ = s->get_dist_func();
        dist_func_param_ = s->get_dist_func_param();
        size_per_element_ = data_size_ + sizeof(labeltype);
        data_ = (char *) malloc(maxelements_ * size_per_element_);
        if (data_ == nullptr)
            throw std::runtime_error("Not enough memory: loadIndex failed to allocate data");
                                                             
        input.read(data_, maxelements_ * size_per_element_);
    
        input.close();
    }

vinnitu, Jun 12 '24 12:06

As a first step, split the function into a stream-based loader and a path-based wrapper:

    void loadStream(std::ifstream &input, SpaceInterface<dist_t> *s) {  // non-const: reading modifies the stream
        readBinaryPOD(input, maxelements_);
        readBinaryPOD(input, size_per_element_);
        readBinaryPOD(input, cur_element_count);

        data_size_ = s->get_data_size();
        fstdistfunc_ = s->get_dist_func();
        dist_func_param_ = s->get_dist_func_param();
        size_per_element_ = data_size_ + sizeof(labeltype);
        data_ = (char *) malloc(maxelements_ * size_per_element_);
        if (data_ == nullptr)
            throw std::runtime_error("Not enough memory: loadIndex failed to allocate data");

        input.read(data_, maxelements_ * size_per_element_);
    }
    
    void loadIndex(const std::string &location, SpaceInterface<dist_t> *s) {
        std::ifstream input(location, std::ios::binary);
        std::streampos position;
        loadStream(input, s);
        input.close();
    }

vinnitu, Jun 12 '24 12:06

The same applies to the loader in hnswalg.h:

https://github.com/nmslib/hnswlib/blob/3f3429661187e4c24a490a0f148fc6bc89042b3d/hnswlib/hnswalg.h#L716

vinnitu, Jun 12 '24 12:06

Unfortunately, we can't do the same there directly, because that loader uses .seekg() and .tellg(). (We could simplify the loading code and remove those calls.)

Also, std::ifstream may not be compatible with io.BytesIO, so we might need std::istringstream instead.

What do you think?

vinnitu, Jun 12 '24 12:06

Take a look at #556

drons, Jul 19 '24 19:07