c++ use multiple threads but much slower than python
I use ParallelFor as https://github.com/nmslib/hnswlib/blob/7cc0ecbd43723418f43b8e73a46debbbc3940346/python_bindings/bindings.cpp#L239
// c++ code, add 10000 points, addPoint cost 7580ms
// compile flags: -std=c++11 -g -pipe -W -Wall -fPIC -pthread -Ofast -fwrapv
int d = 256;
hnswlib::labeltype n = 10000;
std::vector<float> data(n * d);
std::mt19937 rng;
rng.seed(47);
std::uniform_real_distribution<> distrib;
for (hnswlib::labeltype i = 0; i < n * d; ++i) {
data[i] = distrib(rng);
}
hnswlib::L2Space space(d);
hnswlib::AlgorithmInterface<float>* alg_hnsw = new hnswlib::HierarchicalNSW<float>(&space, 2 * n);
int num_threads = std::thread::hardware_concurrency();
ParallelFor(0, n, num_threads, [&](size_t row, size_t threadId) {
alg_hnsw->addPoint((void *) data.data() + d * row, (size_t) row);
});
# python code, add 10000 points, add_items cost 310ms
import numpy as np
import hnswlib
import time
dim = 256
num_elements = 10000
data = np.float32(np.random.random((num_elements, dim)))
ids = np.arange(num_elements)
p = hnswlib.Index(space = 'ip', dim = dim)
p.init_index(max_elements = num_elements, ef_construction = 200, M = 16)
start=time.time();p.add_items(data, ids);end=time.time();
Is there anything i missed in c++ code? Why c++ is mush slower?
Hi @bluemandora. One thing that might be missing is that the first element should be added before the parallel segment (this is done to avoid another global lock).
I can also imagine the hyperparameters like ef_construction and M might be different.
Hello all,
Some questions about the parallel_for section. It is parallelized at each query level right, meaning if I have only one data point to add to the the existing hnsw index, it does not benefit from the parallel_for. Same question for search. Also, addPoint() in the parallel_for scenario, is essentially not thread safe right, or there're some writing blocks while each thread is synchronizing the hnsw index.
Thanks,
Jianshu
Hi @jianshu93, Yes, the parallelization is external to individual queries. There are ways to parallelize the search for a query, but, as for my understanding, it is not as efficient in terms of usage of compute resources. The insertion (addPoint) has locks, so insertion/updates are thread-safe.