Index train and add take a long time, but only 1 core is busy
Summary
Index train and add take a long time, but only 1 core is busy. How can I improve the performance of index train and add? I am currently using IndexIVFFlat.
Platform
OS: CentOS Linux release 7.6.1810 (Core), Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
Faiss version: faiss release v1.6.0
Installed from: compiled from source (C++)
Faiss compilation options: -fopenmp
Running on:
- [x] CPU
- [ ] GPU
Interface:
- [x] C++
- [ ] Python
Reproduction instructions
step 1: ((faiss::Index*)index_)->train(train_size, fb);
step 2: ((faiss::Index*)index_)->add(total_line_num, fb);
After the training stage finished and we had obtained the desired cells, we found that the subsequent add stage took too much time. During the add procedure only 1 CPU core was used; the others were idle and total CPU usage was quite low. So I am wondering whether we can add items in parallel after the train stage, or whether there is a pitfall in my usage.
demo: 9 million vectors with dim 512; the training stage took 8 hours and the add procedure took 21 hours!
The code was probably not compiled with openmp. Could you call faiss::check_openmp()
somewhere in the code?
faiss::check_openmp() returns true.
I read the code: in IndexIVFFlat.cpp, add_with_ids calls add_core, and the adds execute serially. In IndexIVF.cpp, add_with_ids is parallelized with OpenMP.
Does this mean IndexIVFFlat can only use 1 core?
Right, this is an inconsistency. I think it's because it's not much faster with more cores. I will mark as enhancement to fix that.
Actually, when Faiss is built with openblasp (not the default BLAS), the train and add procedures can use all the cores. Somewhat weird, but why?
IndexIVFPQ
Right, this is an inconsistency. I think it's because it's not much faster with more cores. I will mark as enhancement to fix that.
The function 'add_core_o' in IndexIVFPQ.cpp has the same issue, and there is a parallelization TODO inside it.
For IndexIVFPQ:
The add_core_o method consists of three main parts: the first computes the IDs, the second (product quantizer code computation) is the actual bottleneck, and the third (adding vectors to the inverted lists) is the part that would need to be parallelized.
By comparing the time of the third part to the second, you can check whether parallelizing it would actually affect the total running time. It turns out the third part accounts for no more than 2% of the total time, as seen in the screenshot; even if we made it run in 0 ms, the improvement would not be noticeable.
So it would not be worthwhile to add complexity to the code without a meaningful gain.