Index train and add cost much time, but only 1 core was busy

Open taozhijiang opened this issue 4 years ago • 7 comments

Summary

Index train and add take a long time, but only 1 core is busy. How can I improve the performance of Index train and add? I currently just use IndexIVFFlat.

Platform

OS: CentOS Linux release 7.6.1810 (Core), Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz

Faiss version: faiss release v1.6.0

Installed from: compile C++

Faiss compilation options: -fopenmp

Running on:

  • [x] CPU
  • [ ] GPU

Interface:

  • [x] C++
  • [ ] Python

Reproduction instructions

step1: ((faiss::Index*)index_)->train(train_size, fb);
step2: ((faiss::Index*)index_)->add(total_line_num, fb);

The training stage finished and produced the desired cells, but the subsequent add stage took far too long. During the add procedure only 1 CPU core was used; the others were idle and the total CPU usage was quite low. So I am wondering whether items can be added in parallel after the train stage, or whether my usage has some pitfall.

demo: with 9 million vectors of dim 512, the training stage took 8h and the add procedure took 21h!

taozhijiang avatar Jan 11 '21 10:01 taozhijiang

The code was probably not compiled with openmp. Could you call faiss::check_openmp() somewhere in the code?

mdouze avatar Jan 12 '21 08:01 mdouze
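
For readers who want to see what such a check involves, here is a minimal stand-alone sketch. It does not call Faiss and is not the implementation of faiss::check_openmp(); the helper names `count_parallel_threads` and `openmp_is_effective` are hypothetical. It only reports whether a parallel region in the current binary really runs on more than one thread.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper (not part of Faiss): returns the number of threads that
// take part in a parallel region, or 1 when the code was built without OpenMP.
int count_parallel_threads() {
    int nthreads = 1;
#ifdef _OPENMP
#pragma omp parallel
    {
        // Exactly one thread records the team size.
#pragma omp single
        nthreads = omp_get_num_threads();
    }
#endif
    return nthreads;
}

// Hypothetical helper: true only if OpenMP was enabled at compile time
// and more than one thread is available at run time.
bool openmp_is_effective() {
    return count_parallel_threads() > 1;
}
```

If `openmp_is_effective()` returns false in a build that was meant to use `-fopenmp`, either the flag did not reach the compiler or the thread count is limited at run time (for example through the `OMP_NUM_THREADS` environment variable).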

> The code was probably not compiled with openmp. Could you call faiss::check_openmp() somewhere in the code?

faiss::check_openmp() returns true.

taozhijiang avatar Jan 12 '21 10:01 taozhijiang

I read the code: in IndexIVFFlat.cpp, add_with_ids calls add_core, and the add actions execute serially. In IndexIVF.cpp, add_with_ids is parallelized with OpenMP.

Does this mean IndexIVFFlat can only use 1 core?

taozhijiang avatar Jan 13 '21 03:01 taozhijiang
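
To illustrate the difference between a serial and a parallel add, below is a minimal sketch of one common way to parallelize insertion into inverted lists without locks: every thread scans all assignments but only appends to the lists it owns. This is an illustrative pattern, not a copy of the Faiss source; `ToyInvertedLists` and `parallel_add` are hypothetical names, and the stored codes are omitted so that only the ids are kept.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Toy inverted file: one vector of ids per list (codes omitted for brevity).
// Hypothetical type for illustration, not a Faiss class.
struct ToyInvertedLists {
    std::vector<std::vector<int64_t>> ids;
    explicit ToyInvertedLists(size_t nlist) : ids(nlist) {}
};

// Append vector i to list assign[i]. Each thread only touches the lists with
// list_no % nt == rank, so no two threads write to the same list and no lock
// is needed. Without OpenMP the pragma is ignored and the loop runs once.
void parallel_add(ToyInvertedLists& invlists,
                  const std::vector<int64_t>& assign) {
    const int64_t nlist = static_cast<int64_t>(invlists.ids.size());
#pragma omp parallel
    {
        int nt = 1, rank = 0;
#ifdef _OPENMP
        nt = omp_get_num_threads();
        rank = omp_get_thread_num();
#endif
        for (size_t i = 0; i < assign.size(); i++) {
            const int64_t list_no = assign[i];
            // Skip invalid assignments and lists owned by another thread.
            if (list_no < 0 || list_no >= nlist) continue;
            if (list_no % nt != rank) continue;
            invlists.ids[list_no].push_back(static_cast<int64_t>(i));
        }
    }
}
```

Because each list has a single owner, the order of ids inside a list is the same as in the serial version. The cost is that every thread reads the whole assignment array, which is cheap compared with copying the vectors themselves.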

Right, this is an inconsistency. I think it's because it's not much faster with more cores. I will mark as enhancement to fix that.

mdouze avatar Apr 02 '21 16:04 mdouze

Actually, when faiss was built with openblasp (not the default blas), the train and add procedures can use all the cores. Somewhat weird, but why?

taozhijiang avatar May 25 '21 10:05 taozhijiang

IndexIVFPQ

> Right, this is an inconsistency. I think it's because it's not much faster with more cores. I will mark as enhancement to fix that.

Function 'add_core_o' in IndexIVFPQ.cpp has the same issue, and there is a parallelize TODO inside.

r00tk1ts avatar Aug 17 '21 09:08 r00tk1ts

For IndexIVFPQ: the add_core_o method consists of three main parts. The first part computes the Ids. The second part (the product quantizer computing the codes) is the actual bottleneck in this method. The third part (adding the vectors to the invlist) is the one that still needs to be parallelized.

Comparing the time of the third part with that of the second shows whether parallelizing it would really affect the total running time of the method. The third part runs in no more than 2% of the total time, as seen in the screenshot, so even if it were improved to run in 0 ms the gain would not be noticeable. It is therefore not worth adding complexity to the code for so little improvement.

[Screenshot 2022-09-13 at 11 21 31 am]

AbdelrahmanElmeniawy avatar Sep 13 '22 14:09 AbdelrahmanElmeniawy