Multi-GPU OOM on 1.7.4 release
Summary
Using the new 1.7.4 release compiled from source, I am getting OOM on GPUs with 240GB total memory, whereas previously on 1.7.2 this was not the case (using only 128GB total, but on a previous GPU generation).
Platform
OS: Ubuntu 20.04-22.04
Faiss version: v.1.7.4 release
Installed from: compiled from source
Faiss compilation options: cmake -B build . -DFAISS_ENABLE_GPU=ON -DFAISS_ENABLE_PYTHON=ON -DBUILD_SHARED_LIBS=OFF -DCMAKE_BUILD_TYPE=Release -DFAISS_OPT_LEVEL=avx2 -DCMAKE_CUDA_ARCHITECTURES="60;70;72;75;80;86"
In a container nvidia/cuda:11.7.1-cudnn8-devel-ubuntu20.04 that also has libopenblas-dev and python3.8.
Running on:
- [ ] CPU
- [x] GPU
Interface:
- [ ] C++
- [x] Python
Reproduction instructions
Using the new release 1.7.4 versus an older version (1.7.2), I am observing higher GPU memory consumption during indexing. Previously, I could fit ~130M vectors of 1024-dimensional floats with an IVF20000,SQ4 index on 8x V100 16GB GPUs (128GB total). With the new release, using 3x A100 80GB (240GB total), I get a GPU OOM. I am using the same configuration:
import faiss

# "trained" is the pre-trained CPU index; "nb_gpus" is the number of GPUs to use
cloner_options = faiss.GpuMultipleClonerOptions()
cloner_options.shard = True  # shard the index across GPUs rather than replicating it
index = faiss.index_cpu_to_all_gpus(trained, cloner_options, ngpu=nb_gpus)
Have some defaults changed in the cloner options between 1.7.2 and 1.7.4 that allocate more memory? Based on the number of vectors and the code size, I would have expected around 71GiB to be necessary. The data itself has not changed, nor have the pre-trained index weights.
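For reference, a rough back-of-the-envelope sketch of that estimate (my own numbers, assuming SQ4 stores 4 bits per dimension plus an 8-byte ID per vector, and ignoring per-list overheads):

n_vectors = 130_000_000        # ~130M vectors
dim = 1024
code_bytes = dim * 4 // 8      # SQ4: 4 bits per dimension -> 512 bytes per code
id_bytes = 8                   # 64-bit ID stored per vector
total_gib = n_vectors * (code_bytes + id_bytes) / 2**30
print(f"~{total_gib:.0f} GiB raw IVF list storage")   # ~63 GiB, same ballpark as the ~71 GiB above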
Thank you.
@wickedfoo any idea why the memory usage between 1.7.2 and 1.7.4 would change?
I tried another experiment, this time running both versions on the same 8x V100 16GB GPUs with identical data and training file.
- 1.7.2 (from here)
- 1.7.4 (compiled from source as described above)
The error is:
RuntimeError: Error in virtual void* faiss::gpu::StandardGpuResourcesImpl::allocMemory(const faiss::gpu::AllocRequest&) at /faiss-1.7.4/faiss/gpu/StandardGpuResources.cpp:452: Error: 'err == cudaSuccess' failed: StandardGpuResources: alloc fail type IVFLists dev 0 space Device stream 0x5654faae4600 size 501432320 bytes (cudaMalloc error out of memory [2])
Admittedly, even the 1.7.2 version was quite close to the limit. However, going to 3x A100 with 240GB doesn't seem to help either.
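One possible diagnostic (a sketch only, not something I have run): since the failing allocation is of type IVFLists, keeping the per-vector IDs in host RAM should shave roughly 8 bytes per vector off device memory and show whether the list storage itself is what grew:

import faiss

cloner_options = faiss.GpuMultipleClonerOptions()
cloner_options.shard = True
# Keep the IVF list IDs on the CPU instead of on the GPUs; search results are
# identical, the ID lookup is just resolved host-side after the GPU search.
cloner_options.indicesOptions = faiss.INDICES_CPU
index = faiss.index_cpu_to_all_gpus(trained, cloner_options, ngpu=nb_gpus)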
Hi @mdouze @wickedfoo, I also compiled 1.7.2 with the same script and options as 1.7.4 above (rather than relying on the pip package, which is unsupported and has a problem with A100, as per the related issue).
I can observe that 1.7.2 also works fine on 4x A100 80GB GPUs; they all sit comfortably at 27GB each (108GB total), which is similar to the usage on 8x V100 16GB (~116GB total).
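For completeness, a minimal sketch of how the per-GPU usage can be checked programmatically (assuming the pynvml package is available; nvidia-smi reports the same numbers):

import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU {i}: {mem.used / 2**30:.1f} GiB used of {mem.total / 2**30:.1f} GiB")
pynvml.nvmlShutdown()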
So indeed, something changed between 1.7.2 and 1.7.4; it is not a GPU architecture issue or a difference in compilation options.
Hello again,
I also compiled 1.7.3. No issues there either, so the difference must be between 1.7.3 and 1.7.4.
Hope that narrows things down further.
I observe a similar trend between 1.6.5 and 1.7.4.
https://github.com/facebookresearch/faiss/issues/3160