index_cpu_to_gpu_multiple slow on first run, fast on subsequent
Summary
On initial load, `faiss.index_cpu_to_gpu_multiple()` takes hours to complete for ~1M vectors. Subsequent calls only take a few minutes, as if something is being cached. This appears to be related to Slow initial copy to GPU #815, but I decided to open a new issue because I'm not compiling C++ and only need the Python bindings.
This index will be served from a Docker container, so the slow initial load is paid on every new container run. From a business standpoint, it is important to limit downtime and make this initial load as fast as possible. Let me know if you have any wisdom/advice to improve the initial performance.
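To put numbers on the cold-vs-warm difference, the clone call can be timed directly with a small stdlib-only wrapper (a generic sketch, nothing faiss-specific):

```python
import time
from functools import wraps

def timed(fn):
    """Print how long fn takes; useful for comparing the first (cold)
    call to index_cpu_to_gpu_multiple against later (warm) calls."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        print(f"{fn.__name__} took {time.perf_counter() - start:.2f}s")
        return result
    return wrapper
```

Wrapping `faiss.index_cpu_to_gpu_multiple` with this and calling it twice in the same process should make the caching effect described above directly measurable.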
Platform
OS: Ubuntu 22.04
GPU: NVIDIA A6000, CUDA 11.7.0
Faiss version: 1.7.3
Installed from: source
Faiss compilation options: See cmake Dockerfile section below
```dockerfile
RUN cd faiss \
    && cmake . \
        -B build \
        -DCUDAToolkit_ROOT=/usr/local/cuda-11.7 \
        -DCMAKE_CUDA_ARCHITECTURES="86" \
        -DCMAKE_CUDA_COMPILER=$(which nvcc) \
        -DFAISS_ENABLE_GPU=ON \
        -DFAISS_ENABLE_PYTHON=ON \
        -DCMAKE_BUILD_TYPE=Release \
        -DBLA_VENDOR=Intel10_64_dyn \
        -DBUILD_TESTING=ON \
    && make -C build -j faiss \
    && make -C build -j swigfaiss \
    && (cd build/faiss/python && python setup.py install)
```
Running on:
- [ ] CPU
- [x] GPU
Interface:
- [ ] C++
- [x] Python
I also observed this. The CUDA runtime is doing a lot of compilation on the first run. I am not sure which nvcc flag would help here. @wickedfoo ?
Quick update here. I noticed that the initial index size does not matter to reproduce this, so I created a quick MRE that takes ~15 min to load on a fresh docker container:
```python
import faiss

class TestLoadFAISS:
    def __init__(self) -> None:
        self.device_ids = [0, 1]
        self.co = faiss.GpuMultipleClonerOptions()
        self.co.shard = True
        self.resources = [faiss.StandardGpuResources() for _ in self.device_ids]
        vres = faiss.GpuResourcesVector()
        vdev = faiss.Int32Vector()
        for i, res in zip(self.device_ids, self.resources):
            vdev.push_back(i)
            vres.push_back(res)
        # This is the slow line:
        index = faiss.index_cpu_to_gpu_multiple(vres, vdev, faiss.IndexFlatL2(1), self.co)
        index.referenced_objects = self.resources

if __name__ == '__main__':
    TestLoadFAISS()
```
Let me know if you'd like my Dockerfile as well.
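One possible mitigation, sketched below under two assumptions: that the build host has GPU access during `docker build` (e.g. via the NVIDIA container toolkit) and that the MRE above is saved as `warmup.py` (a hypothetical filename). Running the slow clone once at image build time could bake the compiled kernels into an image layer instead of paying the compilation cost on every container start:

```dockerfile
# Assumption: warmup.py is the MRE above, and CUDA_CACHE_PATH points inside
# the image so the JIT-compiled kernels persist in a layer.
ENV CUDA_CACHE_DISABLE=0 \
    CUDA_CACHE_PATH=/app/model/.nv/ComputeCache
RUN python warmup.py
```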
I found a similar issue (also mentioned here) that recommends the following:
```shell
export CUDA_CACHE_MAXSIZE=2147483647
export CUDA_CACHE_DISABLE=0
export CUDA_CACHE_PATH="/path/to/.nv/ComputeCache"
```
I added the host's `~/.nv` directory to my Docker image and added the following lines to my Dockerfile, but the initial load does not appear to be any faster:
```dockerfile
ENV CUDA_CACHE_MAXSIZE=2147483647 \
    CUDA_CACHE_DISABLE=0 \
    CUDA_CACHE_PATH=/app/model/.nv/ComputeCache
```
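One way to tell whether these cache settings are taking effect at all is to inspect the cache directory after a warm-up run; if it stays empty, the driver is never persisting the compiled kernels and every container start pays the compilation cost again. A sketch (the `~/.nv/ComputeCache` default path is an assumption):

```python
import os
from pathlib import Path

# Resolve the JIT cache location: CUDA_CACHE_PATH if set, otherwise the
# driver's assumed default of ~/.nv/ComputeCache.
cache_path = Path(os.environ.get("CUDA_CACHE_PATH",
                                 str(Path.home() / ".nv" / "ComputeCache")))
# Count the cached files (empty if the directory does not exist yet).
entries = [p for p in cache_path.rglob("*") if p.is_file()] if cache_path.exists() else []
print(f"{cache_path}: {len(entries)} cached file(s)")
```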
If there are any suggestions, I'd be happy to hear them. Let me know if you have any questions.
I also observed this.
Next step: Document which GPUs the release is tested on & natively compiled for.