index_cpu_to_gpu_multiple slow on first run, fast on subsequent

Open alinneman opened this issue 2 years ago • 4 comments

Summary

On initial load, faiss.index_cpu_to_gpu_multiple() takes hours to complete for an index of ~1M vectors. Subsequent calls take only a few minutes, as if something is being cached. This appears related to Slow initial copy to GPU #815, but I decided to open a new issue because I'm not working in C++ and just need the Python bindings.

The index is going to be served from a Docker container, so every new container pays this slow initial load. From a business standpoint it is important to limit downtime, so I need to make this initial load as fast as possible. Let me know if you have any wisdom/advice to improve the initial performance.

Platform

OS: Ubuntu 22.04

GPU: NVIDIA A6000 CUDA 11.7.0

Faiss version: 1.7.3

Installed from: source

Faiss compilation options: See cmake Dockerfile section below

RUN cd faiss \
    && cmake . \
        -B build \
        -DCUDAToolkit_ROOT=/usr/local/cuda-11.7 \
        -DCMAKE_CUDA_ARCHITECTURES="86" \
        -DCMAKE_CUDA_COMPILER=$(which nvcc) \
        -DFAISS_ENABLE_GPU=ON \
        -DFAISS_ENABLE_PYTHON=ON \
        -DCMAKE_BUILD_TYPE=Release \
        -DBLA_VENDOR=Intel10_64_dyn  \
        -DBUILD_TESTING=ON \
    && make -C build -j faiss \
    && make -C build -j swigfaiss \
    && (cd build/faiss/python && python setup.py install) 
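One diagnostic worth running on a build like this: check whether the compiled library actually embeds SASS (native machine code) for the target GPU — sm_86 for an A6000 — or only PTX, which the driver must JIT-compile on first use. A sketch using `cuobjdump` from the CUDA toolkit; the library path is an assumption based on the build layout above, not something stated in the thread:

```shell
# List embedded native cubins (SASS); sm_86 should appear here if the
# build produced ready-to-run machine code for the A6000.
cuobjdump --list-elf build/faiss/libfaiss.so

# List embedded PTX; kernels that appear only here (and not above) are
# JIT-compiled by the driver on first use, which would match the slow
# first run described in this issue.
cuobjdump --list-ptx build/faiss/libfaiss.so
```

Running the same checks against the installed Python extension module (`_swigfaiss.so`) may also be informative, since that is what the container actually loads.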

Running on:

  • [ ] CPU
  • [x] GPU

Interface:

  • [ ] C++
  • [x] Python

alinneman avatar Feb 15 '23 16:02 alinneman

I have also observed this. The CUDA runtime is doing a lot of compilation on the first run; I am not sure which nvcc flag would help here. @wickedfoo ?

mdouze avatar Feb 15 '23 18:02 mdouze
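For reference on the nvcc side (standard CUDA compilation behavior, not something established in this thread): the `-gencode` flags control whether SASS, PTX, or both are embedded in the binary, and PTX-only fat binaries are exactly what the driver JIT-compiles on first use. A sketch, where `kernel.cu` is a hypothetical source file:

```shell
# Embed native SASS for sm_86 plus PTX for forward compatibility:
nvcc -gencode arch=compute_86,code=sm_86 \
     -gencode arch=compute_86,code=compute_86 \
     -c kernel.cu -o kernel.o

# code=sm_86      -> SASS: runs directly on an A6000, no first-run JIT
# code=compute_86 -> PTX:  JIT-compiled by the driver on first use
```

If only the second `-gencode` line were present, every kernel would be JIT-compiled the first time it runs on a given machine, and the result would be cached under `~/.nv/ComputeCache` by default.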

Quick update here. I noticed that the initial index size does not matter for reproducing this, so I created a quick MRE that takes ~15 min to load on a fresh Docker container even with an empty one-dimensional index:

import faiss

class TestLoadFAISS:
    def __init__(self) -> None:
        self.device_ids = [0, 1]
        self.co = faiss.GpuMultipleClonerOptions()
        self.co.shard = True
        self.resources = [faiss.StandardGpuResources() for _ in self.device_ids]
        
        vres = faiss.GpuResourcesVector()
        vdev = faiss.Int32Vector()
        for i, res in zip(self.device_ids, self.resources):
            vdev.push_back(i)
            vres.push_back(res)
            
        # This is the slow line: even an empty 1-d IndexFlatL2 triggers it
        index = faiss.index_cpu_to_gpu_multiple(vres, vdev, faiss.IndexFlatL2(1), self.co)
        # Keep the GPU resources alive alongside the cloned index
        index.referenced_objects = self.resources

if __name__ == '__main__':
    TestLoadFAISS()

Let me know if you'd like my Dockerfile as well.

I found a similar issue (also mentioned here) that recommends the following:

export CUDA_CACHE_MAXSIZE=2147483647
export CUDA_CACHE_DISABLE=0
export CUDA_CACHE_PATH="/path/to/.nv/ComputeCache"

I copied the host's ~/.nv directory into my Docker image and added the following lines to my Dockerfile, but the initial load does not appear to be any faster:

ENV CUDA_CACHE_MAXSIZE=2147483647 \
    CUDA_CACHE_DISABLE=0 \
    CUDA_CACHE_PATH=/app/model/.nv/ComputeCache
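If the JIT cost cannot be eliminated at compile time, another option is to pay it once at image build time so the populated compute cache ships inside the image. A sketch under two assumptions not covered in this thread: a GPU is visible during `docker build` (e.g. building on the target host with the NVIDIA Container Toolkit configured as the default runtime), and `warmup.py` is a hypothetical script that performs the `index_cpu_to_gpu_multiple()` call once:

```shell
# Dockerfile fragment (sketch): bake the CUDA JIT cache into the image.
ENV CUDA_CACHE_DISABLE=0 \
    CUDA_CACHE_MAXSIZE=2147483647 \
    CUDA_CACHE_PATH=/app/model/.nv/ComputeCache

# warmup.py triggers the slow first-run compilation once at build time;
# its output lands in CUDA_CACHE_PATH and is persisted in the image layer.
RUN python warmup.py
```

The main caveat is that GPU access during `docker build` is not available in most CI environments, and the cache is only valid for the same driver version and GPU architecture it was generated on.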

If there are any suggestions, I'd be happy to hear them. Let me know if you have any questions.

alinneman avatar Feb 24 '23 16:02 alinneman
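One way to tell whether the `CUDA_CACHE_*` variables above are actually taking effect inside the container is to measure the size of the cache directory before and after the first run: a growing directory means JIT results are being persisted, while an empty one suggests the path or environment variables are not being honored. A minimal stdlib-only sketch (the fallback path mirrors the CUDA default cache location):

```python
import os
from pathlib import Path


def cuda_cache_size_bytes(cache_path: str) -> int:
    """Total size in bytes of all files under the CUDA compute cache dir.

    Returns 0 if the directory does not exist, which after a slow first
    run would suggest the cache path is not taking effect.
    """
    root = Path(cache_path)
    if not root.is_dir():
        return 0
    return sum(f.stat().st_size for f in root.rglob("*") if f.is_file())


if __name__ == "__main__":
    # Fall back to the CUDA default location if CUDA_CACHE_PATH is unset.
    path = os.environ.get(
        "CUDA_CACHE_PATH", os.path.expanduser("~/.nv/ComputeCache")
    )
    print(f"{path}: {cuda_cache_size_bytes(path)} bytes")
```

Running this before and after the MRE above (inside the container) shows whether the ~15 min of work is being cached where subsequent container runs can find it.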

I also observed this.

sunxiaojie99 avatar Apr 30 '24 03:04 sunxiaojie99

Next step: document which GPUs each release is tested on and natively compiled for.

asadoughi avatar Jul 02 '24 16:07 asadoughi