[Feature] Consider Alternatives to GPU Resource being Embedded Inside the Index

Open tarang-jain opened this issue 5 months ago • 2 comments

The GPU resources object is part of the index's state, which prevents us from running multi-threaded benchmarks against a single index instance. Furthermore, according to the Faiss docs, StandardGpuResources is not thread-safe. Ideally, there would be a way to run search on the same index with multiple GPU resources from different CPU threads; that would let us measure throughput with a large number of queries.

tarang-jain, Aug 09 '25 01:08

I wonder how we should prioritize this. IIUC this is one of the settings of the NVIDIA benchmarks. If one wants to get the most QPS out of GPU search, wouldn't there be a batching stage that groups multiple queries together before submitting them to the GPU?
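To make the batching idea concrete, here is a minimal sketch of such a stage using only the Python standard library. Everything in it is hypothetical: `QueryBatcher`, `max_batch`, `max_wait_s` and the `search_batch` callable are illustrative names, and `search_batch` merely stands in for whatever batched search call the index exposes. Because a single worker thread is the only caller of `search_batch`, the index and its GPU resources are never touched from more than one thread.

```python
import queue
import threading
import time
from concurrent.futures import Future


class QueryBatcher:
    """Collects single queries from many threads and submits them in batches.

    `search_batch` is a hypothetical callable: it takes a list of queries and
    returns one result per query, in the same order.
    """

    def __init__(self, search_batch, max_batch=64, max_wait_s=0.002):
        self._search_batch = search_batch
        self._max_batch = max_batch
        self._max_wait_s = max_wait_s
        self._pending = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, query):
        """Called from any thread; returns a Future for this query's result."""
        fut = Future()
        self._pending.put((query, fut))
        return fut

    def close(self):
        """Signal the worker to stop after draining what it already holds."""
        self._pending.put(None)
        self._worker.join()

    def _run(self):
        while True:
            item = self._pending.get()  # block until at least one query
            if item is None:
                return
            batch = [item]
            stop = False
            deadline = time.monotonic() + self._max_wait_s
            # Keep collecting until the batch is full or the wait expires.
            while len(batch) < self._max_batch:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    nxt = self._pending.get(timeout=remaining)
                except queue.Empty:
                    break
                if nxt is None:
                    stop = True
                    break
                batch.append(nxt)
            try:
                results = self._search_batch([q for q, _ in batch])
                for (_, fut), res in zip(batch, results):
                    fut.set_result(res)
            except Exception as exc:  # propagate failures to every caller
                for _, fut in batch:
                    fut.set_exception(exc)
            if stop:
                return
```

Client threads would call `submit(query)` and then `result()` on the returned future. The trade-off is latency: each query may wait up to `max_wait_s` for the batch to fill.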

mdouze, Sep 17 '25 06:09

The design of the resource API itself is not thread-friendly and was never meant for multi-threaded usage. For example, what should getBlasHandle(device) do in a multi-threaded case? There is no release mechanism, so there is no way to know whether we should return the handle previously allocated for a device or create a new one. The same goes for the stream API, etc. The resources object could keep a free pool of cuBLAS handles, streams, etc. that are handed out in a thread-safe manner, but we would still need to know when one that was previously granted is no longer in use (and that usage would need to be stream-ordered, etc.), which the code does not currently track. Some of these resources are also used with respect to a given GPU stream.
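As a rough illustration of what an acquire/release pool could look like, here is a plain-Python sketch. It is not the Faiss API: `HandlePool`, `lease` and `make_handle` are hypothetical names, and the "handles" are opaque objects standing in for cuBLAS handles or streams. It only covers the thread-safe hand-out and return; it does not address the stream-ordering concern, since a real handle could only be returned once the GPU work queued on it has completed.

```python
import contextlib
import queue


class HandlePool:
    """Fixed-size pool of opaque handles with explicit acquire/release.

    `make_handle` is a hypothetical factory; in a real setting it would
    create something like a cuBLAS handle or a stream.
    """

    def __init__(self, make_handle, size):
        self._free = queue.Queue()
        for _ in range(size):
            self._free.put(make_handle())

    @contextlib.contextmanager
    def lease(self, timeout=None):
        # Blocks until a handle is free; queue.Queue does the locking.
        handle = self._free.get(timeout=timeout)
        try:
            yield handle
        finally:
            # The release step that getBlasHandle(device) currently lacks.
            self._free.put(handle)
```

A caller would write `with pool.lease() as handle: ...`, so the handle goes back to the pool even if the body raises.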

One option would be to write a different implementation of GpuResources that uses one of your thread-safe temporary GPU memory allocators instead, so that temporary memory allocations could be made and returned in any order (the temporary memory allocator from 2016 in classic Faiss is just a stack, which is freed in the reverse order of allocation). That would still leave issues with reuse of cuBLAS handles, streams, etc., I would suppose. In this case I would think that using a different GpuResources instance for each index (so that each index creates its own cuBLAS handle, etc.) might be the way to achieve this, in lieu of wide code changes to ensure that streams, cuBLAS handles, pinned memory allocations, etc. can be used and freed in a thread-safe manner.
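For the benchmark use case, one way to apply the one-GpuResources-per-index idea would be to give each CPU thread its own index replica, so that no resources object is ever shared between threads. The sketch below is an assumption about how that could be wired up, not something Faiss provides: `PerThreadIndex` and `make_index` are hypothetical names, and `make_index` stands in for a factory that builds a fresh resources object plus a GPU index from it and keeps the two alive together. The obvious cost is that every replica would hold its own copy of the index data on the GPU.

```python
import threading


class PerThreadIndex:
    """Lazily builds one index replica per calling thread.

    `make_index` is a hypothetical zero-argument factory. With Faiss it would
    create a new GPU resources object and a GPU index bound to it, so that
    each thread works with private handles, streams and temporary memory.
    """

    def __init__(self, make_index):
        self._make_index = make_index
        self._local = threading.local()

    def get(self):
        # threading.local gives each thread its own `index` attribute.
        index = getattr(self._local, "index", None)
        if index is None:
            index = self._make_index()
            self._local.index = index
        return index
```

Each benchmark thread would call `get()` once per query (or once up front) and search only on the replica it receives.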

wickedfoo, Sep 24 '25 21:09