
Faiss GPU: improve error information for GPU OOM

Open wickedfoo opened this issue 1 year ago • 5 comments

Summary: This diff updates logging in case of GPU out of memory errors, whether from cudaMalloc directly or from the RAFT allocator. In case of a memory error, allocator state (including an indication of CUDA-reported free memory on the device) is returned as part of the exception message, like this:

C++ exception with description "Error in virtual void *faiss::gpu::StandardGpuResourcesImpl::allocMemory(const faiss::gpu::AllocRequest &) at fbcode/faiss/gpu/StandardGpuResources.cpp:570: StandardGpuResources: Faiss device allocator fail type IVFLists dev 1 space Device stream 0x7fa07623b440 size 1024 bytes
Allocator state:
GPU device 1 allocator state:
==========
Device free memory: 82400968704 bytes
Allocator temp memory remaining: 1610612720
Outstanding Faiss allocations:
Alloc type TemporaryMemoryBuffer: 1 allocations, 1610612736 bytes
Alloc type FlatData: 2 allocations, 59648 bytes
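
Since the allocator state arrives as plain text inside the exception message, a caller who wants the numbers (for metrics or alerting, say) has to pull them out of the string. The following is a minimal C++17 sketch of that, assuming the field labels look like the sample above. `parseByteField` is a hypothetical helper, not part of Faiss, and the message layout is a debugging aid rather than a stable interface, so treat any parsing like this as best-effort.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical helper (not a Faiss API): returns the unsigned integer that
// follows `label` in `message`, or std::nullopt if the label is missing or
// is not followed by digits.
std::optional<std::uint64_t> parseByteField(const std::string& message,
                                            const std::string& label) {
    assert(!label.empty());

    const std::size_t pos = message.find(label);
    if (pos == std::string::npos) {
        return std::nullopt;
    }

    // Skip whitespace between the label and the number.
    std::size_t i = pos + label.size();
    while (i < message.size() &&
           std::isspace(static_cast<unsigned char>(message[i]))) {
        ++i;
    }
    if (i == message.size() ||
        !std::isdigit(static_cast<unsigned char>(message[i]))) {
        return std::nullopt;
    }

    // Accumulate decimal digits.
    std::uint64_t value = 0;
    while (i < message.size() &&
           std::isdigit(static_cast<unsigned char>(message[i]))) {
        value = value * 10 + static_cast<std::uint64_t>(message[i] - '0');
        ++i;
    }
    return value;
}
```

In practice the input would be the `what()` string of the exception caught around the failing Faiss call, with labels such as `"Device free memory:"` or `"Allocator temp memory remaining:"`.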

In the case where Faiss is built using RAFT, previously no error information was provided if the RAFT memory manager had an OOM error, but now it will produce a string similar to the above. The Faiss memory manager (StandardGpuResources) continues to log all allocations made and passed to the RAFT memory manager, so we can also receive an indication of what is allocated and for what purpose.
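
The bookkeeping described above can be pictured as a per-type tally that is updated on every allocation and release, independent of which backend actually provides the memory. The sketch below is a simplified illustration under that assumption; `AllocTracker` and its methods are hypothetical names and do not reflect the actual `StandardGpuResources` implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Hypothetical, simplified tracker (not Faiss code): counts outstanding
// allocations and bytes per allocation type so that a summary can be
// attached to an out-of-memory error message.
class AllocTracker {
public:
    // Record a new allocation of `bytes` for the given type.
    void record(const std::string& type, std::size_t bytes) {
        Stats& s = stats_[type];
        s.count += 1;
        s.bytes += bytes;
    }

    // Record that an allocation of `bytes` for the given type was freed.
    void release(const std::string& type, std::size_t bytes) {
        auto it = stats_.find(type);
        assert(it != stats_.end() && it->second.count > 0 &&
               it->second.bytes >= bytes);
        it->second.count -= 1;
        it->second.bytes -= bytes;
        if (it->second.count == 0) {
            stats_.erase(it);
        }
    }

    // One line per allocation type, sorted by type name.
    std::string summary() const {
        std::ostringstream out;
        out << "Outstanding allocations:\n";
        for (const auto& entry : stats_) {
            out << "Alloc type " << entry.first << ": "
                << entry.second.count << " allocations, "
                << entry.second.bytes << " bytes\n";
        }
        return out.str();
    }

private:
    struct Stats {
        std::size_t count = 0;
        std::size_t bytes = 0;
    };
    std::map<std::string, Stats> stats_;
};
```

Because the tally lives in the wrapper rather than in the underlying allocator, the same summary is available whether the failing request was served by `cudaMalloc` or by another memory manager.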

In addition, this fixes the issue where Faiss GPU would not compile (in fbcode at least) if the USE_NVIDIA_RAFT define was not available. Now the library compiles both with and without RAFT.

Also updated the `#if defined USE_NVIDIA_RAFT` to `#ifdef USE_NVIDIA_RAFT` to better conform to the rest of the GPU code.

This diff also disables the up-front 1.5 GB temporary memory allocation when RAFT is in use, which is the intended behavior when relying on the RAFT memory manager. Apart from that, this diff does not change the runtime behavior of Faiss GPU; its purpose is to make GPU OOM issues in Faiss usage easier to debug.
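
As a rough illustration of a compile-time switch of this kind, the sketch below picks a default temporary-memory size depending on whether `USE_NVIDIA_RAFT` is defined. The function and constant names are hypothetical and the actual Faiss code may be organized differently; the 1.5 GB figure is taken to mean 1536 MiB, which matches the 1610612736-byte `TemporaryMemoryBuffer` in the sample message.

```cpp
#include <cstddef>

// Hypothetical constant: 1.5 GiB, i.e. 1536 MiB expressed in bytes.
constexpr std::size_t kDefaultTempMemBytes =
        std::size_t{1536} * 1024 * 1024;
static_assert(kDefaultTempMemBytes == 1610612736ULL,
              "1.5 GiB should equal 1610612736 bytes");

// Hypothetical helper (not a Faiss API): default size of the temporary
// memory buffer reserved up front on each device.
constexpr std::size_t defaultTempMemoryBytes() {
#ifdef USE_NVIDIA_RAFT
    // With RAFT's memory manager in use, reserve nothing up front.
    return 0;
#else
    // Without RAFT, keep the up-front reservation.
    return kDefaultTempMemBytes;
#endif
}
```

Keeping the switch inside one small function means the rest of the code can call `defaultTempMemoryBytes()` unconditionally and still compile with or without the define.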

Reviewed By: mdouze

Differential Revision: D49260364

wickedfoo, Sep 14 '23 02:09