Training GPUIVFPQ leads to error (failed to cudaMalloc 1.5g, out of memory)

Open chenqingguo opened this issue 3 years ago • 5 comments

Summary

I have 3,500,000 vectors of dimension 512, and my GPU is a T4 with 16 GB of memory. When I train a GpuIndexIVFPQ on this data, I get this error:

Error in void faiss::gpu::allocMemorySpaceV(faiss::gpu::MemorySpace, void**, size_t) at gpu/utils/MemorySpace.cpp:27: Error: 'err == cudaSuccess' failed: failed to cudaMalloc 1610612736 bytes (error 2 out of memory)

The error occurs at the train stage, not the add stage. I'm sure there is enough free memory on my GPU, even though another program is running on the same GPU.
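For scale, the request that failed is far smaller than the data being trained on. The sketch below works through the arithmetic using only figures quoted in this issue: the vector count and dimension from the summary, the byte count from the error message, and the 64 sub-quantizers at 8 bits each from the reproduction code. It is plain standard C++ with no Faiss or CUDA calls, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Figures quoted in the issue.
constexpr std::uint64_t kNumVectors = 3500000ULL;        // training vectors
constexpr std::uint64_t kDim = 512ULL;                   // floats per vector
constexpr std::uint64_t kPqBytesPerCode = 64ULL;         // 64 sub-quantizers x 8 bits
constexpr std::uint64_t kFailedRequest = 1610612736ULL;  // bytes in the error message
constexpr std::uint64_t kBytesPerMiB = 1024ULL * 1024ULL;

static_assert(sizeof(float) == 4, "this sketch assumes 32-bit floats");

// Raw float32 footprint of the whole training set.
constexpr std::uint64_t rawTrainingBytes() {
    return kNumVectors * kDim * sizeof(float);
}

// Footprint of the PQ codes alone once every vector is added
// (vector IDs and inverted-list overhead are not counted).
constexpr std::uint64_t pqCodeBytes() {
    return kNumVectors * kPqBytesPerCode;
}

// Whole MiB contained in a byte count (rounds down).
constexpr std::uint64_t wholeMiB(std::uint64_t bytes) {
    return bytes / kBytesPerMiB;
}

// Prints the three sizes side by side.
void printFootprint() {
    // The failed request is an exact multiple of 1 MiB.
    assert(kFailedRequest % kBytesPerMiB == 0);
    std::printf("raw training set: %llu MiB\n",
                static_cast<unsigned long long>(wholeMiB(rawTrainingBytes())));
    std::printf("PQ codes only:    %llu MiB\n",
                static_cast<unsigned long long>(wholeMiB(pqCodeBytes())));
    std::printf("failed request:   %llu MiB\n",
                static_cast<unsigned long long>(wholeMiB(kFailedRequest)));
}
```

The raw vectors come to 7,168,000,000 bytes (roughly 6.7 GiB) and the PQ codes to about 214 MiB, while the failed request is exactly 1536 MiB, that is 1.5 GiB. A fixed, round size like that points to a preset reservation rather than something derived from the dataset.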

Platform

OS: CentOS 8 with Docker

Faiss version: 1.6.3

Installed from: compiled by myself

Faiss compilation options: using MKL and CUDA 10.0

Running on:

  • GPU

Interface: C++

Reproduction instructions

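// Needs <faiss/gpu/StandardGpuResources.h>, <faiss/gpu/GpuIndexIVFPQ.h> and <omp.h>;
// d, nlist, index, metaId and metaData are defined elsewhere.
auto resource = new faiss::gpu::StandardGpuResources;
GpuIndexIVFPQConfig config;
config.useFloat16LookupTables = true;
// 64 sub-quantizers, 8 bits per code
index = make_shared<GpuIndexIVFPQ>(resource, d, nlist, 64, 8, METRIC_INNER_PRODUCT, config);
omp_set_num_threads(8);
// Single-iteration loop: train() is called once, from inside an OpenMP parallel region
#pragma omp parallel for
for (int i = 0; i < 1; i++)
    index->train(metaId.size(), metaData.data());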

Using OpenMP at the training stage speeds up the training process. I don't know why, but it works.

chenqingguo avatar Mar 15 '22 07:03 chenqingguo

Update: the 1.5 GB of memory (the default is 1.5 GB if your GPU has more than 8 GB of memory) is actually allocated in the index constructor, so the real question is why cudaMalloc fails there, or why resource->initializeForDevice(device) sometimes fails.
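The default mentioned in this update can be sketched as a small lookup. The function below is standard C++ only and makes no Faiss or CUDA calls. Only the "more than 8 GB gives 1.5 GB" tier is confirmed by this thread. The two lower tiers are recalled from the Faiss 1.6.x sources and should be treated as an assumption, and the names `defaultTempMemFor` and `reservationFits` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kMiB = 1024ULL * 1024ULL;
constexpr std::uint64_t kGiB = 1024ULL * kMiB;

// Default temporary-memory reservation for a device with the given total memory.
// ASSUMPTION: the <= 4 GiB and <= 8 GiB tiers are recalled from the Faiss 1.6.x
// sources; only the > 8 GiB tier (1.5 GiB) is stated in this thread.
constexpr std::uint64_t defaultTempMemFor(std::uint64_t totalDeviceBytes) {
    if (totalDeviceBytes <= 4ULL * kGiB) {
        return 512ULL * kMiB;
    }
    if (totalDeviceBytes <= 8ULL * kGiB) {
        return 1024ULL * kMiB;
    }
    return 1536ULL * kMiB;
}

// True when the free memory reported for the device covers the default reservation.
// This is a necessary condition only: another process can still allocate memory
// between this check and the actual cudaMalloc.
constexpr bool reservationFits(std::uint64_t freeBytes, std::uint64_t totalBytes) {
    return freeBytes >= defaultTempMemFor(totalBytes);
}

// A 16 GiB card falls in the top tier, which is the size in the error message.
static_assert(defaultTempMemFor(16ULL * kGiB) == 1610612736ULL,
              "top tier should equal the failed request");
```

In a real program the free and total figures would come from `cudaMemGetInfo`. `StandardGpuResources::setTempMemory` can request a smaller reservation before the index is constructed, though nobody in this thread reports trying it, so whether it avoids the failure when the GPU is shared is untested here.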

chenqingguo avatar Mar 15 '22 09:03 chenqingguo

how much is nlist?

mdouze avatar Mar 15 '22 16:03 mdouze

nlist=128

chenqingguo avatar Mar 16 '22 01:03 chenqingguo

@mdouze The error occurs only occasionally, and only when the GPU is also being used by TensorRT; in that case the memory allocation by the faiss::gpu resources may fail (I'm sure there is enough space). If Faiss is the only program running on the GPU, everything is OK.

chenqingguo avatar Mar 16 '22 02:03 chenqingguo

Hi, I ran into the same problem. Does anyone know how to solve it?

Wsy002 avatar Sep 02 '22 07:09 Wsy002