
Potential Bug with find.lua - Multiple GPUs

Open hashbangCoder opened this issue 8 years ago • 2 comments

Hi, I have no idea how this cropped up, but require 'cudnn' threw an out-of-memory error:

THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-7735/cutorch/init.c line=261 error=2 : out of memory
~/distro/install/share/lua/5.1/trepl/init.lua:389: 
~/distro/install/share/lua/5.1/trepl/init.lua:389: 
~/distro/install/share/lua/5.1/cudnn/find.lua:165: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-7735/cutorch/init.c:261

This is strange because I have 4 GPUs (all Titan X; 2 idle and 2 busy), all detected by cutorch.getDeviceCount(). The error occurs even after explicitly calling cutorch.setDevice() to select an idle device, and after verifying with cutorch.getDevice() and cutorch.getMemoryUsage() that the current GPU is indeed idle.
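The verification steps above look roughly like this (a minimal sketch; the device index 3 is an arbitrary example for one of the idle GPUs):

```lua
require 'cutorch'

-- Enumerate GPUs and select one that is idle (index 3 is an example).
print(cutorch.getDeviceCount())   -- reports 4 on this machine
cutorch.setDevice(3)
print(cutorch.getDevice())        -- confirms device 3 is current

-- Confirm the selected GPU actually has free memory.
local freeMem, totalMem = cutorch.getMemoryUsage(3)
print(freeMem, totalMem)

-- This is the call that unexpectedly fails with "out of memory":
require 'cudnn'
```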

For some weird reason, calling require 'cudnn' sets the current device to a busy one with all of its memory occupied. After digging a little into the traceback, I found that in init.lua, find.reset() is called with cutorch.synchronizeAll() here. In cutorch's init.c, this call cycles through all available GPUs and performs a synchronize() on each. Changing this to cutorch.synchronize() seems to resolve the error, although I don't know if I've broken anything else.
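The change I made is essentially the following (paraphrased, not the literal diff; the actual call sits inside cudnn/find.lua, and as noted below this workaround turned out to break cuBLAS initialization later):

```lua
require 'cutorch'

-- Original behaviour: synchronizes every visible GPU, which forces a
-- CUDA context (and a memory allocation) on the busy devices too.
-- cutorch.synchronizeAll()

-- Workaround: synchronize only the currently selected device, so no
-- context is created on the occupied GPUs.
cutorch.synchronize()
```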

I've tried updating the cudnn, cunn and cutorch modules to the latest versions. Finally, I also tried a fresh install of Torch, to no effect. Please let me know if I'm missing something obvious here.

OS - Ubuntu 14.04
CUDA - 7.5
cuDNN - 5103
GPUs - 4x Nvidia Titan X

The 2 busy GPUs are running TensorFlow, which I think allocates all the memory by default.

EDIT - Making that change to find.lua breaks the code:

 cublas runtime error : library not initialized at /tmp/luarocks_cutorch-scm-1-1387/cutorch/lib/THC/THCGeneral.c:378

Also tried setting CUDA_VISIBLE_DEVICES to a single GPU. This causes a long traceback to be printed:

/tmp/luarocks_cutorch-scm-1-1387/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [53,0,0], thread: [116,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
(the same assertion repeats for threads [117,0,0] through [127,0,0])

~/distro/install/share/lua/5.1/nn/Linear.lua:66: cublas runtime error : library not initialized at /tmp/luarocks_cutorch-scm-1-1387/cutorch/lib/THC/THCGeneral.c:378

hashbangCoder avatar Jan 30 '17 07:01 hashbangCoder

The preferred method for using a subset of GPUs is setting CUDA_VISIBLE_DEVICES; otherwise Torch will try to create a context on all the GPUs, and with the memory on your "busy" GPUs already allocated, that could fail. Setting CUDA_VISIBLE_DEVICES to a single GPU should work. Do you have a repro where it fails? The errors that you have (cublas not initialized) are totally unrelated to cudnn.torch; it looks like something is wrong with the setup. I also suspect that require 'cutorch' would result in the same error.
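For reference, restricting the process to one GPU works like this (device index 2 and the script name are arbitrary examples; CUDA renumbers the visible devices, so inside Torch the chosen GPU appears as device 1):

```shell
# Expose only one physical GPU to the process before Torch initializes.
export CUDA_VISIBLE_DEVICES=2

# Launch the Torch script under this restricted view
# (myscript.lua is a hypothetical name):
# th myscript.lua
echo "$CUDA_VISIBLE_DEVICES"
```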

ngimel avatar Jan 30 '17 17:01 ngimel

Thanks for the quick response.

Setting CUDA_VISIBLE_DEVICES to a single GPU should work.

I did just that. I tried it with each idle GPU (one at a time), which leads to the cublas error.

I also suspect that require 'cutorch' would result in the same error.

The reason I posted this on this repo is that require 'cutorch' or require 'cunn' works just fine. It's require 'cudnn' that is the problem. In fact, I used cutorch to verify the current device and memory usage.

looks like something is wrong with the setup

I have all the paths (CUDA/cuDNN) set correctly. If they were incorrect, then cutorch or cunn shouldn't load either, right? PATH is set with PATH=$PATH:/usr/local/cuda-7.5/bin and LD_LIBRARY_PATH=/home/user/cuda/lib64/:$LD_LIBRARY_PATH

Do you have a repro where it fails?

Not sure what this means. Do you mean an example code/scenario? Just running require 'cutorch'; require 'cunn'; require 'cudnn' causes this error.

Also, I checked again just now when all GPUs are idle, and require 'cudnn' loads without any issues. I only face problems when some GPUs are occupied on the multi-GPU server. Also, setting CUDA_VISIBLE_DEVICES to any single GPU causes a crash (the cublas error above) every time.

EDIT - This recent cutorch issue seems very relevant to mine.

hashbangCoder avatar Jan 30 '17 21:01 hashbangCoder