
Out of Memory

Open FrankCAN opened this issue 6 years ago • 10 comments

Hi, many thanks for your source, it's very helpful. But when I run train.py, I hit an out-of-memory issue, as below:

```
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1,12,402,354,128]...
```

That tensor has 1 x 12 x 402 x 354 x 128 elements; by my calculation that is 172 GB of memory, which seems super big. Does that make sense?

Could you please give some hints on how this works? I'd really appreciate your help.

FrankCAN avatar Apr 24 '18 21:04 FrankCAN

Thanks @FrankCAN, I ran into a similar problem, as I mentioned in the referenced issue Memory usage #7.


Durant35 avatar Apr 25 '18 01:04 Durant35

Hi, does anybody know how to estimate the model's memory usage? @Durant35 As you mentioned, just one tensor of shape [1, 12, 402, 354, 128] would need more than 200 GB of memory; I really don't understand how that can work, so maybe my calculation is wrong. @qianguih How can your 9 GB GPU handle it? Can you give us some hints on how to calculate the model's memory usage? I'd appreciate any help.
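As a quick sanity check (not from this repo): a dense tensor's raw size is just the product of its dimensions times the bytes per element. A minimal NumPy sketch, assuming float32 elements:

```python
import numpy as np

# Raw size of one dense tensor = product of its dimensions x bytes per element.
shape = (1, 12, 402, 354, 128)                   # tensor from the OOM message
bytes_per_elem = np.dtype(np.float32).itemsize   # 4 bytes for float32
size_bytes = int(np.prod(shape, dtype=np.int64)) * bytes_per_elem
print(f"{size_bytes / 1024**3:.3f} GiB")         # ~0.814 GiB, nowhere near 172 GB
```

So a single tensor of this shape is well under 1 GiB; what exhausts a 12 GB card is that training keeps many such activations, their gradients, and workspace buffers alive at the same time.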

FrankCAN avatar Apr 25 '18 16:04 FrankCAN

I am not sure about the memory calculations, but I am running the training script as we speak and it uses just over 9 GB of GPU memory. I had to reduce the batch size to 1, though, as the default of 2 gave me an OOM error as well.
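Besides lowering the batch size, a common TF 1.x mitigation (not something this repo necessarily enables) is letting TensorFlow allocate GPU memory on demand instead of reserving the whole card up front; a minimal sketch:

```python
import tensorflow as tf  # TF 1.x API, matching the era of this repo

# Grow GPU memory on demand rather than pre-allocating the entire card.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    # Trivial op just to show the session runs with this config;
    # the real training graph would be built and run here instead.
    print(sess.run(tf.constant("session with allow_growth=True")))
```

Note that allow_growth does not lower the peak memory the graph needs; it mainly avoids the upfront full-card reservation and makes the actual usage visible in nvidia-smi.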

Attila94 avatar Apr 26 '18 11:04 Attila94

So I have 2 Titan X cards, and running with the default batch size of 2 I still get an out-of-memory exception:

```
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[2,128,12,402,354] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: gpu_0/gradients/gpu_0/MiddleAndRPN_/conv1/Conv3D_grad/Conv3DBackpropInputV2 = Conv3DBackpropInputV2[T=DT_FLOAT, Tshape=DT_INT32, data_format="NDHWC", dilations=[1, 1, 1, 1, 1], padding="VALID", strides=[1, 2, 1, 1, 1], _device="/job:localhost/replica:0/task:0/device:GPU:0"](gpu_0/gradients/gpu_0/MiddleAndRPN_/conv1/Conv3D_grad/Shape, MiddleAndRPN_/conv1/kernel/read, gpu_0/gradients/AddN_67)]]
```

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 00000000:03:00.0  On |                  N/A |
| 27%   63C    P0    75W / 250W |    837MiB / 12211MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN X (Pascal)    Off  | 00000000:04:00.0 Off |                  N/A |
| 28%   42C    P8    10W / 250W |      2MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1113      G   /usr/lib/xorg/Xorg                           548MiB |
|    0      1785      G   compiz                                       170MiB |
|    0      2997      G   ...-token=1BAB76D404C12E1D36B9780644EC736C   112MiB |
|    0      4130      G   /usr/bin/nvidia-settings                       0MiB |
+-----------------------------------------------------------------------------+
```

Really not sure how this could be possible
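One detail in the nvidia-smi dump above: GPU 0 is also driving the desktop (Xorg and compiz, roughly 830 MiB), while GPU 1 sits idle. Not a fix from this repo, but a common workaround is to pin the process to the headless card before TensorFlow initializes CUDA; a minimal sketch:

```python
import os

# Expose only the idle TITAN X (device 1 in nvidia-smi) to this process.
# This must be set before TensorFlow initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import tensorflow as tf  # imported after the env var so it sees only GPU 1
print(tf.test.is_gpu_available())  # TF 1.x check that a GPU is visible
```

Also, a second card only helps if the training script actually splits the batch across devices (the gpu_0/ prefix in the trace suggests per-GPU towers); otherwise each GPU still needs the full per-tower memory on its own.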

kyle-sama avatar Jul 12 '18 05:07 kyle-sama

@FrankCAN Not 200 GB; it is 12 * 402 * 128 * 354 * 4 / 1024**3 ≈ 0.814 GB (4 bytes per float32).

lonlonago avatar Oct 30 '19 03:10 lonlonago

I got the same problem, not solved...

lonlonago avatar Oct 30 '19 03:10 lonlonago

I have a GTX 1070 with 11 GB memory.

- batch size = 2: out of memory
- batch size = 1: OK, uses ~9 GB

jarvis-huang avatar Apr 14 '20 23:04 jarvis-huang

@jarvis-huang Could you please tell me the total time it took to train the model on your configuration with batch size = 1?

kasai2210 avatar Jan 13 '21 07:01 kasai2210

@kasai2210 It has been a while, so I don't remember exactly. I think roughly 1–2 days with the standard configuration from the paper.

jarvis-huang avatar Jan 13 '21 17:01 jarvis-huang

@jarvis-huang Thanks for the estimate!

kasai2210 avatar Jan 13 '21 20:01 kasai2210