
Out of Memory

Open FrankCAN opened this issue 6 years ago • 10 comments

Hi, many thanks for your source, it's very helpful. But when I run train.py, I hit an out-of-memory issue, as below:

```
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1,12,402,354,128]...
```

That tensor has 1 x 12 x 402 x 354 x 128 elements; by my calculation that is 172 GB of memory, which seems super big. Does that make sense?

Could you please give some hints on how this works? I'd really appreciate your help.

FrankCAN avatar Apr 24 '18 21:04 FrankCAN

Thanks @FrankCAN, I ran into a similar problem, as I mentioned in the referenced issue Memory usage #7.


Durant35 avatar Apr 25 '18 01:04 Durant35

Hi, does anybody know how to estimate the model's memory usage? @Durant35 As you mentioned, just one tensor of shape [1, 12, 402, 354, 128] would need more than 200 GB of memory; I really don't understand how that can work, so maybe my calculation is wrong. @qianguih How can your 9 GB GPU handle it? Can you give us some hints on how to calculate the model's memory usage? I'd appreciate any help.
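As a quick sanity check (not from this repo): a dense tensor's raw size is just the product of its dimensions times the bytes per element. A minimal NumPy sketch, assuming float32 elements:

```python
import numpy as np

# Raw size of one dense tensor = product of its dimensions x bytes per element.
shape = (1, 12, 402, 354, 128)                   # tensor from the OOM message
bytes_per_elem = np.dtype(np.float32).itemsize   # 4 bytes for float32
size_bytes = int(np.prod(shape, dtype=np.int64)) * bytes_per_elem
print(f"{size_bytes / 1024**3:.3f} GiB")         # ~0.814 GiB, nowhere near 172 GB
```

So a single tensor of this shape is well under 1 GiB; what exhausts a 12 GB card is that training keeps many such activations, their gradients, and workspace buffers alive at the same time.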

FrankCAN avatar Apr 25 '18 16:04 FrankCAN

I am not sure about the memory calculations, but I am running the training script as we speak and it uses just over 9 GB of GPU memory. I had to reduce the batch size to 1, though, as the default of 2 gave me an OOM error as well.
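Besides lowering the batch size, a common TF 1.x mitigation (not something this repo necessarily enables) is letting TensorFlow allocate GPU memory on demand instead of reserving the whole card up front; a minimal sketch:

```python
import tensorflow as tf  # TF 1.x API, matching the era of this repo

# Grow GPU memory on demand rather than pre-allocating the entire card.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    # Trivial op just to show the session runs with this config;
    # the real training graph would be built and run here instead.
    print(sess.run(tf.constant("session with allow_growth=True")))
```

Note that allow_growth does not lower the peak memory the graph needs; it mainly avoids the upfront full-card reservation and makes the actual usage visible in nvidia-smi.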

Attila94 avatar Apr 26 '18 11:04 Attila94

So I have 2 Titan X cards, and running with the default batch size of 2 I still get an out-of-memory exception:

```
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[2,128,12,402,354] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: gpu_0/gradients/gpu_0/MiddleAndRPN_/conv1/Conv3D_grad/Conv3DBackpropInputV2 = Conv3DBackpropInputV2[T=DT_FLOAT, Tshape=DT_INT32, data_format="NDHWC", dilations=[1, 1, 1, 1, 1], padding="VALID", strides=[1, 2, 1, 1, 1], _device="/job:localhost/replica:0/task:0/device:GPU:0"](gpu_0/gradients/gpu_0/MiddleAndRPN_/conv1/Conv3D_grad/Shape, MiddleAndRPN_/conv1/kernel/read, gpu_0/gradients/AddN_67)]]
```

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 00000000:03:00.0  On |                  N/A |
| 27%   63C    P0    75W / 250W |    837MiB / 12211MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN X (Pascal)    Off  | 00000000:04:00.0 Off |                  N/A |
| 28%   42C    P8    10W / 250W |      2MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1113      G   /usr/lib/xorg/Xorg                           548MiB |
|    0      1785      G   compiz                                       170MiB |
|    0      2997      G   ...-token=1BAB76D404C12E1D36B9780644EC736C   112MiB |
|    0      4130      G   /usr/bin/nvidia-settings                       0MiB |
+-----------------------------------------------------------------------------+
```

Really not sure how this could be possible
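One detail in the nvidia-smi dump above: GPU 0 is also driving the desktop (Xorg and compiz, roughly 830 MiB), while GPU 1 sits idle. Not a fix from this repo, but a common workaround is to pin the process to the headless card before TensorFlow initializes CUDA; a minimal sketch:

```python
import os

# Expose only the idle TITAN X (device 1 in nvidia-smi) to this process.
# This must be set before TensorFlow initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import tensorflow as tf  # imported after the env var so it sees only GPU 1
print(tf.test.is_gpu_available())  # TF 1.x check that a GPU is visible
```

Also, a second card only helps if the training script actually splits the batch across devices (the gpu_0/ prefix in the trace suggests per-GPU towers); otherwise each GPU still needs the full per-tower memory on its own.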

kyle-sama avatar Jul 12 '18 05:07 kyle-sama

@FrankCAN Not 200 GB; it is 12 * 402 * 128 * 354 * 4 / 1024**3 ≈ 0.814 GB (4 bytes per float32).

lonlonago avatar Oct 30 '19 03:10 lonlonago

I got the same problem, not solved...

lonlonago avatar Oct 30 '19 03:10 lonlonago

I have a GTX 1070 with 11 GB memory.

- batch size = 2: out of memory
- batch size = 1: OK, uses ~9 GB

jarvis-huang avatar Apr 14 '20 23:04 jarvis-huang

@jarvis-huang Could you please tell me the total time it took to train the model on your configuration with batch size = 1?

kasai2210 avatar Jan 13 '21 07:01 kasai2210

@kasai2210 It has been a while, so I don't remember exactly. I think roughly 1–2 days with the standard configuration from the paper.

jarvis-huang avatar Jan 13 '21 17:01 jarvis-huang

@jarvis-huang Thanks for the estimate!

kasai2210 avatar Jan 13 '21 20:01 kasai2210