voxelnet
Out of Memory
Hi, many thanks for your source code, it is very helpful. But when I run train.py, I hit an out-of-memory error:

```
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1,12,402,354,128]...
```

By my calculation, the memory = 1 x 12 x 402 x 354 x 128 = 172 GB, which is enormous. I wonder whether that makes sense.
Could you give some hints on how it works? I would really appreciate your help.
Thanks @FrankCAN. I met a similar problem, as I mentioned in the referenced issue Memory usage #7.
Hi, does anybody know how to estimate the model's memory usage? @Durant35 As you mentioned, just one tensor of shape [1, 12, 402, 354, 128] would need more than 200 GB of memory. I really don't understand how it works; maybe my calculation is wrong. @qianguih Why does your 9 GB GPU work fine? Could you give us some hints on how to calculate the model's memory usage? I would really appreciate your help.
I am not sure about the memory calculations, but I am running the training script as we speak and it uses just over 9 GB of GPU memory. I had to reduce the batch size to 1, though, as the default of 2 gave me an OOM error as well.
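One thing that may help when comparing footprints (a generic TF 1.x option, not something specific to this repo): by default TensorFlow reserves nearly the whole card, so `nvidia-smi` does not show what the model actually needs. A minimal sketch of allocating on demand instead:

```python
# Sketch (generic TF 1.x, not repo-specific): allocate GPU memory on demand so
# `nvidia-smi` reports the model's real footprint instead of a full-card reservation.
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Alternatively, hard-cap the fraction of GPU memory the process may take:
# config.gpu_options.per_process_gpu_memory_fraction = 0.9

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    # ... run the training ops here ...
```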
I have 2 Titan X cards, and running with the default batch size of 2, I still get an out-of-memory exception:
```
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[2,128,12,402,354] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: gpu_0/gradients/gpu_0/MiddleAndRPN_/conv1/Conv3D_grad/Conv3DBackpropInputV2 = Conv3DBackpropInputV2[T=DT_FLOAT, Tshape=DT_INT32, data_format="NDHWC", dilations=[1, 1, 1, 1, 1], padding="VALID", strides=[1, 2, 1, 1, 1], _device="/job:localhost/replica:0/task:0/device:GPU:0"](gpu_0/gradients/gpu_0/MiddleAndRPN_/conv1/Conv3D_grad/Shape, MiddleAndRPN_/conv1/kernel/read, gpu_0/gradients/AddN_67)]]
```
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 00000000:03:00.0  On |                  N/A |
| 27%   63C    P0    75W / 250W |    837MiB / 12211MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN X (Pascal)    Off  | 00000000:04:00.0 Off |                  N/A |
| 28%   42C    P8    10W / 250W |      2MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1113      G   /usr/lib/xorg/Xorg                           548MiB |
|    0      1785      G   compiz                                       170MiB |
|    0      2997      G   ...-token=1BAB76D404C12E1D36B9780644EC736C   112MiB |
|    0      4130      G   /usr/bin/nvidia-settings                       0MiB |
+-----------------------------------------------------------------------------+
```
I'm really not sure how this is possible.
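For what it's worth, the nvidia-smi output above shows GPU 0 also driving the display (Xorg, compiz) while GPU 1 sits idle. A minimal sketch of pinning training to the free card, assuming the standard CUDA_VISIBLE_DEVICES mechanism rather than anything repo-specific:

```python
# Sketch: make only the idle card (GPU 1 in the nvidia-smi output above) visible.
# CUDA_VISIBLE_DEVICES must be set before TensorFlow initialises CUDA,
# i.e. before `import tensorflow`.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import tensorflow as tf  # TensorFlow now sees the idle card as /device:GPU:0
```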
@FrankCAN It is not 200 GB: 12 * 402 * 354 * 128 * 4 bytes / 1024**3 ≈ 0.81 GB.
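For reference, a minimal sketch of that arithmetic, assuming a single float32 tensor (4 bytes per element):

```python
# Sketch: bytes needed by one float32 tensor of shape [1, 12, 402, 354, 128].
from functools import reduce
from operator import mul

shape = [1, 12, 402, 354, 128]
n_elements = reduce(mul, shape)        # 218,585,088 elements
size_gb = n_elements * 4 / 1024**3     # 4 bytes per float32

print(round(size_gb, 3))               # ~0.814 GB, nowhere near 172 GB
```

The shape describes one activation tensor, not the whole model: total usage also includes weights, the other layers' activations, and their gradients, which is why training still needs several GB.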
I got the same problem, not solved...
I have a GTX 1070 with 11 GB of memory.
- batch size = 2: out of memory
- batch size = 1: OK, uses ~9 GB
@jarvis-huang Could you please tell me the total time taken to train the model on your configuration with batch size = 1?
@kasai2210 It has been a while, so I don't remember exactly. I think roughly 1-2 days with the normal configuration from the paper.
@jarvis-huang Thanks for the estimate!