FastMaskRCNN

Hello, everyone! Running python train/train.py, I hit this issue: ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape [1,256,160,551]. Is my NVIDIA card out of memory? GPU device: GeForce GTX 1050 Ti, 4.0 GB

Open zhanglijian opened this issue 7 years ago • 9 comments

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape [1,256,160,551]
[[Node: pyramid_1/P2/rpn/convolution = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](pyramid/C2/fusion/BiasAdd, pyramid/P2/rpn/weights/read)]]
[[Node: pyramid_1/AssignGTBoxes/Equal_5/_1175 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_9290_pyramid_1/AssignGTBoxes/Equal_5", tensor_type=DT_BOOL, _device="/job:localhost/replica:0/task:0/cpu:0"]]]

zhanglijian avatar Jun 07 '17 05:06 zhanglijian

I think it is out of memory.

lileiNPU avatar Jun 07 '17 06:06 lileiNPU

Yes, OOM means out of memory. You have a huge number of channels (which I guess you can't change). Reduce the height and width of the input image and try again (use something like 1x48x48xc and gradually increase it to see what your card's limits are).

8 GB is sometimes not enough for segmentation, let alone 4.
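If you want to probe the limit empirically, something like this rough sketch can help (this is not code from this repo; the single 256-filter conv is only a stand-in for one pyramid level, so the real network will give up earlier):

```python
# Hypothetical probe script (TF 1.x): run a 256-filter conv on progressively
# larger inputs until the GPU throws ResourceExhaustedError.
import numpy as np
import tensorflow as tf

images = tf.placeholder(tf.float32, shape=[1, None, None, 3])
conv = tf.layers.conv2d(images, filters=256, kernel_size=3, padding='same')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    size = 48
    while True:
        try:
            sess.run(conv, feed_dict={images: np.zeros((1, size, size, 3), np.float32)})
            print('OK at %dx%d' % (size, size))
            size *= 2
        except tf.errors.ResourceExhaustedError:
            print('OOM at %dx%d' % (size, size))
            break
```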

PavlosMelissinos avatar Jun 07 '17 11:06 PavlosMelissinos

You will need at least 10 GB for the current image height and width settings; if you can't arrange that much, you have to reduce the image dimensions.

blitu12345 avatar Jun 09 '17 07:06 blitu12345

From the original paper https://arxiv.org/pdf/1703.06870.pdf

Our models can run at about 200ms per frame on a GPU, and training on COCO takes one to two days on a single 8-GPU machine

This model runs at 195ms per image on an Nvidia Tesla M40 GPU

Assuming they used the same GPU for inference as for training, the Nvidia Tesla M40 provides 12 GB, so the 8-GPU machine had 96 GB in total.

However, I guess that the memory is not split up among computations, so a single GPU with 12 GB should be enough (but this is just a guess).

kevinkit avatar Jun 09 '17 09:06 kevinkit

@kevinkit @blitu12345 @PavlosMelissinos
Thanks a lot!

zhanglijian avatar Jun 12 '17 10:06 zhanglijian

@kevinkit Can I reduce the number of images to resolve that problem?

zhanglijian avatar Jun 14 '17 12:06 zhanglijian

It won't make any difference.

The problem is that even a single batch won't fit on your gpu. For that reason you need to reduce at least one of your input's shape values: [1,256,160,551]

batch_size is already 1, so it can't be reduced.

You need to reduce either the shape of your input image or the number of channels. Of those numbers, 551 is the weirdest one. What kind of dataset has 551 classes anyway?
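For a sense of scale, that single float32 activation from the error message already costs roughly 86 MB on its own, and the network keeps many such tensors (plus weights, gradients and optimizer state) alive at the same time:

```python
# Rough size of the one activation tensor named in the error (float32 = 4 bytes).
n, h, w, c = 1, 256, 160, 551
bytes_needed = n * h * w * c * 4
print(bytes_needed / 1024 ** 2)  # ~86 MB for this single tensor
```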

PavlosMelissinos avatar Jun 14 '17 14:06 PavlosMelissinos

My GPU has 6 GB of memory. I resized the height and width to 1/2, and it runs well now.
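A minimal sketch of one way to do that in TF 1.x (this is only an illustration; where exactly the resize hooks into FastMaskRCNN's input pipeline is an assumption on my part):

```python
import tensorflow as tf

def halve_image(image):
    """image: a [H, W, C] float32 tensor; returns it downscaled to [H/2, W/2, C]."""
    shape = tf.shape(image)
    new_size = [shape[0] // 2, shape[1] // 2]
    return tf.image.resize_images(image, new_size)
```

If you go this route, the ground-truth boxes and masks have to be scaled by the same factor, otherwise the training targets no longer match the resized image.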

anthony123 avatar Aug 08 '17 02:08 anthony123

@anthony123 I got the same issue with 6 GB of memory. Did you finally get the code to work? How did you do the resize?

NigelC15 avatar Oct 17 '17 04:10 NigelC15