
RuntimeError: CUDA out of memory.

Open yuki-0321 opened this issue 5 years ago • 21 comments

pytorch = 1.1.0. I can print the net and visualize it, but when I run train.py, the program is killed at "seg_out, edge_out = net(input)". I then tried "from thop import profile" to count the model's parameter size and FLOPs, but that also failed: "RuntimeError:module must have its parameters and buffers on device cuda:0 but found one of them on device:cpu". I then specified the device as 'cuda:0', but the error persists.
So I would like to know how to solve these errors. Can anyone tell me the params and FLOPs of the GSCNN net? In other words, how much memory is needed to run it?
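
The device error quoted above is the message PyTorch raises when a model wrapped in torch.nn.DataParallel has parameters on a different device than its first GPU, and a plain parameter count does not need thop at all. The following is a minimal sketch, assuming a tiny stand-in model (toy_net is hypothetical, not the GSCNN network): it places the model and the input on the same device and counts parameters with plain PyTorch. Parameter memory is only a lower bound; during training, activations, gradients, and optimizer state usually take far more.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in model; swap in the real network to measure it.
toy_net = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
)

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
toy_net = toy_net.to(device)                   # parameters and buffers move together
x = torch.randn(1, 3, 64, 64, device=device)  # input created on the same device

# Count learnable parameters and estimate their float32 footprint.
n_params = sum(p.numel() for p in toy_net.parameters())
param_mib = n_params * 4 / 2**20  # 4 bytes per float32 parameter

with torch.no_grad():
    out = toy_net(x)

print(f"parameters: {n_params}, about {param_mib:.4f} MiB, output shape {tuple(out.shape)}")
```
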

yuki-0321 avatar Oct 25 '19 09:10 yuki-0321

Have you solved your problem? I encountered the same error: RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 7.93 GiB total capacity; 6.91 GiB already allocated; 51.88 MiB free; 122.42 MiB cached). If you solve it, could you please tell me the solution? Thanks!

cfanfan avatar Nov 01 '19 01:11 cfanfan

@yuki-0321

cfanfan avatar Nov 01 '19 01:11 cfanfan

Hi, I resized the input img to (64, 64) (in cityscapes.py), and then train.py could run. You can try this.

yuki-0321 avatar Nov 01 '19 02:11 yuki-0321

Hi, I resized the input img to (64, 64) (in cityscapes.py), and then train.py could run. You can try this.

Thank you so much! I tried to modify it the way you said, but it did not work; maybe my change is not right. Could you please tell me where and how you made the modification? Thanks again.

cfanfan avatar Nov 15 '19 01:11 cfanfan

Hi, in cityscapes.py, find the inputs 'img' and 'mask'; you can use the resize method from the PIL library on them.

yuki-0321 avatar Nov 15 '19 08:11 yuki-0321

Hi, in cityscapes.py, find the inputs 'img' and 'mask'; you can use the resize method from the PIL library on them.

thank you! thank you! thank you!

cfanfan avatar Nov 19 '19 01:11 cfanfan

Hi, how big is your GPU memory? Also, regarding "the input 'img' and 'mask', you can use the resize method in the PIL library": at which line did you add these two statements?

cfanfan avatar Nov 19 '19 08:11 cfanfan

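Hi, how big is your GPU memory? Also, regarding "the input 'img' and 'mask', you can use the resize method in the PIL library": at which line did you add these two statements?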

Did you solve this problem? I got the same problem

shifangtian avatar Nov 27 '19 02:11 shifangtian

My GPU memory is almost 80G. I added these two statements in def __getitem__, after 'img' and 'mask' are loaded.
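
A minimal sketch of the resize described above, assuming 'img' and 'mask' are PIL images at that point in __getitem__. The helper name resize_pair and the dummy images are hypothetical and only illustrate the call. Nearest-neighbour resampling is used for the mask so that class ids are not blended into values that do not correspond to any class.

```python
from PIL import Image


def resize_pair(img, mask, size=(64, 64)):
    """Resize an image and its label mask to the same (width, height)."""
    img = img.resize(size, Image.BILINEAR)   # smooth interpolation for RGB pixels
    mask = mask.resize(size, Image.NEAREST)  # nearest neighbour keeps class ids intact
    return img, mask


# Dummy inputs standing in for a Cityscapes-sized image and its label mask.
img = Image.new("RGB", (2048, 1024))
mask = Image.new("L", (2048, 1024), color=7)

small_img, small_mask = resize_pair(img, mask)
print(small_img.size, small_mask.size)
```

Shrinking inputs this far reduces memory use sharply, but it also discards most of the detail the network is meant to segment, so it is better suited to checking that training runs than to reproducing results.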

yuki-0321 avatar Nov 27 '19 10:11 yuki-0321

You can just change '--crop_size' in train.py. I can train the net with '--bs_mult=3', '--crop_size=336', '--bs_mult_val=1' (GPU memory 8G).
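
As a rough illustration of why a smaller crop helps: for a convolutional network, activation memory grows approximately in proportion to the number of input pixels per batch. The sketch below compares the settings from this comment against a baseline of one 720-pixel crop per GPU; that baseline is an assumption made for illustration, not a value confirmed in this thread.

```python
def pixels_per_batch(batch_size, crop_size):
    """Number of input pixels in one per-GPU batch of square crops."""
    return batch_size * crop_size * crop_size


# Assumed baseline for illustration only (not taken from this thread).
baseline = pixels_per_batch(batch_size=1, crop_size=720)
# Settings reported in the comment above: --bs_mult=3, --crop_size=336.
reduced = pixels_per_batch(batch_size=3, crop_size=336)

ratio = reduced / baseline
print(f"{reduced} vs {baseline} pixels per batch, ratio {ratio:.2f}")
```
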

HAOCHENYE avatar Dec 04 '19 17:12 HAOCHENYE

I also hit the CUDA out-of-memory problem. I then changed the architecture with '--trunk resnet50' and dropped '--snapshot checkpoints/best_cityscapes_checkpoint.pth'. I also used the parameters suggested by @HAOCHENYE. Training seems to start, but it is very slow, because I changed args.num_workers from 4 to 1 in __init__.py since I only have one GPU card (Titan Xp, 12G memory). Training seems to consume around 7G. I'm still observing the training process and am not sure whether this counts as training from scratch. Below is what the validation console output looks like:
12-10 08:06:15.581 validating: 1 / 500
12-10 08:08:31.224 validating: 21 / 500
12-10 08:10:43.240 validating: 41 / 500
12-10 08:12:57.430 validating: 61 / 500
12-10 08:15:09.786 validating: 81 / 500
12-10 08:17:23.038 validating: 101 / 500
12-10 08:19:35.823 validating: 121 / 500
12-10 08:21:49.924 validating: 141 / 500
12-10 08:24:01.765 validating: 161 / 500
12-10 08:26:15.050 validating: 181 / 500
12-10 08:28:28.284 validating: 201 / 500
12-10 08:30:42.002 validating: 221 / 500
12-10 08:32:53.939 validating: 241 / 500

paul-adlink avatar Dec 10 '19 08:12 paul-adlink

Actually, this project only supports wideresnet, because gscnn.py contains no code for resnet50 or resnet101 (although those nets are defined in the network files).

HAOCHENYE avatar Dec 10 '19 14:12 HAOCHENYE

@HAOCHENYE Thanks for pointing that out; indeed, it cannot train with resnet50 or resnet101 in the current version. Have you successfully trained with wideresnet on a single GPU? (The author seems to have mentioned that it cannot be trained on a single GPU card.)

paul-adlink avatar Dec 11 '19 01:12 paul-adlink

Set --syncbn=False in train.py. I also deleted some data augmentation and changed the image transformation, because with the released code the loss is hard to converge on a single GPU card. I'm still training the net, and the loss seems to be converging.

HAOCHENYE avatar Dec 11 '19 04:12 HAOCHENYE

Thank you very much for your reply. Did you use Resnet50 to train from scratch instead of using the original pre-trained checkpoint? How do I modify the code if I start training from scratch? Can you share this part of your code with me? I am a beginner; can you help me? By the way, can you share your environment configuration other than GPU memory?


shifangtian avatar Dec 13 '19 08:12 shifangtian

It's a pity that my GPU memory is only 11G.


shifangtian avatar Dec 13 '19 08:12 shifangtian

@shifangtian You can follow my repo https://github.com/HAOCHENYE/GSCNN-for-Single-GPU This net is very hard to train on a single GPU; my mean IoU only reached 0.4084.
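
For readers unfamiliar with the metric quoted above, here is a minimal sketch of how mean IoU is commonly computed from a confusion matrix. The function name and the toy two-class matrix are hypothetical and are not taken from the GSCNN evaluation code.

```python
def mean_iou(confusion):
    """Mean intersection-over-union from a square confusion matrix.

    confusion[i][j] counts pixels of true class i predicted as class j.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                          # true class c, predicted otherwise
        fp = sum(confusion[r][c] for r in range(n)) - tp     # predicted c, true class differs
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both prediction and ground truth
            ious.append(tp / denom)
    return sum(ious) / len(ious)


# Toy two-class example: rows are true classes, columns are predictions.
toy = [[50, 10],
       [5, 35]]
print(round(mean_iou(toy), 4))  # 0.7346
```
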

HAOCHENYE avatar Dec 14 '19 06:12 HAOCHENYE

Set --syncbn=False in train.py. I also deleted some data augmentation and changed the image transformation, because with the released code the loss is hard to converge on a single GPU card. I'm still training the net, and the loss seems to be converging.

@HAOCHENYE Thanks!!! I will try it. I think it can train (though it might meet the same convergence problem you have now), and of course I will reduce the augmentation to limit the memory usage!

paul-adlink avatar Dec 16 '19 04:12 paul-adlink

Thank you very much for your reply. Did you use Resnet50 to train from scratch instead of using the original pre-trained checkpoint? How do I modify the code if I start training from scratch? Can you share this part of your code with me? I am a beginner; can you help me? By the way, can you share your environment configuration other than GPU memory?

@HAOCHENYE mentioned that ResNet50/101 is not implemented in this project as of 2019/12/16, and I found the same. For the settings, you can refer to the earlier comments in this thread by @HAOCHENYE and me. Good luck!

paul-adlink avatar Dec 16 '19 04:12 paul-adlink

As stated in the README, to reproduce the numbers in the paper you need at least 8 GPUs. I would recommend at least 8 * 16 GB (I reproduced the numbers with 8 * 32 GB).

arieling avatar Jan 09 '20 04:01 arieling

@shifangtian You can follow my repo https://github.com/HAOCHENYE/GSCNN-for-Single-GPU This net is very hard to train on a single GPU; my mean IoU only reached 0.4084.

WiderResNet38 can't be trained on a single GPU. Please use at least 8 * 16 GB.

arieling avatar Jan 13 '20 19:01 arieling