CARN-pytorch
Training slows down
I ran the training code on 2 GPUs and found that the training time increases by about 7s every 1000 steps. I tried adding torch.cuda.empty_cache() every 1000 steps, but it doesn't help. Is there any solution for this?
Thanks.
Hi. Is the training time cumulatively increasing by 7s every 1k steps? I haven't plotted a wall-time graph, so I wasn't aware of this issue and I'm not sure how to solve it. Sorry.
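One way to confirm whether the slowdown is cumulative is to log the wall time per 1000-step block and see if the numbers keep growing. A minimal sketch, where `step_fn` is a hypothetical stand-in for one training step:

```python
import time

def time_per_interval(step_fn, total_steps=3000, interval=1000):
    """Record wall time for each block of `interval` steps.
    If successive entries keep growing, the slowdown is cumulative."""
    times = []
    t0 = time.perf_counter()
    for step in range(1, total_steps + 1):
        step_fn()  # hypothetical: one training step
        if step % interval == 0:
            t1 = time.perf_counter()
            times.append(t1 - t0)
            t0 = t1
    return times

# Example with a dummy step; in real use, pass your train-step closure.
durations = time_per_interval(lambda: None)
```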
I wonder whether it is caused by the dataloader. You could set pin_memory, split the data equally across workers, or rewrite TrainDataset. It currently opens the h5 file in __init__; another way is to open it in __getitem__, which enables multi-process reading. This works well for sqlite, so I guess it will work for h5 too.
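A sketch of the lazy-open pattern described above: store only the path in __init__ and open the file handle on first access in __getitem__, so each DataLoader worker process gets its own handle. Shown here with sqlite3 (stdlib) since the same pattern applies to h5py; the table and class names are illustrative, not from the repo.

```python
import os
import sqlite3
import tempfile

class LazyOpenDataset:
    """Dataset-style class that opens its backing file lazily per process,
    instead of eagerly in __init__. Hypothetical example, not the repo's code."""

    def __init__(self, path):
        self.path = path
        self._conn = None  # opened lazily, once per worker process

    def __len__(self):
        conn = sqlite3.connect(self.path)
        n = conn.execute("SELECT COUNT(*) FROM patches").fetchone()[0]
        conn.close()
        return n

    def __getitem__(self, idx):
        if self._conn is None:  # first access in this process
            self._conn = sqlite3.connect(self.path)
        row = self._conn.execute(
            "SELECT data FROM patches WHERE id = ?", (idx,)
        ).fetchone()
        return row[0]

# Build a toy database so the sketch is self-contained.
fd, path = tempfile.mkstemp(suffix=".db")
os.close(fd)
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE patches (id INTEGER PRIMARY KEY, data BLOB)")
conn.executemany("INSERT INTO patches VALUES (?, ?)",
                 [(i, bytes([i])) for i in range(4)])
conn.commit()
conn.close()

ds = LazyOpenDataset(path)
```

With h5py the equivalent would be keeping `self.h5 = None` in __init__ and doing `self.h5 = h5py.File(self.path, "r")` inside __getitem__ on first use.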
Hi, how long do 1000 steps take for you? My training is slow on 2 K80s: it takes about 20 hours for 30000+ steps. Is that normal? I think it's too slow. ps: patch_size 64, batch_size 96
Hi, what datasets did you train the model on? Just DIV2K? In the paper, three datasets are used for training.
Thanks.
@feiyangha Just DIV2K. We describe three datasets because they have been widely used, but we chose to train on DIV2K.
dataset.py loads all the data in *.h5 into memory, so you must make sure your memory is sufficient.
Also, your system may be disturbed by high cache occupancy, which can block training while it waits for memory allocation or swapping.
Use the 'htop' command to check your machine, and try this before training: sync; echo 3 > /proc/sys/vm/drop_caches
It will clear the PageCache, dentries, and inodes.
Hope it helps, good luck.
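For reference, a sketch of the check-then-flush sequence described above (the flush requires root, so it is left commented here):

```shell
# Check free memory and how much is held by the page cache ("buff/cache").
free -h

# If buff/cache is large, flush it before training (needs root):
# sync; echo 3 > /proc/sys/vm/drop_caches
```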