CARN-pytorch
Training slows down
I ran the training code on 2 GPUs and found that the training time increases by about 7s every 1000 steps. I tried adding torch.cuda.empty_cache() every 1000 steps, but it doesn't help. Is there any solution for this?
Thanks.
Hi. Is the training time cumulatively increasing by 7s every 1k steps? I haven't plotted a wall-time graph, so I wasn't aware of this issue and I'm not sure how to solve it. Sorry.
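One way to confirm whether the slowdown is cumulative is to log the wall time per 1000-step block and see if the numbers keep growing. A minimal sketch, where `step_fn` is a hypothetical stand-in for one training step:

```python
import time

def time_per_interval(step_fn, total_steps=3000, interval=1000):
    """Record wall time for each block of `interval` steps.
    If successive entries keep growing, the slowdown is cumulative."""
    times = []
    t0 = time.perf_counter()
    for step in range(1, total_steps + 1):
        step_fn()  # hypothetical: one training step
        if step % interval == 0:
            t1 = time.perf_counter()
            times.append(t1 - t0)
            t0 = t1
    return times

# Example with a dummy step; in real use, pass your train-step closure.
durations = time_per_interval(lambda: None)
```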
I wonder whether it is caused by the dataloader. You could set pin_memory, split the data equally across workers, or rewrite TrainDataset. It currently opens the h5 file in __init__; another way is to open it in __getitem__, which enables multi-process reading. This works well for sqlite, so I guess it will work for h5 too.
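A sketch of the lazy-open pattern described above: store only the path in __init__ and open the file handle on first access in __getitem__, so each DataLoader worker process gets its own handle. Shown here with sqlite3 (stdlib) since the same pattern applies to h5py; the table and class names are illustrative, not from the repo.

```python
import os
import sqlite3
import tempfile

class LazyOpenDataset:
    """Dataset-style class that opens its backing file lazily per process,
    instead of eagerly in __init__. Hypothetical example, not the repo's code."""

    def __init__(self, path):
        self.path = path
        self._conn = None  # opened lazily, once per worker process

    def __len__(self):
        conn = sqlite3.connect(self.path)
        n = conn.execute("SELECT COUNT(*) FROM patches").fetchone()[0]
        conn.close()
        return n

    def __getitem__(self, idx):
        if self._conn is None:  # first access in this process
            self._conn = sqlite3.connect(self.path)
        row = self._conn.execute(
            "SELECT data FROM patches WHERE id = ?", (idx,)
        ).fetchone()
        return row[0]

# Build a toy database so the sketch is self-contained.
fd, path = tempfile.mkstemp(suffix=".db")
os.close(fd)
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE patches (id INTEGER PRIMARY KEY, data BLOB)")
conn.executemany("INSERT INTO patches VALUES (?, ?)",
                 [(i, bytes([i])) for i in range(4)])
conn.commit()
conn.close()

ds = LazyOpenDataset(path)
```

With h5py the equivalent would be keeping `self.h5 = None` in __init__ and doing `self.h5 = h5py.File(self.path, "r")` inside __getitem__ on first use.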
Hi, how long do 1000 steps take for you? My training is slow on 2 K80s: it takes about 20 hours for 30000+ steps. Is that normal? I think it's too slow. ps: patch_size 64, batch_size 96
Hi, what datasets did you train the model on? Just DIV2K? In the paper, three datasets are used for training.
Thanks.
@feiyangha Just DIV2K. We describe three datasets because they have been widely used, but we chose to train on DIV2K.
dataset.py loads all the data in *.h5 into memory, so you must make sure your memory is sufficient.
Also, your system may be disturbed by high cache occupancy, which can block training while it waits for memory allocation or swapping.
Use the 'htop' command to check your machine, and try this before training: sync; echo 3 > /proc/sys/vm/drop_caches
It will clear the PageCache, dentries, and inodes.
Hope it helps, good luck.
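For reference, a sketch of the check-then-flush sequence described above (the flush requires root, so it is left commented here):

```shell
# Check free memory and how much is held by the page cache ("buff/cache").
free -h

# If buff/cache is large, flush it before training (needs root):
# sync; echo 3 > /proc/sys/vm/drop_caches
```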