contrastive-unpaired-translation
Training stops around 50 epochs
Hi, thank you for advancing the state of the art and sharing your code with a well documented project.
I am training the standard CUT model on my own dataset of ~1500 images. Unfortunately, training always simply stops around 47-50 epochs. I have tried both python train.py --dataroot [x] --name=[x] --CUT_mode CUT and python -m experiments [name] train 0, and I have tried running via nohup and screen.
The process does not exit: training simply stops, with no further error messages or logs. With nvidia-smi I can see the process still running, but with no GPU utilisation.
My specs are Ryzen 3950X / 128GB RAM / 2080 Ti, so there shouldn't be any resource constraints.
I have had training failures at epoch 47, epoch 49, epoch 50, etc. It's not a constant number, but it always happens around the same point. Any ideas?
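In case it helps to narrow this down (a suggestion, not something tried in this thread): when the process is alive but idle, dumping its Python stack shows whether it is blocked in the data loader, in a visdom network call, or somewhere else. A minimal sketch using py-spy, assuming <PID> is the process ID that nvidia-smi reports for train.py:

# Install the sampling profiler; it attaches to a running process, no restart needed
pip install py-spy
# Print the current Python stack of every thread in the hung training process
# (may require sudo depending on the system's ptrace settings)
py-spy dump --pid <PID>

The frames at the top of the dump usually point at whatever call is blocking, for example a socket read inside the visdom client.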
I encountered the same issue when training the model with the "summer2winter_yosemite" dataset. My training froze at Epoch 39.
Same here, froze after 220000 iterations in total :/
My training also froze around 100 epochs. Didn't error out, just hung.
Ran with the following command:
python train.py --dataroot "./datasets/{target}_{collection}" --name "{target}_{collection}_FastCUT" --CUT_mode FastCUT --no_html --checkpoints_dir "/content/gdrive/MyDrive/Computer Vision Project/FastCUT_models"
I encountered the same problem as well, and --no_html does not help in my case. Could it be related to visdom?
Apparently there was a similar issue here: https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix/issues/619 Although I haven't tried their solution, it seems that it is indeed a visdom issue.
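If the visdom connection is indeed what hangs, two workarounds may be worth trying. Note that the flags below (--display_id, --display_port) come from the CycleGAN/pix2pix option parser that CUT is built on, so treat their presence here as an assumption and check the options in this repository first:

# Workaround 1: disable the visdom web display entirely
# (in that codebase, a non-positive display_id should skip the visdom visualizer)
python train.py --dataroot [x] --name [x] --CUT_mode CUT --display_id 0

# Workaround 2: start a local visdom server before training, so the client
# has something to connect to instead of retrying
python -m visdom.server -port 8097 &
python train.py --dataroot [x] --name [x] --CUT_mode CUT --display_port 8097

Either way, if training then runs past the point where it used to freeze, that points at the visualizer rather than the training loop itself.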