contrastive-unpaired-translation
Training stops around 50 epochs
Hi, thank you for advancing the state of the art and sharing your code with a well documented project.
I am training the standard CUT model on my own dataset of ~1500 images. Unfortunately, training always simply stops around 47-50 epochs. I have tried both python train.py --dataroot [x] --name=[x] --CUT_mode CUT and python -m experiments [name] train 0, and I have tried running via nohup and screen.
The process does not exit: training simply stops, with no further error messages or logs. With nvidia-smi I can see the process still running, but with no GPU utilisation.
My specs are Ryzen 3950X / 128GB RAM / 2080 Ti, so there shouldn't be any resource constraints.
I have had training failures at epoch 47, epoch 49, epoch 50, etc. It's not a constant number, but it always happens around the same point. Any ideas?
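In case it helps to narrow this down (a suggestion, not something tried in this thread): when the process is alive but idle, dumping its Python stack shows whether it is blocked in the data loader, in a visdom network call, or somewhere else. A minimal sketch using py-spy, assuming <PID> is the process ID that nvidia-smi reports for train.py:

# Install the sampling profiler; it attaches to a running process, no restart needed
pip install py-spy
# Print the current Python stack of every thread in the hung training process
# (may require sudo depending on the system's ptrace settings)
py-spy dump --pid <PID>

The frames at the top of the dump usually point at whatever call is blocking, for example a socket read inside the visdom client.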
I encountered the same issue when training the model with the "summer2winter_yosemite" dataset. My training froze at Epoch 39.
Same here, froze after 220000 iterations in total :/
My training also froze around 100 epochs. Didn't error out, just hung.
Ran with the following command:
python train.py --dataroot "./datasets/{target}_{collection}" --name "{target}_{collection}_FastCUT" --CUT_mode FastCUT --no_html --checkpoints_dir "/content/gdrive/MyDrive/Computer Vision Project/FastCUT_models"
I encountered the same problem as well, and --no_html does not help in my case. Could it be related to visdom?
Apparently there was a similar issue here: https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix/issues/619 Although I haven't tried their solution, it seems that it is indeed a visdom issue.
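If the visdom connection is indeed what hangs, two workarounds may be worth trying. Note that the flags below (--display_id, --display_port) come from the CycleGAN/pix2pix option parser that CUT is built on, so treat their presence here as an assumption and check the options in this repository first:

# Workaround 1: disable the visdom web display entirely
# (in that codebase, a non-positive display_id should skip the visdom visualizer)
python train.py --dataroot [x] --name [x] --CUT_mode CUT --display_id 0

# Workaround 2: start a local visdom server before training, so the client
# has something to connect to instead of retrying
python -m visdom.server -port 8097 &
python train.py --dataroot [x] --name [x] --CUT_mode CUT --display_port 8097

Either way, if training then runs past the point where it used to freeze, that points at the visualizer rather than the training loop itself.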