pix2pixHD
Training cannot be resumed after finalization (from last point)
Hello, I trained on a dataset of 18,234 pictures for a month, up to epoch 200, using this command: "python train.py --name synthuman_512p --save_epoch_freq 200 --no_instance --label_nc 0". Training stopped automatically at epoch 200. Before reaching that point I could easily resume with "--continue_train", but now I cannot resume anymore, even after specifying "--save_epoch_freq 400". The process exits abruptly at "create web directory ./checkpoints/synthuman_512p/web..." without printing an error message or freezing.

If I delete the file "iter.txt" (it contains only "201" and "0"), training seems to resume successfully but reports that it is resuming from "epoch 0". I didn't notice any quality loss; the output still looks the same as it did at epoch 200. Before that, I tried changing the "201" in iter.txt to "200": it seemed to resume from that point, but then it wouldn't go past 201 and stopped abruptly again.

Is this normal or a bug? I found no existing issue about it, so I may be the first to report it. Even if quality doesn't seem to be affected, this shouldn't be happening; I lost 2 days of training because of it.
I got it working by increasing --niter and --niter_decay in train_options.py, which raises the upper bound of the loop "for epoch in range(start_epoch, opt.niter + opt.niter_decay + 1)" in train.py.
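
To see why the script exits without an error, here is a minimal sketch of that loop bound in isolation. The helper name epochs_to_run is hypothetical, and the values niter = niter_decay = 100 are an assumption chosen to match the run stopping at epoch 200; only the range expression itself comes from train.py as quoted in this thread.

```python
def epochs_to_run(start_epoch, niter, niter_decay):
    """Return the epochs the training loop would visit, mirroring
    range(start_epoch, opt.niter + opt.niter_decay + 1) from train.py.
    (Hypothetical helper, for illustration only.)"""
    return list(range(start_epoch, niter + niter_decay + 1))


# Assumed values: iter.txt holds 201 once epoch 200 has finished, and
# niter = niter_decay = 100 (an assumption consistent with stopping at 200).
finished = epochs_to_run(start_epoch=201, niter=100, niter_decay=100)
print(len(finished))  # 0: the range is empty, so the loop body never runs

# Editing iter.txt to 200 leaves exactly one epoch, then the run stops again.
one_more = epochs_to_run(start_epoch=200, niter=100, niter_decay=100)
print(one_more)  # [200]

# Raising the totals gives the loop something to iterate over.
extended = epochs_to_run(start_epoch=201, niter=200, niter_decay=200)
print(extended[0], extended[-1])  # 201 400
```

So the silent exit looks like an empty loop rather than a crash. Since --niter and --niter_decay are written as options, passing larger values on the command line together with "--continue_train" should presumably work as well as editing train_options.py, though I have not verified that here.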