improved-diffusion
QUERY: Load from Checkpoint
I'm trying to work with a custom model and, due to a technical problem, have to restart from a checkpoint. When I use --resume_checkpoint I get this:
loading model from checkpoint: MAINMOD_70K/ema_0.9999_070000.pt...
| grad_norm | 0.0342 | | loss | 0.000191 | | loss_q2 | 0.000191 | | mse | 0.000191 | | mse_q2 | 0.000191 | | samples | 1 | | step | 0 |
saving model 0... saving model 0.9999...
On the one hand it says it's loading the model, but on the other it seems to be restarting from scratch. Am I missing something?
Answering my own comment: although the training starts again from step one and the saved models are numbered from scratch, it does continue building on the checkpoint.
I notice the self.resume_step setting in train_util.py. You could probably modify it to get the correct step count when resuming training.
You should probably pass "path/to/modelNNNNNN.pt": in train_util.py it serves as the main checkpoint path, and the matching ema_xxxx_NNNNNN.pt will be loaded automatically. The code splits the filename on "model" to recover the step count, so if you pass an "ema_..." file instead, the step cannot be parsed and 0 is used.
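A minimal sketch of that parsing behavior (assuming the step is recovered by splitting the filename on the literal string "model", as described above; the exact code in train_util.py may differ):

```python
def parse_resume_step_from_filename(filename):
    """Recover the training step from a checkpoint path.

    Works for paths like "path/to/modelNNNNNN.pt"; for any other
    naming scheme (e.g. "ema_0.9999_070000.pt") it falls back to 0.
    """
    split = filename.split("model")
    if len(split) < 2:
        # No "model" substring: e.g. an EMA checkpoint was passed.
        return 0
    # Take everything after the last "model" and drop the ".pt" suffix.
    step_str = split[-1].split(".")[0]
    try:
        return int(step_str)
    except ValueError:
        return 0


# A "model" checkpoint yields the real step; an EMA checkpoint yields 0.
print(parse_resume_step_from_filename("MAINMOD_70K/model070000.pt"))        # 70000
print(parse_resume_step_from_filename("MAINMOD_70K/ema_0.9999_070000.pt"))  # 0
```

This is why the log above shows "step | 0" even though the weights were loaded: the EMA filename defeats the step parser, not the weight loading.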