improved-diffusion QUERY: Load from Checkpoint


Open Shadeenu opened this issue 3 years ago • 3 comments

I'm trying to work with a custom model and, due to a technical problem, have to restart from a checkpoint. When I use --resume_checkpoint, I get this:

loading model from checkpoint: MAINMOD_70K/ema_0.9999_070000.pt...

| grad_norm | 0.0342   |
| loss      | 0.000191 |
| loss_q2   | 0.000191 |
| mse       | 0.000191 |
| mse_q2    | 0.000191 |
| samples   | 1        |
| step      | 0        |

saving model 0...
saving model 0.9999...

On the one hand it says loading model but on the other it seems to be restarting from scratch. Am I missing something?

Aug 10 '22 12:08 Shadeenu

Answering my own question: although the step counter restarts and the saved checkpoints are numbered from scratch, training does continue building on the loaded checkpoint.

Aug 16 '22 10:08 Shadeenu


I noticed the "self.resume_step" setting in train_util.py. You could probably modify it to get the correct step when resuming training.

Jan 16 '23 11:01 Suimingzhe

You should probably pass "path/to/modelNNNNNN.pt" instead. In train_util.py it is treated as the main checkpoint path, and the matching ema_xxxx_xxxx.pt is loaded automatically. The code splits the filename on "model" to get the step count, so if the input is "ema...", the step cannot be parsed and 0 is used instead.
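To make the filename logic described above concrete, here is a minimal sketch. It is not copied from the repository: the function names `parse_step_from_filename` and `ema_path_for` are hypothetical, and the path "MAINMOD_70K/model070000.pt" is an assumed example of a main checkpoint saved next to the EMA file from the original post.

```python
import os


def parse_step_from_filename(path):
    """Hypothetical helper: recover the step count from .../modelNNNNNN.pt.

    Returns 0 when the path does not contain "model" followed by digits,
    which is what happens when an "ema_..." file is passed in.
    """
    parts = path.split("model")
    if len(parts) < 2:
        # No "model" substring at all, e.g. "MAINMOD_70K/ema_0.9999_070000.pt".
        return 0
    digits = parts[-1].split(".")[0]
    try:
        return int(digits)
    except ValueError:
        return 0


def ema_path_for(main_checkpoint, step, rate):
    """Hypothetical helper: build the EMA filename that sits beside the main checkpoint."""
    filename = f"ema_{rate}_{step:06d}.pt"
    return os.path.join(os.path.dirname(main_checkpoint), filename)


# The EMA file from the original post yields step 0.
print(parse_step_from_filename("MAINMOD_70K/ema_0.9999_070000.pt"))  # prints 0

# An assumed main checkpoint in the same directory yields the real step.
main = "MAINMOD_70K/model070000.pt"
step = parse_step_from_filename(main)
print(step)  # prints 70000
print(ema_path_for(main, step, "0.9999"))
```

If the repository behaves as described in this thread, passing the modelNNNNNN.pt file to --resume_checkpoint should let the step counter pick up at 70000 rather than 0.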

Jan 29 '23 10:01 WuhaoStatistic
