
Continuing from checkpoint results in: ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group

Open csanadpoda opened this issue 1 year ago • 1 comment

I wanted to pretrain the model on a new language, so I trained it on a dataset for 30 epochs. During training, the logger reported 200 M trainable params. After training and checking the results, I decided to train it some more, so I copied the config yaml and modified it to point to my already-trained model stored locally.

This, however, added another 59 M params to the model, as the console now says:

  | Name  | Type       | Params
-------------------------------------
0 | model | DonutModel | 259 M
-------------------------------------
259 M     Trainable params
0         Non-trainable params
259 M     Total params
1,039.623 Total estimated model params size (MB)

My initial model was just 800 MB with 200 M params. Is this intentional? If not, what might have changed it? I'm using the exact same config except for the path to the model I want to train.
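
For reference, this is roughly how I compared the parameter counts of the two models (just a sketch, assuming the donut package from this repo is installed; the local path is a placeholder for wherever my result was saved):

```python
from donut import DonutModel

# base model vs. my locally trained one ("./result/my_model" is a placeholder path)
base = DonutModel.from_pretrained("naver-clova-ix/donut-base")
mine = DonutModel.from_pretrained("./result/my_model")

def count_trainable(model):
    # sum of elements in all parameters that require gradients
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"base: {count_trainable(base) / 1e6:.0f} M trainable params")
print(f"mine: {count_trainable(mine) / 1e6:.0f} M trainable params")
```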

csanadpoda · Apr 11 '23 13:04

OK, I've noticed I hadn't specified the checkpoint path in the config yaml. Now I have: I pointed it at the artifacts.ckpt file, but I'm getting the error ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group. How do I get around this?
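
Is the right workaround to load only the model weights from artifacts.ckpt and let the Trainer build a fresh optimizer, instead of resuming the full training state? Something along these lines (a rough sketch, assuming artifacts.ckpt is a standard PyTorch Lightning checkpoint with a "state_dict" key, and that the model is wrapped in the LightningModule from this repo's training script):

```python
import torch

def load_weights_only(model_module, ckpt_path="artifacts.ckpt"):
    """Load only the model weights from a Lightning checkpoint,
    skipping the saved optimizer state that no longer matches the model."""
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    state_dict = checkpoint.get("state_dict", checkpoint)  # fall back if it's a bare state dict
    model_module.load_state_dict(state_dict)
    return model_module

# Usage idea: build the LightningModule as usual (DonutModelPLModule, if I read
# lightning_module.py right), call load_weights_only(...) on it, then run
# trainer.fit(model_module) WITHOUT resume_from_checkpoint / ckpt_path,
# so Lightning creates a new optimizer instead of restoring the mismatched one.
```

Or is there a proper way to resume that keeps the optimizer state?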

csanadpoda · Apr 11 '23 13:04