Continuing from checkpoint results in: ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group
I wanted to pretrain the model on a new language, so I ran it on a dataset for 30 epochs. During training, the logger showed 200 M trainable params. After training and checking the results, I decided to train it some more, so I copied the config yaml and modified it to point to my already-trained model stored locally.
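For reference, the modified config looks roughly like this; the field names follow the sample configs shipped with the repo, and the local paths are hypothetical placeholders for my result directory and dataset:

```yaml
# Hypothetical second-stage config; only the model path differs from the first run.
pretrained_model_name_or_path: "./result/train_my_language/first_run"  # locally saved 200 M-param model
dataset_name_or_paths: ["./dataset/my_language"]                       # hypothetical dataset path
resume_from_checkpoint_path: null   # left unset at first (see below)
result_path: "./result"
exp_name: "train_my_language"
max_epochs: 30
```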
This, however, added another 59 M params to the model, as the console now says:
| Name | Type | Params
-------------------------------------
0 | model | DonutModel | 259 M
-------------------------------------
259 M Trainable params
0 Non-trainable params
259 M Total params
1,039.623 Total estimated model params size (MB)
My initial model was just 800 MB and 200 M params. Is this intentional? If not, what might've changed it? I'm using the exact same config except for the path to the model I want to train.
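To figure out where the extra ~59 M params come from, I can load the saved model and compare per-module counts against the original. A quick sketch, assuming the `DonutModel` class from the donut package and a hypothetical local path:

```python
from donut import DonutModel  # model class from the donut package

# Hypothetical local path to the model saved after the first 30-epoch run.
model = DonutModel.from_pretrained("./result/train_my_language/first_run")

# Total trainable parameters (should match the 200 M / 259 M shown by the logger).
n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params: {n_trainable / 1e6:.1f} M")

# Per-submodule breakdown, to spot which block accounts for the extra ~59 M.
for name, module in model.named_children():
    n_params = sum(p.numel() for p in module.parameters())
    print(f"{name}: {n_params / 1e6:.1f} M")
```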
OK, I've noticed I hadn't specified the checkpoint path in the config yaml. Now that I've pointed it to the artifacts.ckpt file, I'm getting the error ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group. How do I get around this?
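One workaround I'm considering, unless there's a proper way to resume, is to load only the model weights from artifacts.ckpt and start with a fresh optimizer, so the saved optimizer groups never have to match. A rough sketch, assuming artifacts.ckpt is a standard PyTorch Lightning checkpoint with a `state_dict` key and that the Lightning module stores the network under a `model.` prefix (paths are hypothetical):

```python
import torch
from donut import DonutModel  # model class from the donut package

# Hypothetical path to the Lightning checkpoint from the first run.
ckpt = torch.load("./result/train_my_language/first_run/artifacts.ckpt", map_location="cpu")

# Keep only the network weights; ignoring "optimizer_states" sidesteps the
# parameter-group size check that raises the ValueError when resuming.
state_dict = ckpt["state_dict"]

# The Lightning module prefixes the wrapped network's keys with "model.",
# so strip that before loading into a bare DonutModel.
state_dict = {k[len("model."):]: v for k, v in state_dict.items() if k.startswith("model.")}

# Load into a model with the same architecture the first run started from
# (assumed to be donut-base here).
model = DonutModel.from_pretrained("naver-clova-ix/donut-base")
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing keys:", missing)
print("unexpected keys:", unexpected)

# Optionally re-export the weights so the second training run can point
# pretrained_model_name_or_path at them instead of resuming the checkpoint.
model.save_pretrained("./result/train_my_language/first_run_weights_only")
```

The thinking is that the ValueError comes from restoring the saved optimizer state into an optimizer built for the now-larger 259 M-param model, so handing over only the weights (and leaving resume_from_checkpoint_path unset) should avoid it, at the cost of losing the optimizer and scheduler state.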