nerfstudio Not possible to resume training from a checkpoint when training with `--machine.num-gpus 2`

Open paul-gauthier opened this issue 2 years ago • 2 comments

Describe the bug

It is not possible to resume training from a checkpoint when training with `--machine.num-gpus 2`.

To Reproduce

Steps to reproduce the behavior:

1. Run `ns-train nerfacto ... --machine.num-gpus 2`
2. Interrupt training after a checkpoint has been created.
3. Run `ns-train nerfacto ... --machine.num-gpus 2 --trainer.load_dir outputs/path/to//nerfstudio_models/`
4. See error below.

RuntimeError: Error(s) in loading state_dict for VanillaPipeline:
        Missing key(s) in state_dict: "_model.module.device_indicator_param", "_model.module.field.aabb", "_model.module.field.embedding_appearance.embedding.weight", ", 
...
 "_model.module.lpips.net.lins.4.model.1.weight".
        Unexpected key(s) in state_dict: "_model.device_indicator_param", "_model.field.aabb", "_model.field.embedding_appearance.embedding.weight", "_model.field.direction_encoding.params", 
...
"_model.lpips.net.lins.4.model.1.weight".

Expected behavior

It should load the checkpoint and resume training. This works fine for me if I don't attempt to train on more than one GPU.
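For context on the error above: the missing keys all start with `_model.module.` while the unexpected keys start with `_model.`, which is the pattern you would typically see when a model wrapped in PyTorch's `DistributedDataParallel` (which nests the real model under a `module` attribute) is asked to load a state dict saved without that wrapper. The sketch below is only an illustration of remapping such keys on a plain dictionary. The helper name `add_ddp_prefix` and its parameters are hypothetical, it is not part of nerfstudio, and where (or whether) such a remap belongs in nerfstudio's checkpoint loading is an assumption.

```python
from collections import OrderedDict


def add_ddp_prefix(state_dict, wrapped="_model.", prefix="module."):
    """Hypothetical helper: insert `prefix` after `wrapped` in each matching key.

    "_model.field.aabb" -> "_model.module.field.aabb"
    Keys that do not start with `wrapped`, or that already carry the
    prefix, are copied unchanged.
    """
    remapped = OrderedDict()
    for key, value in state_dict.items():
        if key.startswith(wrapped) and not key.startswith(wrapped + prefix):
            key = wrapped + prefix + key[len(wrapped):]
        remapped[key] = value
    return remapped


# Stand-in for a loaded checkpoint; real values would be tensors.
saved = {"_model.device_indicator_param": 0, "_model.field.aabb": 1}
for name in add_ddp_prefix(saved):
    print(name)
```

With a real checkpoint, the remapped dictionary would be what gets passed to `load_state_dict` on the wrapped pipeline. The reverse case (stripping `module.` when loading a multi-GPU checkpoint on one GPU) would follow the same shape.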

Oct 10 '22 03:10 paul-gauthier

Am I missing documentation regarding ns-train? I don't see these arguments documented anywhere

Oct 10 '22 16:10 venatiodecorus

Hi, can you try `ns-train --help` and `ns-train nerfacto --help`? That will give you the documentation.

Oct 10 '22 17:10 ethanweber

I have a similar problem with one GPU.

Apr 28 '23 13:04 MalcolmMielle
