nerfstudio

Not possible to resume training from a checkpoint when training with `--machine.num-gpus 2`

Open paul-gauthier opened this issue 1 year ago • 2 comments

Describe the bug: It is not possible to resume training from a checkpoint when training with `--machine.num-gpus 2`.

To Reproduce: Steps to reproduce the behavior:

  1. Run `ns-train nerfacto ... --machine.num-gpus 2`
  2. Interrupt training after a checkpoint has been created.
  3. Run `ns-train nerfacto ... --machine.num-gpus 2 --trainer.load_dir outputs/path/to//nerfstudio_models/`
  4. See the error below.
RuntimeError: Error(s) in loading state_dict for VanillaPipeline:
        Missing key(s) in state_dict: "_model.module.device_indicator_param", "_model.module.field.aabb", "_model.module.field.embedding_appearance.embedding.weight", ", 
...
 "_model.module.lpips.net.lins.4.model.1.weight".
        Unexpected key(s) in state_dict: "_model.device_indicator_param", "_model.field.aabb", "_model.field.embedding_appearance.embedding.weight", "_model.field.direction_encoding.params", 
...
"_model.lpips.net.lins.4.model.1.weight".

Expected behavior: It should load the checkpoint and resume training. This works fine for me if I don't attempt to train on more than one GPU.
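
The missing and unexpected keys in the error differ only by a `module.` segment after `_model.`. That pattern is consistent with the checkpoint having been saved from the unwrapped model while the multi-GPU run loads it into a model wrapped by PyTorch's `DistributedDataParallel`, which exposes the wrapped network as `.module`. The sketch below shows one possible workaround under that assumption: remapping the checkpoint keys before calling `load_state_dict`. The helper names `add_ddp_prefix` and `strip_ddp_prefix` are hypothetical and not part of nerfstudio, and the `_model.` root is taken from the key names in the error message.

```python
from collections import OrderedDict


def add_ddp_prefix(state_dict, root="_model.", wrapper="module."):
    """Return a copy where keys under `root` gain the DDP `module.` segment.

    Hypothetical helper: `root` and `wrapper` are assumptions based on the
    key names shown in the error message above.
    """
    remapped = OrderedDict()
    for key, value in state_dict.items():
        # Only touch keys under the model root that are not already wrapped.
        if key.startswith(root) and not key.startswith(root + wrapper):
            key = root + wrapper + key[len(root):]
        remapped[key] = value
    return remapped


def strip_ddp_prefix(state_dict, root="_model.", wrapper="module."):
    """Inverse of add_ddp_prefix: drop the `module.` segment after `root`."""
    remapped = OrderedDict()
    for key, value in state_dict.items():
        if key.startswith(root + wrapper):
            key = root + key[len(root + wrapper):]
        remapped[key] = value
    return remapped


# Toy stand-in for a checkpoint saved without the DDP wrapper.
checkpoint = {
    "_model.device_indicator_param": 0,
    "_model.field.aabb": 1,
}
wrapped = add_ddp_prefix(checkpoint)
print(list(wrapped.keys()))
```

With a real checkpoint, the remapped dictionary would be passed to `load_state_dict` in place of the original one. Whether to add or strip the segment depends on which side is wrapped at load time.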

paul-gauthier, Oct 10 '22 03:10

Am I missing documentation regarding `ns-train`? I don't see these arguments documented anywhere.

venatiodecorus, Oct 10 '22 16:10

Hi, can you try `ns-train --help` and `ns-train nerfacto --help`? That will give you the documentation.

ethanweber, Oct 10 '22 17:10

I have a similar problem with one GPU.

MalcolmMielle, Apr 28 '23 13:04