nerfstudio
Not possible to resume training from a checkpoint when training with `--machine.num-gpus 2`
Describe the bug
It is not possible to resume training from a checkpoint when training with `--machine.num-gpus 2`.
To Reproduce
Steps to reproduce the behavior:
- Run `ns-train nerfacto ... --machine.num-gpus 2`
- Interrupt training after a checkpoint has been created.
- Run `ns-train nerfacto ... --machine.num-gpus 2 --trainer.load_dir outputs/path/to//nerfstudio_models/`
- See error below.
RuntimeError: Error(s) in loading state_dict for VanillaPipeline:
Missing key(s) in state_dict: "_model.module.device_indicator_param", "_model.module.field.aabb", "_model.module.field.embedding_appearance.embedding.weight", ",
...
"_model.module.lpips.net.lins.4.model.1.weight".
Unexpected key(s) in state_dict: "_model.device_indicator_param", "_model.field.aabb", "_model.field.embedding_appearance.embedding.weight", "_model.field.direction_encoding.params",
...
"_model.lpips.net.lins.4.model.1.weight".
Expected behavior
It should load the checkpoint and resume training. This works fine for me if I don't attempt to train on more than one GPU.
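The missing and unexpected keys in the error above differ only by a `module.` segment after `_model.`, which is what you would typically see if the model is wrapped (for example by a distributed wrapper) at load time but the checkpoint was saved with unwrapped key names. As a rough illustration only, and not nerfstudio's actual fix, the sketch below remaps the saved keys to the expected names. The helper `add_module_prefix` and its parameters are hypothetical, and the values are placeholders standing in for tensors.

```python
from collections import OrderedDict


def add_module_prefix(state_dict, root="_model.", wrapper="module."):
    """Hypothetical helper: return a copy of state_dict in which every key
    under `root` gains the `wrapper` segment, unless it already has it."""
    remapped = OrderedDict()
    for key, value in state_dict.items():
        if key.startswith(root) and not key.startswith(root + wrapper):
            # "_model.field.aabb" -> "_model.module.field.aabb"
            key = root + wrapper + key[len(root):]
        remapped[key] = value
    return remapped


# Key names copied from the error message; the values are placeholders.
saved = OrderedDict([
    ("_model.device_indicator_param", 0),
    ("_model.field.aabb", 1),
])

remapped = add_module_prefix(saved)
print(list(remapped))
```

Applying the helper twice leaves the keys unchanged, so it should be safe to run on a checkpoint that already carries the `module.` segment.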
Am I missing documentation regarding `ns-train`? I don't see these arguments documented anywhere.
Hi, can you try `ns-train --help` and `ns-train nerfacto --help`? That will give you the documentation.
I have a similar problem with one GPU