
Unstable training process

Open LuLing06 opened this issue 1 year ago • 3 comments

Describe the bug: I have trained three models (nerfacto, nerfacto-big, and instant-ngp) on my dataset and found that the training process was unstable; it looks like the attached image. Is there an implementation issue? Does nerfstudio implement early stopping?

LuLing06 avatar Jul 13 '23 21:07 LuLing06

The models are typically trained for fewer iterations, i.e., nerfacto is only set to train for 30k iterations by default. Some instabilities emerge when training for a long time; we have tried to look into them but have so far been unsuccessful.

tancik avatar Jul 13 '23 21:07 tancik

Thanks for your explanation. I have found a possible reason: it might be an issue with the resume process. When I resumed training, the learning rate went back to the default (0.01) instead of continuing from the final learning rate of the last step of the previous run. Here is the picture: image

I used this command to resume training: ns-train nerfacto --experiment-name $exp_name --timestamp $timestamp --data $data --load-dir $resume_dir --output-dir $output_dir --max-num-iterations $iterations --vis $vis Note: resume_dir=$output_dir/$exp_name/$exp_name/$timestamp/nerfstudio_models

How can I resume the learning rate from the latest checkpoint?
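For reference, here is a minimal PyTorch sketch of the general mechanism involved (not nerfstudio's actual trainer code): the decayed learning rate only survives a resume if the scheduler's state_dict is saved and restored alongside the optimizer's. The Linear stand-in model, the gamma value, and the checkpoint file name are placeholders for illustration.

```python
import torch
from torch import nn

# Tiny stand-in model; in practice the optimizer/scheduler wrap the real model fields.
model = nn.Linear(4, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999)

# ... training loop: step the optimizer and scheduler each iteration ...
for _ in range(100):
    optimizer.step()
    scheduler.step()

# Saving: include BOTH optimizer and scheduler state in the checkpoint.
ckpt = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scheduler": scheduler.state_dict(),  # without this, the LR restarts at its initial value
}
torch.save(ckpt, "checkpoint.pt")

# Resuming: restore the scheduler state so the decayed LR is picked up where it left off.
ckpt = torch.load("checkpoint.pt")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
scheduler.load_state_dict(ckpt["scheduler"])
print("resumed lr:", scheduler.get_last_lr())
```

If the checkpoint that --load-dir points to only restores model and optimizer state but not the scheduler, the learning rate would fall back to its configured initial value, which matches the behavior described above.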


LuLing06 avatar Jul 14 '23 19:07 LuLing06

Hi, I am running into a similar issue: after reloading a checkpoint, the model performance drops (p1). I checked that the learning rates were loaded correctly (p2), but there seem to be other issues with the loading, as the training losses camera_opt_regularizer and rgb_loss drop a lot (p3). p1: image p2: image p3: image

The loading command is ns-train nerfacto --load-dir outputs/processed/nerfacto/2024-02-29_175948/nerfstudio_models --data test/multiview_train_data/32/processed --vis wandb --max-num-iterations 60000. Is there a solution for this issue?
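As a debugging sketch (not an official nerfstudio tool), one way to narrow this down is to load the .ckpt file directly with torch.load and inspect which state it actually contains; if optimizer or scheduler entries are missing, that would explain a reset learning rate or a loss jump after resuming. The step-XXXXXXXXX.ckpt file name below is an assumption based on the nerfstudio_models directory referenced above and may differ by version.

```python
import torch

# Hypothetical checkpoint path: point this at one of the .ckpt files inside the
# nerfstudio_models directory used with --load-dir above.
ckpt_path = "outputs/processed/nerfacto/2024-02-29_175948/nerfstudio_models/step-000029999.ckpt"

# Load on CPU and list the top-level keys stored in the checkpoint.
ckpt = torch.load(ckpt_path, map_location="cpu")
print("top-level keys:", list(ckpt.keys()))

# For any nested dicts (e.g. per-parameter-group optimizer state), print a few
# of their keys to see what was actually serialized.
for key, value in ckpt.items():
    if isinstance(value, dict):
        print(f"{key}: {list(value.keys())[:5]}")
```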

viridityzhu avatar Feb 29 '24 11:02 viridityzhu