Race condition between evaluation and viewer
Both the train iteration and the viewer render iteration are protected by the thread lock, but the eval iteration is not. This can lead to unexpected crashes caused by race conditions.
For example, start training by running:
ns-train nerfacto --vis viewer+tensorboard --pipeline.model.predict-normals True --data data/nerfstudio/poster
When the eval iteration runs (every 500 steps by default), keep changing the viewpoint in the viewer. I have encountered two types of crash. The first produces a pair of tracebacks, one from the main training thread and one from the viewer render thread, both hitting the same assertion:
Main training thread:

Traceback (most recent call last):
  File "/home/sjj118/anaconda3/envs/nerfstudio/bin/ns-train", line 8, in <module>
    sys.exit(entrypoint())
  File "/home/sjj118/fork-nerfstudio/nerfstudio/scripts/train.py", line 260, in entrypoint
    main(
  File "/home/sjj118/fork-nerfstudio/nerfstudio/scripts/train.py", line 246, in main
    launch(
  File "/home/sjj118/fork-nerfstudio/nerfstudio/scripts/train.py", line 185, in launch
    main_func(local_rank=0, world_size=world_size, config=config)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/scripts/train.py", line 100, in train_loop
    trainer.train()
  File "/home/sjj118/fork-nerfstudio/nerfstudio/engine/trainer.py", line 277, in train
    self.eval_iteration(step)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/utils/decorators.py", line 70, in wrapper
    ret = func(self, *args, **kwargs)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/utils/profiler.py", line 127, in inner
    out = func(*args, **kwargs)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/engine/trainer.py", line 490, in eval_iteration
    metrics_dict, images_dict = self.pipeline.get_eval_image_metrics_and_images(step=step)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/utils/profiler.py", line 127, in inner
    out = func(*args, **kwargs)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/pipelines/base_pipeline.py", line 328, in get_eval_image_metrics_and_images
    outputs = self.model.get_outputs_for_camera_ray_bundle(camera_ray_bundle)
  File "/home/sjj118/anaconda3/envs/nerfstudio/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/models/base_model.py", line 177, in get_outputs_for_camera_ray_bundle
    outputs = self.forward(ray_bundle=ray_bundle)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/models/base_model.py", line 140, in forward
    return self.get_outputs(ray_bundle)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/models/nerfacto.py", line 274, in get_outputs
    field_outputs = self.field.forward(ray_samples, compute_normals=self.config.predict_normals)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/fields/base_field.py", line 131, in forward
    normals = self.get_normals()
  File "/home/sjj118/fork-nerfstudio/nerfstudio/fields/base_field.py", line 88, in get_normals
    assert (
AssertionError: Sample locations and density must have the same shape besides the last dimension.

Viewer render thread (Thread-6):

Exception in thread Thread-6:
Traceback (most recent call last):
  File "/home/sjj118/anaconda3/envs/nerfstudio/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/sjj118/fork-nerfstudio/nerfstudio/viewer/server/render_state_machine.py", line 206, in run
    outputs = self._render_img(action.cam_msg)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/viewer/server/render_state_machine.py", line 181, in _render_img
    outputs = self.viewer.get_model().get_outputs_for_camera_ray_bundle(camera_ray_bundle)
  File "/home/sjj118/anaconda3/envs/nerfstudio/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/models/base_model.py", line 177, in get_outputs_for_camera_ray_bundle
    outputs = self.forward(ray_bundle=ray_bundle)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/models/base_model.py", line 140, in forward
    return self.get_outputs(ray_bundle)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/models/nerfacto.py", line 274, in get_outputs
    field_outputs = self.field.forward(ray_samples, compute_normals=self.config.predict_normals)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/fields/base_field.py", line 131, in forward
    normals = self.get_normals()
  File "/home/sjj118/fork-nerfstudio/nerfstudio/fields/base_field.py", line 88, in get_normals
    assert (
AssertionError: Sample locations and density must have the same shape besides the last dimension.
This is likely because _sample_locations and _density_before_activation were written by different threads.
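For illustration, here is a minimal, self-contained sketch (toy code, not the nerfstudio source) of why stashing intermediates as module attributes breaks under two threads: each call to get_density overwrites the attributes, so a concurrently running get_normals can observe a mismatched pair and trip the same shape assertion.

import threading

import torch


class TinyField(torch.nn.Module):
    # Toy module that, like nerfstudio's Field, stashes intermediates on self.
    def get_density(self, positions):
        self._sample_locations = positions
        self._density_before_activation = positions.sum(dim=-1, keepdim=True)
        return torch.relu(self._density_before_activation)

    def get_normals(self):
        # Same check that fails in base_field.get_normals: if another thread ran
        # get_density between the two attribute writes above, the stashed tensors
        # can come from different ray batches and have different shapes.
        assert (
            self._sample_locations.shape[:-1] == self._density_before_activation.shape[:-1]
        ), "Sample locations and density must have the same shape besides the last dimension."
        return -torch.nn.functional.normalize(self._sample_locations, dim=-1)


field = TinyField()


def worker(num_rays):
    for _ in range(10_000):
        field.get_density(torch.rand(num_rays, 48, 3))
        field.get_normals()  # eventually trips the assertion when the threads race


# One thread stands in for the eval iteration, the other for the viewer render loop;
# different batch sizes make a mismatched pair detectable.
threads = [threading.Thread(target=worker, args=(4096,)), threading.Thread(target=worker, args=(256,))]
for t in threads:
    t.start()
for t in threads:
    t.join()

It may take many iterations for the interleaving to line up, but given enough time the assertion fires in one of the threads.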
The second type of crash is:
Traceback (most recent call last):
  File "/home/sjj118/anaconda3/envs/nerfstudio/bin/ns-train", line 8, in <module>
    sys.exit(entrypoint())
  File "/home/sjj118/fork-nerfstudio/nerfstudio/scripts/train.py", line 260, in entrypoint
    main(
  File "/home/sjj118/fork-nerfstudio/nerfstudio/scripts/train.py", line 246, in main
    launch(
  File "/home/sjj118/fork-nerfstudio/nerfstudio/scripts/train.py", line 185, in launch
    main_func(local_rank=0, world_size=world_size, config=config)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/scripts/train.py", line 100, in train_loop
    trainer.train()
  File "/home/sjj118/fork-nerfstudio/nerfstudio/engine/trainer.py", line 277, in train
    self.eval_iteration(step)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/utils/decorators.py", line 70, in wrapper
    ret = func(self, *args, **kwargs)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/utils/profiler.py", line 127, in inner
    out = func(*args, **kwargs)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/engine/trainer.py", line 481, in eval_iteration
    _, eval_loss_dict, eval_metrics_dict = self.pipeline.get_eval_loss_dict(step=step)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/utils/profiler.py", line 127, in inner
    out = func(*args, **kwargs)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/pipelines/base_pipeline.py", line 313, in get_eval_loss_dict
    metrics_dict = self.model.get_metrics_dict(model_outputs, batch)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/models/nerfacto.py", line 323, in get_metrics_dict
    metrics_dict["distortion"] = distortion_loss(outputs["weights_list"], outputs["ray_samples_list"])
KeyError: 'weights_list'
This is caused by the model's train/eval state being changed by another thread.
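The pattern is visible in the traceback: nerfacto's get_outputs stores weights_list only in train mode, and get_metrics_dict reads it back only in train mode, so a mode flip from another thread between the two calls produces the KeyError. Below is a toy sketch of that interleaving (again not nerfstudio code; the second thread simply toggles train/eval mode, standing in for the viewer render loop):

import threading

import torch


class TinyModel(torch.nn.Module):
    # Mimics nerfacto's pattern: some outputs and metrics exist only in train mode.
    def get_outputs(self):
        outputs = {"rgb": torch.rand(8, 3)}
        if self.training:
            outputs["weights_list"] = [torch.rand(8, 48, 1)]
        return outputs

    def get_metrics_dict(self, outputs):
        metrics = {}
        if self.training:
            # Raises KeyError: 'weights_list' if get_outputs ran in eval mode
            # and the mode was flipped back to train before this call.
            metrics["distortion"] = outputs["weights_list"][0].mean()
        return metrics


model = TinyModel()


def eval_thread():
    for _ in range(100_000):
        model.eval()
        outputs = model.get_outputs()
        model.get_metrics_dict(outputs)


def mode_toggling_thread():
    # Stands in for the viewer thread, which switches the model's mode around rendering.
    for _ in range(100_000):
        model.eval()
        model.train()


threads = [threading.Thread(target=eval_thread), threading.Thread(target=mode_toggling_thread)]
for t in threads:
    t.start()
for t in threads:
    t.join()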
If I simply add the lock around the eval iteration, those bugs disappear: https://github.com/nerfstudio-project/nerfstudio/blob/1e205c1b5bd244dcd0c2303e2bed42d1ffc17625/nerfstudio/engine/trainer.py#L276-L277
if self.pipeline.datamanager.eval_dataset:
    with self.train_lock:
        self.eval_iteration(step)
But this pauses the viewer while all eval images are rendered, which can take a long time. So we may need to pass the train_lock down to get_average_eval_image_metrics.
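One way to do that, sketched below with hypothetical signatures (the loop paraphrases the pipeline's get_average_eval_image_metrics; treat the exact attribute names as assumptions), is to accept an optional train_lock and hold it once per eval image, so the viewer only waits for a single image render instead of the whole pass:

import contextlib


def get_average_eval_image_metrics(self, step=None, train_lock=None):
    # Hypothetical sketch: acquire the trainer's lock per image rather than
    # around the entire all-images evaluation.
    metrics_dict_list = []
    for camera_ray_bundle, batch in self.datamanager.fixed_indices_eval_dataloader:
        with train_lock if train_lock is not None else contextlib.nullcontext():
            outputs = self.model.get_outputs_for_camera_ray_bundle(camera_ray_bundle)
        metrics_dict, _ = self.model.get_image_metrics_and_images(outputs, batch)
        metrics_dict_list.append(metrics_dict)
    # ...reduce metrics_dict_list to averages as the existing implementation does...
    return metrics_dict_list

The trainer's eval_iteration would then call self.pipeline.get_average_eval_image_metrics(step=step, train_lock=self.train_lock) instead of wrapping the whole call in the lock.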
Voting up, I encountered KeyError: 'weights_list' with the same trace too.
Has this been resolved in the meantime? Edit: I can reproduce the behavior (and the exception is not raised when I do not open the viewer in the browser), but my working copy is a little older so it might have been fixed by now ...