
Race condition between evaluation and viewer


Both the train iteration and the viewer render iteration are protected by the thread lock, but the eval iteration is not. This can lead to unexpected crashes because of race conditions.
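To make the structure concrete, here is a minimal sketch of the locking pattern as I understand it (the function names are hypothetical, not the actual trainer code):

import threading

train_lock = threading.Lock()

def training_thread(trainer):
    for step in range(trainer.max_steps):
        with train_lock:                  # train iteration holds the lock
            trainer.train_iteration(step)
        if step % 500 == 0:
            trainer.eval_iteration(step)  # eval iteration does NOT hold the lock

def viewer_render_thread(viewer):
    while True:
        with train_lock:                  # viewer render also holds the lock
            viewer.render_current_view()
        # so only the eval iteration can run concurrently with a viewer render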

For example, start training by:

ns-train nerfacto --vis viewer+tensorboard --pipeline.model.predict-normals True --data data/nerfstudio/poster

While the eval iteration runs (by default every 500 steps), keep changing the viewpoint in the viewer. I have encountered two types of crash. The first is an assertion failure, raised both in the main training thread and in the viewer's render thread:

Traceback (most recent call last):
  File "/home/sjj118/anaconda3/envs/nerfstudio/bin/ns-train", line 8, in <module>
    sys.exit(entrypoint())
  File "/home/sjj118/fork-nerfstudio/nerfstudio/scripts/train.py", line 260, in entrypoint
    main(
  File "/home/sjj118/fork-nerfstudio/nerfstudio/scripts/train.py", line 246, in main
    launch(
  File "/home/sjj118/fork-nerfstudio/nerfstudio/scripts/train.py", line 185, in launch
    main_func(local_rank=0, world_size=world_size, config=config)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/scripts/train.py", line 100, in train_loop
    trainer.train()
  File "/home/sjj118/fork-nerfstudio/nerfstudio/engine/trainer.py", line 277, in train
    self.eval_iteration(step)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/utils/decorators.py", line 70, in wrapper
    ret = func(self, *args, **kwargs)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/utils/profiler.py", line 127, in inner
    out = func(*args, **kwargs)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/engine/trainer.py", line 490, in eval_iteration
    metrics_dict, images_dict = self.pipeline.get_eval_image_metrics_and_images(step=step)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/utils/profiler.py", line 127, in inner
    out = func(*args, **kwargs)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/pipelines/base_pipeline.py", line 328, in get_eval_image_metrics_and_images
    outputs = self.model.get_outputs_for_camera_ray_bundle(camera_ray_bundle)
  File "/home/sjj118/anaconda3/envs/nerfstudio/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/models/base_model.py", line 177, in get_outputs_for_camera_ray_bundle
    outputs = self.forward(ray_bundle=ray_bundle)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/models/base_model.py", line 140, in forward
    return self.get_outputs(ray_bundle)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/models/nerfacto.py", line 274, in get_outputs
    field_outputs = self.field.forward(ray_samples, compute_normals=self.config.predict_normals)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/fields/base_field.py", line 131, in forward
    normals = self.get_normals()
  File "/home/sjj118/fork-nerfstudio/nerfstudio/fields/base_field.py", line 88, in get_normals
    assert (
AssertionError: Sample locations and density must have the same shape besides the last dimension.

Exception in thread Thread-6:
Traceback (most recent call last):
  File "/home/sjj118/anaconda3/envs/nerfstudio/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/sjj118/fork-nerfstudio/nerfstudio/viewer/server/render_state_machine.py", line 206, in run
    outputs = self._render_img(action.cam_msg)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/viewer/server/render_state_machine.py", line 181, in _render_img
    outputs = self.viewer.get_model().get_outputs_for_camera_ray_bundle(camera_ray_bundle)
  File "/home/sjj118/anaconda3/envs/nerfstudio/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/models/base_model.py", line 177, in get_outputs_for_camera_ray_bundle
    outputs = self.forward(ray_bundle=ray_bundle)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/models/base_model.py", line 140, in forward
    return self.get_outputs(ray_bundle)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/models/nerfacto.py", line 274, in get_outputs
    field_outputs = self.field.forward(ray_samples, compute_normals=self.config.predict_normals)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/fields/base_field.py", line 131, in forward
    normals = self.get_normals()
  File "/home/sjj118/fork-nerfstudio/nerfstudio/fields/base_field.py", line 88, in get_normals
    assert (
AssertionError: Sample locations and density must have the same shape besides the last dimension.

This may be because _sample_locations and _density_before_activation were written by different threads.
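To illustrate what I think happens, here is a small self-contained stand-in (FieldLike, cache_locations, and cache_density are made-up names; the real field caches these tensors on self inside its forward pass) that hits the same shape mismatch when the two writes interleave:

import torch

class FieldLike:
    """Minimal stand-in for the shared field: per-call tensors are cached on self."""

    def cache_locations(self, positions):
        self._sample_locations = positions

    def cache_density(self, density):
        self._density_before_activation = density

    def get_normals(self):
        assert self._sample_locations.shape[:-1] == self._density_before_activation.shape[:-1], (
            "Sample locations and density must have the same shape besides the last dimension."
        )

field = FieldLike()

# The interleaving is simulated deterministically here; in the real crash the two
# calls run concurrently on the eval thread and the viewer render thread.
field.cache_locations(torch.rand(4096, 48, 3))  # eval thread starts its forward pass
field.cache_locations(torch.rand(2048, 24, 3))  # viewer thread overwrites the cache
field.cache_density(torch.rand(4096, 48, 1))    # eval thread finishes its forward pass
field.get_normals()  # AssertionError: cached shapes (2048, 24) vs (4096, 48) no longer match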

The other one is:

Traceback (most recent call last):
  File "/home/sjj118/anaconda3/envs/nerfstudio/bin/ns-train", line 8, in <module>
    sys.exit(entrypoint())
  File "/home/sjj118/fork-nerfstudio/nerfstudio/scripts/train.py", line 260, in entrypoint
    main(
  File "/home/sjj118/fork-nerfstudio/nerfstudio/scripts/train.py", line 246, in main
    launch(
  File "/home/sjj118/fork-nerfstudio/nerfstudio/scripts/train.py", line 185, in launch
    main_func(local_rank=0, world_size=world_size, config=config)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/scripts/train.py", line 100, in train_loop
    trainer.train()
  File "/home/sjj118/fork-nerfstudio/nerfstudio/engine/trainer.py", line 277, in train
    self.eval_iteration(step)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/utils/decorators.py", line 70, in wrapper
    ret = func(self, *args, **kwargs)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/utils/profiler.py", line 127, in inner
    out = func(*args, **kwargs)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/engine/trainer.py", line 481, in eval_iteration
    _, eval_loss_dict, eval_metrics_dict = self.pipeline.get_eval_loss_dict(step=step)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/utils/profiler.py", line 127, in inner
    out = func(*args, **kwargs)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/pipelines/base_pipeline.py", line 313, in get_eval_loss_dict
    metrics_dict = self.model.get_metrics_dict(model_outputs, batch)
  File "/home/sjj118/fork-nerfstudio/nerfstudio/models/nerfacto.py", line 323, in get_metrics_dict
    metrics_dict["distortion"] = distortion_loss(outputs["weights_list"], outputs["ray_samples_list"])
KeyError: 'weights_list'

This is caused by the model's train/eval state being changed by another thread.
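Again, a small stand-in to illustrate the mechanism (ModelLike and its methods are made-up; as far as I can tell, the real nerfacto model only puts weights_list into its outputs while self.training is True, and get_metrics_dict reads it under the same condition):

import torch
import torch.nn as nn

class ModelLike(nn.Module):
    """Stand-in: "weights_list" only exists in training mode, and the metrics code
    assumes the mode has not been flipped in between."""

    def get_outputs(self):
        outputs = {"rgb": torch.zeros(3)}
        if self.training:
            outputs["weights_list"] = [torch.ones(8)]
        return outputs

    def get_metrics_dict(self, outputs):
        metrics = {}
        if self.training:
            metrics["distortion"] = sum(w.sum() for w in outputs["weights_list"])
        return metrics

model = ModelLike()

model.eval()                      # the eval iteration puts the model in eval mode
outputs = model.get_outputs()     # training is False -> no "weights_list" in outputs
model.train()                     # viewer thread restores train mode mid-iteration
model.get_metrics_dict(outputs)   # training is True again -> KeyError: 'weights_list'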

If I simply take the lock around the eval iteration, those bugs disappear: https://github.com/nerfstudio-project/nerfstudio/blob/1e205c1b5bd244dcd0c2303e2bed42d1ffc17625/nerfstudio/engine/trainer.py#L276-L277

if self.pipeline.datamanager.eval_dataset:
    with self.train_lock:
        self.eval_iteration(step)

But this will pause the viewer while all eval images are evaluated, which can take a long time. So we may need to pass the train_lock down to get_average_eval_image_metrics instead.
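Something like this rough sketch is what I have in mind; treat it as pseudocode built on top of the existing pipeline methods rather than a tested patch:

import contextlib

def get_average_eval_image_metrics(pipeline, step, train_lock=None):
    """Sketch: evaluate every eval image, but hold the (optional) lock only around the
    per-image render so the viewer waits for one image at a time, not the whole pass."""
    # step is kept only for parity with the existing signature
    lock = train_lock if train_lock is not None else contextlib.nullcontext()
    metrics_list = []
    for camera_ray_bundle, batch in pipeline.datamanager.fixed_indices_eval_dataloader:
        with lock:  # released between images so the viewer can render in between
            outputs = pipeline.model.get_outputs_for_camera_ray_bundle(camera_ray_bundle)
        metrics_dict, _ = pipeline.model.get_image_metrics_and_images(outputs, batch)
        metrics_list.append(metrics_dict)
    return metrics_list

The trainer's eval_iteration would then pass its self.train_lock down rather than wrapping the whole call in the lock.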

sjj118 avatar May 19 '23 12:05 sjj118

Voting this up; I encountered KeyError: 'weights_list' with the same traceback too.

yurkovak avatar Oct 06 '23 08:10 yurkovak

Has this been resolved in the meantime? Edit: I can reproduce the behavior (the exception is not raised when I do not open the viewer in the browser), but my working copy is a little older, so it might have been fixed by now ...

fortmeier avatar Apr 05 '24 11:04 fortmeier