splatfacto: "RuntimeError: can't retain_grad on Tensor that has requires_grad=False"

Open bchretien opened this issue 4 months ago • 7 comments

Describe the bug

I get this error at random during training with splatfacto (tested on the c491e3e1 commit in Docker):

Step (% Done)       Train Iter (time)    ETA (time)           Train Rays / Sec     Test Rays / Sec
--------------------------------------------------------------------------------------------------------
27310 (91.03%)      135.259 ms           6 m, 3 s             19.69 M                                    4.11 M
27320 (91.07%)      134.431 ms           6 m, 0 s             19.78 M
27330 (91.10%)      30.948 ms            1 m, 22 s            21.33 M
27340 (91.13%)      30.579 ms            1 m, 21 s            21.44 M
27350 (91.17%)      31.151 ms            1 m, 22 s            21.02 M
27351 (91.17%)      1.59 M               31.496 ms            1 m, 23 s            20.60 M
27360 (91.20%)      31.850 ms            1 m, 24 s            20.39 M
27370 (91.23%)      31.817 ms            1 m, 23 s            20.46 M
27380 (91.27%)      30.517 ms            1 m, 19 s            21.66 M
27390 (91.30%)      30.040 ms            1 m, 18 s            22.08 M
----------------------------------------------------------------------------------------------------
Viewer running locally at: http://localhost:7007 (listening on 0.0.0.0)
[10:28:14] Culled 1085 gaussians (1085 below alpha thresh, 0 too bigs, 2170163 remaining)              splatfacto.py:528
Printing profiling stats, from longest to shortest duration in seconds
VanillaPipeline.get_average_eval_image_metrics: 0.8929
VanillaPipeline.get_eval_image_metrics_and_images: 0.1083
Trainer.train_iteration: 0.0458
VanillaPipeline.get_train_loss_dict: 0.0368
Trainer.eval_iteration: 0.0020
Traceback (most recent call last):
  File "/home/user/.local/bin/ns-train", line 8, in <module>
    sys.exit(entrypoint())
  File "/home/user/nerfstudio/nerfstudio/scripts/train.py", line 262, in entrypoint
    main(
  File "/home/user/nerfstudio/nerfstudio/scripts/train.py", line 247, in main
    launch(
  File "/home/user/nerfstudio/nerfstudio/scripts/train.py", line 189, in launch
    main_func(local_rank=0, world_size=world_size, config=config)
  File "/home/user/nerfstudio/nerfstudio/scripts/train.py", line 100, in train_loop
    trainer.train()
  File "/home/user/nerfstudio/nerfstudio/engine/trainer.py", line 287, in train
    self.eval_iteration(step)
  File "/home/user/nerfstudio/nerfstudio/utils/decorators.py", line 70, in wrapper
    ret = func(self, *args, **kwargs)
  File "/home/user/nerfstudio/nerfstudio/utils/profiler.py", line 112, in inner
    out = func(*args, **kwargs)
  File "/home/user/nerfstudio/nerfstudio/engine/trainer.py", line 520, in eval_iteration
    metrics_dict, images_dict = self.pipeline.get_eval_image_metrics_and_images(step=step)
  File "/home/user/nerfstudio/nerfstudio/utils/profiler.py", line 112, in inner
    out = func(*args, **kwargs)
  File "/home/user/nerfstudio/nerfstudio/pipelines/base_pipeline.py", line 339, in get_eval_image_metrics_and_images
    outputs = self.model.get_outputs_for_camera(camera)
  File "/home/user/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/nerfstudio/nerfstudio/models/splatfacto.py", line 914, in get_outputs_for_camera
    outs = self.get_outputs(camera.to(self.device))
  File "/home/user/nerfstudio/nerfstudio/models/splatfacto.py", line 767, in get_outputs
    self.xys.retain_grad()
RuntimeError: can't retain_grad on Tensor that has requires_grad=False

A check on requires_grad might be missing before the call to self.xys.retain_grad().

To Reproduce

Steps to reproduce the behavior:

  1. Run splatfacto training, for example:
    ns-train splatfacto --data data/nerfstudio/my_dataset --vis "viewer+tensorboard" --pipeline.model.rasterize-mode antialiased
    
  2. At some point, training may fail at random with the error above.

Expected behavior

No random error.

Screenshots

N/A

Additional context

N/A

bchretien avatar Feb 29 '24 10:02 bchretien

        # Important to allow xys grads to populate properly
        if self.training:
            try:
                self.xys.retain_grad()
            except Exception as e:
                print(e)
can't retain_grad on Tensor that has requires_grad=False
Exception in thread Thread-7:
Traceback (most recent call last):
  File "C:\Users\jyomu\scoop\persist\rye\py\[email protected]\install\lib\threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "E:\AI\nerfstudio\nerfstudio\viewer\render_state_machine.py", line 222, in run
    outputs = self._render_img(action.camera_state)
  File "E:\AI\nerfstudio\nerfstudio\viewer\render_state_machine.py", line 177, in _render_img
    assert len(outputs["depth"].shape) == 3
AttributeError: 'NoneType' object has no attribute 'shape'

jyomu avatar Mar 04 '24 16:03 jyomu

I'm facing the same issue when I run ns-train splatfacto. Is there any resolution for this?

InduCherukuri avatar Mar 19 '24 07:03 InduCherukuri

@jyomu @bchretien I'm facing the same issue. Any leads on the cause?

Exception4U avatar Mar 19 '24 08:03 Exception4U

@InduCherukuri / @Exception4U: here is a simple workaround (it probably does not address the root cause):

# Important to allow xys grads to populate properly
if self.training and self.xys.requires_grad:
    self.xys.retain_grad()
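
For anyone who wants to see the failure in isolation, here is a minimal sketch in plain PyTorch. It assumes the error comes from calling retain_grad() on a tensor computed while gradient tracking is disabled. The names `means` and `project` are hypothetical stand-ins, not nerfstudio code.

```python
import torch

# Stand-in for the model's learnable Gaussian parameters (hypothetical name).
means = torch.nn.Parameter(torch.randn(4, 2))


def project(training: bool, guarded: bool) -> torch.Tensor:
    """Toy stand-in for the projection step that produces `xys`."""
    xys = means * 2.0  # non-leaf tensor; requires_grad follows the autograd mode
    if guarded:
        # Workaround from this thread: only retain grads when autograd is tracking.
        if training and xys.requires_grad:
            xys.retain_grad()
    elif training:
        xys.retain_grad()  # raises if xys.requires_grad is False
    return xys


# Normal training step: autograd is on, so retain_grad() works and .grad is populated.
xys = project(training=True, guarded=False)
xys.sum().backward()
train_grad_populated = xys.grad is not None

# Render under no_grad while the "training" flag is still True: the unguarded call fails.
error_message = None
with torch.no_grad():
    try:
        project(training=True, guarded=False)
    except RuntimeError as exc:
        error_message = str(exc)

# Same situation with the guard: no exception, and no gradient tracking.
with torch.no_grad():
    guarded_xys = project(training=True, guarded=True)

print(train_grad_populated, error_message, guarded_xys.requires_grad)
```

The guard only skips retain_grad() when there is nothing to retain, so the normal training path is unchanged. It does not explain why a render runs with the training flag still set.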

bchretien avatar Mar 20 '24 20:03 bchretien

Do you have a sample dataset that we can use to reproduce the error?

jb-ye avatar Mar 28 '24 21:03 jb-ye

@jb-ye: it happened somewhat at random, mostly when I interacted with the viewer and changed its settings (e.g. the rendering resolution). Alas, I cannot share the associated dataset.

bchretien avatar Apr 08 '24 10:04 bchretien

I am experiencing the same issue. It appears that when the viewer renders the image, the model is not correctly set to eval mode.
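
To illustrate why that would matter, here is a small sketch under the assumption that the render path disables gradients (the traceback above passes through torch's decorate_context wrapper, which is consistent with this) but does not call eval(). `ToyModel` is a hypothetical stand-in, not nerfstudio's model class.

```python
import torch
from torch import nn


class ToyModel(nn.Module):
    """Hypothetical stand-in; not nerfstudio's splatfacto model."""

    def __init__(self) -> None:
        super().__init__()
        self.weight = nn.Parameter(torch.ones(3))

    def forward(self) -> dict:
        out = self.weight * 2.0
        # Report both flags so the difference between them is visible.
        return {"training": self.training, "requires_grad": out.requires_grad}


model = ToyModel()  # an nn.Module starts in training mode

# no_grad() disables gradient tracking but leaves model.training untouched.
with torch.no_grad():
    render_without_eval = model()

# Switch to eval mode for the render, then restore training mode afterwards.
model.eval()
with torch.no_grad():
    render_with_eval = model()
model.train()

print(render_without_eval)  # {'training': True, 'requires_grad': False}
print(render_with_eval)     # {'training': False, 'requires_grad': False}
```

In the first case a branch guarded only by `self.training` would still run on a tensor that has requires_grad=False, which matches the reported error.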

sephyli avatar Apr 19 '24 03:04 sephyli