nerfstudio
splatfacto: "RuntimeError: can't retain_grad on Tensor that has requires_grad=False"
Describe the bug
I randomly get this error during training with splatfacto
(tested on the c491e3e1 commit in Docker):
Step (% Done) Train Iter (time) ETA (time) Train Rays / Sec Test Rays / Sec
--------------------------------------------------------------------------------------------------------
27310 (91.03%) 135.259 ms 6 m, 3 s 19.69 M 4.11 M
27320 (91.07%) 134.431 ms 6 m, 0 s 19.78 M
27330 (91.10%) 30.948 ms 1 m, 22 s 21.33 M
27340 (91.13%) 30.579 ms 1 m, 21 s 21.44 M
27350 (91.17%) 31.151 ms 1 m, 22 s 21.02 M
27351 (91.17%) 1.59 M 31.496 ms 1 m, 23 s 20.60 M
27360 (91.20%) 31.850 ms 1 m, 24 s 20.39 M
27370 (91.23%) 31.817 ms 1 m, 23 s 20.46 M
27380 (91.27%) 30.517 ms 1 m, 19 s 21.66 M
27390 (91.30%) 30.040 ms 1 m, 18 s 22.08 M
----------------------------------------------------------------------------------------------------
Viewer running locally at: http://localhost:7007 (listening on 0.0.0.0)
[10:28:14] Culled 1085 gaussians (1085 below alpha thresh, 0 too bigs, 2170163 remaining) splatfacto.py:528
Printing profiling stats, from longest to shortest duration in seconds
VanillaPipeline.get_average_eval_image_metrics: 0.8929
VanillaPipeline.get_eval_image_metrics_and_images: 0.1083
Trainer.train_iteration: 0.0458
VanillaPipeline.get_train_loss_dict: 0.0368
Trainer.eval_iteration: 0.0020
Traceback (most recent call last):
File "/home/user/.local/bin/ns-train", line 8, in <module>
sys.exit(entrypoint())
File "/home/user/nerfstudio/nerfstudio/scripts/train.py", line 262, in entrypoint
main(
File "/home/user/nerfstudio/nerfstudio/scripts/train.py", line 247, in main
launch(
File "/home/user/nerfstudio/nerfstudio/scripts/train.py", line 189, in launch
main_func(local_rank=0, world_size=world_size, config=config)
File "/home/user/nerfstudio/nerfstudio/scripts/train.py", line 100, in train_loop
trainer.train()
File "/home/user/nerfstudio/nerfstudio/engine/trainer.py", line 287, in train
self.eval_iteration(step)
File "/home/user/nerfstudio/nerfstudio/utils/decorators.py", line 70, in wrapper
ret = func(self, *args, **kwargs)
File "/home/user/nerfstudio/nerfstudio/utils/profiler.py", line 112, in inner
out = func(*args, **kwargs)
File "/home/user/nerfstudio/nerfstudio/engine/trainer.py", line 520, in eval_iteration
metrics_dict, images_dict = self.pipeline.get_eval_image_metrics_and_images(step=step)
File "/home/user/nerfstudio/nerfstudio/utils/profiler.py", line 112, in inner
out = func(*args, **kwargs)
File "/home/user/nerfstudio/nerfstudio/pipelines/base_pipeline.py", line 339, in get_eval_image_metrics_and_images
outputs = self.model.get_outputs_for_camera(camera)
File "/home/user/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/user/nerfstudio/nerfstudio/models/splatfacto.py", line 914, in get_outputs_for_camera
outs = self.get_outputs(camera.to(self.device))
File "/home/user/nerfstudio/nerfstudio/models/splatfacto.py", line 767, in get_outputs
self.xys.retain_grad()
RuntimeError: can't retain_grad on Tensor that has requires_grad=False
A check on requires_grad might be missing.
To Reproduce
Steps to reproduce the behavior:
- Run splatfacto training with, for example: ns-train splatfacto --data data/nerfstudio/my_dataset --vis "viewer+tensorboard" --pipeline.model.rasterize-mode antialiased
- At some point, training might fail randomly with the error above.
Expected behavior
No random error.
Screenshots N/A
Additional context N/A
# Important to allow xys grads to populate properly
if self.training:
    try:
        self.xys.retain_grad()
    except Exception as e:
        print(e)
can't retain_grad on Tensor that has requires_grad=False
Exception in thread Thread-7:
Traceback (most recent call last):
File "C:\Users\jyomu\scoop\persist\rye\py\[email protected]\install\lib\threading.py", line 1016, in _bootstrap_inner
self.run()
File "E:\AI\nerfstudio\nerfstudio\viewer\render_state_machine.py", line 222, in run
outputs = self._render_img(action.camera_state)
File "E:\AI\nerfstudio\nerfstudio\viewer\render_state_machine.py", line 177, in _render_img
assert len(outputs["depth"].shape) == 3
AttributeError: 'NoneType' object has no attribute 'shape'
I'm facing the same issue when I run ns-train splatfacto. Is there any resolution for this?
@jyomu @bchretien I'm facing the same issue. Any leads on the cause of this?
@InduCherukuri / @Exception4U: as a simple workaround (but probably not solving the root cause):
# Important to allow xys grads to populate properly
if self.training and self.xys.requires_grad:
    self.xys.retain_grad()
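For anyone who wants to see the failure in isolation, here is a minimal sketch in plain PyTorch, outside nerfstudio. It assumes the failing call runs with autograd disabled; the traceback passes through torch's context-decorator wrapper, which would be consistent with a torch.no_grad() decorator on get_outputs_for_camera, but that is an assumption. The names means, project and retain_grad_if_tracked are hypothetical stand-ins, not nerfstudio code.

```python
import torch

# Hypothetical stand-in for the parameters that produce `xys` in the traceback.
means = torch.randn(4, 2, requires_grad=True)


def project(points: torch.Tensor) -> torch.Tensor:
    """Toy differentiable op; the real projection is far more involved."""
    return points * 2.0


def retain_grad_if_tracked(tensor: torch.Tensor) -> bool:
    """Call retain_grad() only when autograd is tracking the tensor."""
    if tensor.requires_grad:
        tensor.retain_grad()
        return True
    return False


# Training-style forward pass: autograd is on, so the guard retains the grad.
xys_train = project(means)
retained_train = retain_grad_if_tracked(xys_train)
xys_train.sum().backward()
has_grad = xys_train.grad is not None

# Eval-style forward pass under no_grad: the result is not tracked by autograd.
with torch.no_grad():
    xys_eval = project(means)
    retained_eval = retain_grad_if_tracked(xys_eval)  # skipped, no exception
    try:
        xys_eval.retain_grad()  # unguarded call, as in the traceback
        error_message = ""
    except RuntimeError as err:
        error_message = str(err)

print(retained_train, has_grad, retained_eval)
print(error_message)
```

The guard only avoids the exception. If the model is in training mode while an eval or viewer render is running, that underlying mismatch is still there.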
Do you have a sample dataset with which we can reproduce the error?
@jb-ye: it happened somewhat at random, mostly when I interacted with the viewer and changed settings (e.g. the rendering resolution). Alas, I cannot share the associated dataset.
I am experiencing the same issue. It appears that when the viewer renders the image, the model is not correctly set to eval mode.
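If the cause really is the train/eval flag being flipped while another thread is mid-render, one possible mitigation is to switch modes through a context manager that holds a lock and restores whatever mode was active before, rather than unconditionally switching back to training. The sketch below uses only the standard library. ToyModel, mode_lock and eval_mode are hypothetical names for illustration, not part of nerfstudio or PyTorch.

```python
import threading
from contextlib import contextmanager


class ToyModel:
    """Hypothetical stand-in for a model that carries a train/eval flag."""

    def __init__(self) -> None:
        self.training = True
        # Re-entrant so nested eval_mode blocks on one thread do not deadlock.
        self.mode_lock = threading.RLock()


@contextmanager
def eval_mode(model: ToyModel):
    """Hold the lock, switch to eval, then restore the previously active mode."""
    with model.mode_lock:
        previous = model.training
        model.training = False
        try:
            yield model
        finally:
            model.training = previous


model = ToyModel()
seen_modes = []


def render() -> None:
    # Each "render" observes the flag while it holds the lock.
    with eval_mode(model):
        seen_modes.append(model.training)


threads = [threading.Thread(target=render) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

# Nested use restores the outer block's mode, not a hard-coded "train".
with eval_mode(model):
    with eval_mode(model):
        pass
    nested_mode = model.training

print(seen_modes, nested_mode, model.training)
```

For this to help, the training step would need to take the same lock, which serializes rendering against training. Whether that trade-off is acceptable here is an open question.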