tiny-cuda-nn
Backward error during running python example
Hi! Thanks for the great work!
I have successfully installed the PyTorch bindings, but when I run the script samples/mlp_learning_an_image_pytorch.py, the following error appears:
Traceback (most recent call last):
File "samples/mlp_learning_an_image_pytorch.py", line 161, in <module>
loss.backward()
File "/mnt/sdg/anaconda3/envs/pytorch3d/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/mnt/sdg/anaconda3/envs/pytorch3d/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
Variable._execution_engine.run_backward(
File "/mnt/sdg/anaconda3/envs/pytorch3d/lib/python3.8/site-packages/torch/autograd/function.py", line 89, in apply
return self._forward_cls.backward(self, *args) # type: ignore
File "/mnt/sdg/anaconda3/envs/pytorch3d/lib/python3.8/site-packages/tinycudann-1.5-py3.8-linux-x86_64.egg/tinycudann/modules.py", line 49, in backward
input_grad, weight_grad = _module_function_backward.apply(ctx, doutput, input, params, output)
TypeError: _module_function_backwardBackward.forward: expected Tensor or tuple of Tensor (got NoneType) for return value 0
I am using PyTorch 1.7.1, CUDA 10.2, and an RTX 2080 Ti. Do you know how to fix this error? Thank you very much!
Hi, and thank you also for this fantastic contribution! :)
Just to note, I can observe the same behaviour with a simple snippet of code as well as with the example mentioned above. It:
- runs fine on Windows with an RTX 3090;
- fails with the same error on Ubuntu 20.04 with an older card (sm_60).
This is using PyTorch LTS 1.8.2, CUDA 11.1, and tiny-cuda-nn built from scratch using master at commit 5b1ff5e26b0809ac1b7618beaa54d397101d376d. It happens with both FullyFused and Cutlass MLPs, in FP32 or FP16 mode (with hash encoding, that is). On the older card, tiny-cuda-nn correctly falls back to Cutlass, but then fails with the above error. It doesn't appear to be connected to any specific driver version on Ubuntu.
Happy to help debug this if it can't be reproduced, and I will comment here if I figure out what is happening.
Best, Stephan
I ran into the same problem with PyTorch 1.8.2 + tinycudann >= 1.5. It works fine with tinycudann 1.4.
I solved the problem by upgrading PyTorch to the latest version (1.11.0). Hope this helps.
@FrozenSilent @StephanGarbin
Updating to PyTorch 1.10.0 also solved my problem.
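The version reports above can be turned into a quick check. This is only a sketch: the thread reports failures on PyTorch 1.7.1 and 1.8.2 and successes on 1.10.0 and 1.11.0 (with tinycudann 1.5 or later), and nobody reported on 1.9.x, so using 1.10 as the threshold is an assumption. The helper names `parse_version` and `is_reported_working` are hypothetical and not part of tiny-cuda-nn or PyTorch.

```python
import re

# Assumption: 1.10 is the threshold. The thread only reports that 1.7.1 and
# 1.8.2 fail and that 1.10.0 and 1.11.0 work; 1.9.x was never mentioned.
MIN_WORKING = (1, 10)


def parse_version(version):
    """Return (major, minor) from a string such as '1.8.2+cu111'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_reported_working(version):
    """True if the version is at least the lowest one reported to work."""
    return parse_version(version) >= MIN_WORKING
```

To check an installed build, pass `str(torch.__version__)` to `is_reported_working`. Note that this only concerns the TypeError in the first post; the nan error discussed further down was reported on newer PyTorch versions too.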
Hi, I encountered a similar issue. Sometimes it crashes at random. However, I have never hit this issue without the AMP + tiny-cuda-nn combination, even with the same architecture.
My environment is PyTorch 1.11.0, tinycudann 1.5, CUDA 11.3.
Below is the network configuration of my MLP, followed by the error output. I don't know why it happens.
self.sigma_net = tcnn.Network(
    n_input_dims=self.in_channels_xyz,
    n_output_dims=W+1,
    network_config={
        "otype": "FullyFusedMLP",
        "activation": "ReLU",
        "output_activation": "None",
        "n_neurons": 64, # should be W
        "n_hidden_layers": self.num_layers - 1, # num_layers-1
    },
)
/root/miniconda3/envs/torch-ngp/lib/python3.8/site-packages/torch/autograd/__init__.py:173: UserWarning: Error detected in _module_functionBackward. Traceback of forward call that caused the error:
File "run_nerf.py", line 332, in <module>
File "run_nerf.py", line 323, in train
File "run_nerf.py", line 174, in train_nerf
loss, psnr = train_nerf_on_epoch(args, train_dl, H, W, focal, N_rand, optimizer, loss_func, global_step, render_kwargs_train, scaler)
File "run_nerf.py", line 55, in train_nerf_on_epoch
rgb, disp, acc, extras = render(H, W, focal, chunk=args.chunk, rays=batch_rays, retraw=True, img_idx=img_idx, **render_kwargs_train)
File "/home/shuaic/storage/nerf-pytorch-dev/script/models/rendering.py", line 261, in render
all_ret = batchify_rays(rays, chunk, **kwargs)
File "/home/shuaic/storage/nerf-pytorch-dev/script/models/rendering.py", line 206, in batchify_rays
ret = render_rays(rays_flat[i:i+chunk], **kwargs)
File "/home/shuaic/storage/nerf-pytorch-dev/script/models/rendering.py", line 119, in render_rays
raw = network_query_fn(pts, viewdirs, None, network_fn, 'coarse', False, test_time=test_time)
File "/home/shuaic/storage/nerf-pytorch-dev/script/models/nerfh_tcnn.py", line 288, in <lambda>
run_NeRFH_TCNN(inputs, viewdirs, ts, network_fn, typ=typ,
File "/home/shuaic/storage/nerf-pytorch-dev/script/models/nerfh_tcnn.py", line 394, in run_NeRFH_TCNN
out_chunks += [fn(inputs_flat[i: i+netchunk], input_dirs_flat[i:i+netchunk], output_transient=output_transient)]
File "/root/miniconda3/envs/torch-ngp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/shuaic/storage/nerf-pytorch-dev/script/models/nerfh_tcnn.py", line 250, in forward
density_outputs = self.density(x) # [65536, 3]
File "/home/shuaic/storage/nerf-pytorch-dev/script/models/nerfh_tcnn.py", line 162, in density
h = self.sigma_net(x)
File "/root/miniconda3/envs/torch-ngp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/root/miniconda3/envs/torch-ngp/lib/python3.8/site-packages/tinycudann/modules.py", line 119, in forward
output = _module_function.apply(
(Triggered internally at /opt/conda/conda-bld/pytorch_1646755903507/work/torch/csrc/autograd/python_anomaly_mode.cpp:104.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
47%|█████████████████████████████████████████████▉ | 235/501 [10:35<11:58, 2.70s/it]
Traceback (most recent call last):
File "run_nerf.py", line 332, in <module>
File "run_nerf.py", line 323, in train
File "run_nerf.py", line 174, in train_nerf
loss, psnr = train_nerf_on_epoch(args, train_dl, H, W, focal, N_rand, optimizer, loss_func, global_step, render_kwargs_train, scaler)
File "run_nerf.py", line 74, in train_nerf_on_epoch
scaler.scale(loss).backward()
File "/root/miniconda3/envs/torch-ngp/lib/python3.8/site-packages/torch/_tensor.py", line 363, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/root/miniconda3/envs/torch-ngp/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Function '_module_functionBackward' returned nan values in its 0th output.
I keep getting the exact same error with pytorch==1.13.1+cu117. This happens randomly roughly once every 7-8 minutes of training.
The likely cause is an overflow or underflow somewhere in your pipeline, in the steps that follow the tcnn network.
The float16 values that tcnn produces have a maximum of 65504. If you add a layer after the output of the tcnn MLP, such as a softplus, it can overflow, because (if I remember correctly) the backward pass computes that layer's gradient from the tcnn MLP output. That is what produces the nan values.
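To make the float16 range concrete, here is a small standard-library sketch that needs no GPU. It rounds Python floats to half precision with the `struct` module's `e` format; the helper name `to_float16` is hypothetical. It only illustrates the numeric range, not what tcnn does internally.

```python
import math
import struct

FLOAT16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_float16(x):
    """Round x to float16 and return it as a Python float (inf on overflow)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the float16 range; float16
        # arithmetic on hardware would give a signed infinity instead.
        return math.copysign(math.inf, x)


# exp() leaves the float16 range once its argument exceeds ln(65504).
print(math.log(FLOAT16_MAX))        # about 11.09
print(to_float16(math.exp(11.0)))   # still finite
print(to_float16(math.exp(12.0)))   # inf

# Once an inf is in the graph, ordinary arithmetic turns it into nan.
overflowed = to_float16(math.exp(12.0))
print(overflowed - overflowed)      # nan
```

So an `exp` applied to a half-precision value larger than roughly 11.09 already overflows, which is easy to reach with an unbounded MLP output.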
I found this by wrapping the backward pass in try-except to catch the error and then checking where the gradient became nan. In my case the cause was an exp operation right after the tcnn MLPs. Hope this helps.
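One way to locate where a gradient goes bad, sketched here with plain PyTorch on the CPU, is to register gradient hooks on intermediate tensors. This is not the exact procedure described above, only an illustration of the same idea. The helper `watch_grad` and the tensor names are hypothetical, `x * 1.0` stands in for the tcnn MLP output, and float32 with an input of 100 is used so the overflow reproduces without a GPU or AMP.

```python
import torch


def watch_grad(name, tensor, bad):
    """Append `name` to `bad` if the gradient reaching `tensor` is non-finite."""
    def hook(grad):
        if not torch.isfinite(grad).all():
            bad.append(name)
    tensor.register_hook(hook)
    return tensor


bad = []
x = torch.tensor([1.0, 100.0], requires_grad=True)
h = watch_grad("mlp_output", x * 1.0, bad)      # stand-in for the MLP output
y = watch_grad("after_exp", torch.exp(h), bad)  # exp(100) overflows float32
y.sum().backward()

# The gradient arriving at "after_exp" is finite, but the one arriving at
# "mlp_output" is not, so the op between them (exp) is the culprit.
print(bad)
```

In a real training loop the same hooks can be attached to the network output and to each tensor derived from it. A common remedy, which is an assumption here rather than something confirmed in this thread, is to cast the output to float32 or clamp it before applying `exp`.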