tiny-cuda-nn

Backward error during running python example

Open FrozenSilent opened this issue 3 years ago • 4 comments

Hi! Thanks for the great work! I have successfully installed the PyTorch bindings, but when I run the script samples/mlp_learning_an_image_pytorch.py, the following error appears:

Traceback (most recent call last):
  File "samples/mlp_learning_an_image_pytorch.py", line 161, in <module>
    loss.backward()
  File "/mnt/sdg/anaconda3/envs/pytorch3d/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/mnt/sdg/anaconda3/envs/pytorch3d/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
  File "/mnt/sdg/anaconda3/envs/pytorch3d/lib/python3.8/site-packages/torch/autograd/function.py", line 89, in apply
    return self._forward_cls.backward(self, *args)  # type: ignore
  File "/mnt/sdg/anaconda3/envs/pytorch3d/lib/python3.8/site-packages/tinycudann-1.5-py3.8-linux-x86_64.egg/tinycudann/modules.py", line 49, in backward
    input_grad, weight_grad = _module_function_backward.apply(ctx, doutput, input, params, output)
TypeError: _module_function_backwardBackward.forward: expected Tensor or tuple of Tensor (got NoneType) for return value 0

I am using PyTorch 1.7.1, CUDA 10.2, and an RTX 2080 Ti. Do you know how to fix this error? Thank you very much!

FrozenSilent, Apr 20 '22 13:04

Hi, and thank you also for this fantastic contribution! :)

Just to note, I can observe the same behaviour on a simple snippet of code as well as on the example mentioned above, in that it:

  • runs fine on windows with an RTX 3090;
  • fails with the same error on Ubuntu 20.04 with an older card (sm_60)

This is with PyTorch LTS 1.8.2, CUDA 11.1, and tiny-cuda-nn built from scratch using master at commit 5b1ff5e26b0809ac1b7618beaa54d397101d376d. It happens with both FullyFused and Cutlass MLPs, in FP32 and FP16 modes (with hash encoding, that is). On the older card, tiny-cuda-nn correctly falls back to Cutlass but then fails with the above error. It doesn't appear to be connected to any specific driver version on Ubuntu.

Happy to help debug this if it can't be reproduced, and I will comment here if I figure out what is happening.

Best, Stephan

StephanGarbin, Apr 22 '22 09:04

I ran into the same problem with PyTorch 1.8.2 + tinycudann >= 1.5. It works fine with tinycudann 1.4.

I solved the problem by upgrading PyTorch to the latest version (1.11.0). Hope this helps.

@FrozenSilent @StephanGarbin

bennyguo, Apr 25 '22 07:04

Updating to PyTorch 1.10.0 also solved my problem.

FrozenSilent, Apr 25 '22 07:04
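The version reports so far can be collected into a small lookup. This is only a sketch of what this thread says about the TypeError above (with tinycudann 1.5 or newer), not an official compatibility table; the names parse_version and thread_status are made up for illustration, and versions nobody mentioned are deliberately left as unknown.

```python
import re

# PyTorch versions as reported in this thread for the TypeError in backward.
# Assumption: tinycudann >= 1.5, as in the reports; other versions are unknown.
REPORTED_FAILING = {(1, 7, 1), (1, 8, 2)}
REPORTED_WORKING = {(1, 10, 0), (1, 11, 0)}


def parse_version(version):
    """Turn a string like '1.8.2+cu111' into the tuple (1, 8, 2)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def thread_status(version):
    """Classify a PyTorch version by what was reported in this thread."""
    parsed = parse_version(version)
    if parsed in REPORTED_FAILING:
        return "reported failing"
    if parsed in REPORTED_WORKING:
        return "reported working"
    return "not reported"
```

In a real environment the argument would be torch.__version__, for example thread_status(torch.__version__).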

Hi, I encountered a similar issue here. Training sometimes crashes at random. However, I never hit this issue without the combination of AMP and tiny-cuda-nn, even with the same architecture.

My environment is PyTorch 1.11.0, tinycudann 1.5, CUDA 11.3.

Below is the network configuration of my MLP, followed by the error output. I don't know why it happens.

    self.sigma_net = tcnn.Network(
        n_input_dims=self.in_channels_xyz,
        n_output_dims=W+1,
        network_config={
            "otype": "FullyFusedMLP",
            "activation": "ReLU",
            "output_activation": "None",
            "n_neurons": 64,  # should be W
            "n_hidden_layers": self.num_layers - 1,  # num_layers-1
        },
    )

/root/miniconda3/envs/torch-ngp/lib/python3.8/site-packages/torch/autograd/__init__.py:173: UserWarning: Error detected in _module_functionBackward. Traceback of forward call that caused the error:
  File "run_nerf.py", line 332, in <module>                                                                                              
  File "run_nerf.py", line 323, in train                                                                                                 
  File "run_nerf.py", line 174, in train_nerf                                                                                            
    loss, psnr = train_nerf_on_epoch(args, train_dl, H, W, focal, N_rand, optimizer, loss_func, global_step, render_kwargs_train, scaler)
  File "run_nerf.py", line 55, in train_nerf_on_epoch                                                                                    
    rgb, disp, acc, extras = render(H, W, focal, chunk=args.chunk, rays=batch_rays, retraw=True, img_idx=img_idx, **render_kwargs_train) 
  File "/home/shuaic/storage/nerf-pytorch-dev/script/models/rendering.py", line 261, in render                                           
    all_ret = batchify_rays(rays, chunk, **kwargs)                                                                                       
  File "/home/shuaic/storage/nerf-pytorch-dev/script/models/rendering.py", line 206, in batchify_rays                                    
    ret = render_rays(rays_flat[i:i+chunk], **kwargs)                                                                                    
  File "/home/shuaic/storage/nerf-pytorch-dev/script/models/rendering.py", line 119, in render_rays                                      
    raw = network_query_fn(pts, viewdirs, None, network_fn, 'coarse', False, test_time=test_time)                                        
  File "/home/shuaic/storage/nerf-pytorch-dev/script/models/nerfh_tcnn.py", line 288, in <lambda>                                        
    run_NeRFH_TCNN(inputs, viewdirs, ts, network_fn, typ=typ,                                                                            
  File "/home/shuaic/storage/nerf-pytorch-dev/script/models/nerfh_tcnn.py", line 394, in run_NeRFH_TCNN                                  
    out_chunks += [fn(inputs_flat[i: i+netchunk], input_dirs_flat[i:i+netchunk], output_transient=output_transient)]                     
  File "/root/miniconda3/envs/torch-ngp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl                
    return forward_call(*input, **kwargs)                                                                                                
  File "/home/shuaic/storage/nerf-pytorch-dev/script/models/nerfh_tcnn.py", line 250, in forward                                         
    density_outputs = self.density(x) # [65536, 3]                                                                                       
  File "/home/shuaic/storage/nerf-pytorch-dev/script/models/nerfh_tcnn.py", line 162, in density                                         
    h = self.sigma_net(x)                                                                                                                
  File "/root/miniconda3/envs/torch-ngp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl                
    return forward_call(*input, **kwargs)                                                                                                
  File "/root/miniconda3/envs/torch-ngp/lib/python3.8/site-packages/tinycudann/modules.py", line 119, in forward                         
    output = _module_function.apply(                                                                                                     
 (Triggered internally at  /opt/conda/conda-bld/pytorch_1646755903507/work/torch/csrc/autograd/python_anomaly_mode.cpp:104.)             
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass                                         
 47%|█████████████████████████████████████████████▉                                                    | 235/501 [10:35<11:58,  2.70s/it]
Traceback (most recent call last): 
  File "run_nerf.py", line 332, in <module>
  File "run_nerf.py", line 323, in train
  File "run_nerf.py", line 174, in train_nerf
    loss, psnr = train_nerf_on_epoch(args, train_dl, H, W, focal, N_rand, optimizer, loss_func, global_step, render_kwargs_train, scaler)
  File "run_nerf.py", line 74, in train_nerf_on_epoch
    scaler.scale(loss).backward()
  File "/root/miniconda3/envs/torch-ngp/lib/python3.8/site-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/miniconda3/envs/torch-ngp/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function '_module_functionBackward' returned nan values in its 0th output.

chenusc11, Jul 01 '22 00:07

I keep getting the exact same error with pytorch==1.13.1+cu117. This happens randomly roughly once every 7-8 minutes of training.

LucaBonfiglioli, Mar 06 '23 11:03

The likely reason is an overflow or underflow somewhere in your pipeline, in the steps that come after tcnn.

The float16 values coming out of the tcnn library max out at 65504. If I remember correctly, adding a layer after the tcnn MLP output, such as softplus, can overflow, because the backpropagated gradient has to be computed from the tcnn MLP output. That is what produces the nan values.
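To make the arithmetic concrete, here is a rough standard-library sketch of how exp runs past the float16 range. It only imitates half-precision rounding on the CPU and says nothing about what tcnn or AMP do internally; the helper to_float16 and the sample inputs are made up for illustration.

```python
import math
import struct

FLOAT16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_float16(x):
    """Round x to half precision, turning overflow into +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the float16 range.
        return math.copysign(math.inf, x)


# Largest input for which exp() still fits in float16 (about 11.09).
max_safe_exp_input = math.log(FLOAT16_MAX)

fits = to_float16(math.exp(11.0))        # still finite
overflowed = to_float16(math.exp(12.0))  # exp(12) exceeds 65504, so this is inf

# Once an inf exists, multiplying it by zero (or subtracting another inf)
# yields nan, which is one way a gradient can end up as nan.
poisoned = overflowed * 0.0
```

An input of only about 11.09 is enough to reach the limit, so an exp applied to unbounded network outputs can overflow quickly in half precision.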

I found this out by wrapping the backward pass in try-except to catch the error and then checking where the gradient became nan. In my case the cause was an exp operation after the tcnn MLPs. Hope this helps.

chenusc11, Mar 06 '23 14:03
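The try-except approach described in the last comment could look roughly like the sketch below. It is framework-free so that it runs anywhere; first_non_finite and backward_with_report are hypothetical helper names, and the two callables stand in for the real backward call and for whatever collects the gradients.

```python
import math


def first_non_finite(named_values):
    """Return the name of the first entry containing nan or inf, else None."""
    for name, values in named_values:
        if any(not math.isfinite(v) for v in values):
            return name
    return None


def backward_with_report(run_backward, collect_named_values):
    """Run the backward step; on failure, report where values went bad.

    run_backward: callable performing the backward pass (assumed to raise
        RuntimeError, as in the traceback above).
    collect_named_values: callable returning (name, list_of_floats) pairs,
        for example flattened gradients in forward order.
    """
    try:
        run_backward()
    except RuntimeError as err:
        culprit = first_non_finite(collect_named_values())
        raise RuntimeError(
            f"backward failed; first non-finite values in: {culprit}"
        ) from err
```

With PyTorch, the pairs could be built from model.named_parameters(), taking p.grad.flatten().tolist() for every parameter whose grad is not None.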