nerf-pytorch
RuntimeError: Function 'PowBackward0' returned nan values in its 0th output.
During the learning process, the following error occurs and learning is interrupted.
[TRAIN] Iter: 40300 Loss: 0.011321269907057285 PSNR: 23.059185028076172
11%|█████████████████████▎ | 20356/180001 [1:25:30<11:07:17, 3.99it/s][W python_anomaly_mode.cpp:104] Warning: Error detected in PowBackward0. Traceback of forward call that caused the error:
File "run_nerf.py", line 858, in <module>
train()
File "run_nerf.py", line 751, in train
img_loss0 = img2mse(extras['rgb0'], target_s)
File "/app/nerf/run_nerf_helpers.py", line 12, in <lambda>
img2mse = lambda x, y : torch.mean((x - y) ** 2)
(function _print_stack)
11%|█████████████████████▎ | 20356/180001 [1:25:30<11:10:36, 3.97it/s]
Traceback (most recent call last):
File "run_nerf.py", line 858, in <module>
train()
File "run_nerf.py", line 755, in train
loss.backward()
File "/opt/conda/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
Variable._execution_engine.run_backward(
RuntimeError: Function 'PowBackward0' returned nan values in its 0th output.
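For context, the forward-call traceback above is produced by PyTorch's anomaly detection, which the run evidently had switched on. A minimal sketch of how that mode is enabled (the tensor names are placeholders, not the repo's exact code):

import torch

# Anomaly detection makes autograd record the forward op that later produces
# NaN gradients (here PowBackward0, i.e. the ** 2 inside img2mse) and report it.
torch.autograd.set_detect_anomaly(True)

img2mse = lambda x, y: torch.mean((x - y) ** 2)
rgb = torch.rand(1024, 3, requires_grad=True)   # placeholder prediction
target_s = torch.rand(1024, 3)                  # placeholder ground truth

loss = img2mse(rgb, target_s)
loss.backward()  # if rgb contained NaN, this would raise the RuntimeError shown above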
Here's my configuration.
expname = mydata_test
basedir = ./logs
datadir = ./data/nerf_llff_data/mydata
dataset_type = llff
factor = 8
llffhold = 8
N_rand = 1024
N_samples = 64
N_importance = 64
use_viewdirs = True
raw_noise_std = 1e0
Maybe this can be fixed by adding eps here?
Have the same issue here. Any solutions?
Maybe this can be fixed by adding eps here? Sorry sir, what is eps?
eps means epsilon (ε), i.e. a very small constant such as 0.0000001 (1e-7).
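For illustration only, "adding eps" usually means guarding a division (or a value fed into a power, log, or sqrt) with that small constant so it cannot blow up to inf or NaN. The names below are placeholders, not the exact line the earlier comment points at:

import torch

eps = 1e-7  # "eps" = epsilon, a tiny constant

depth = torch.rand(1024)         # placeholder tensors, for illustration only
weights_sum = torch.rand(1024)

# Risky: if the denominator reaches 0, the result is inf, which turns into NaN
# as soon as it is multiplied by 0 or differentiated.
# disp = 1.0 / (depth / weights_sum)

# Safer: keep the denominator bounded away from zero before dividing.
disp = 1.0 / torch.clamp(depth / (weights_sum + eps), min=eps)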
Do you have code to reproduce the error?
I saw m1kit was testing his own set, and my error popped up when I was training my own set as well. I used COLMAP to get the camera position info, and it can run for around 20k iterations, but it stops randomly at some point with RuntimeError: Function 'PowBackward0' returned nan values in its 0th output.
I tried the fern sample set and got no error at all (sometimes it runs out of GPU memory, but no error once I reduced the settings). I did not change much in the code except adding DataParallel to use all four GPUs at the same time.
I'm just wondering what m1kit did and am waiting for his response.
Unfortunately, for personal reasons, I cannot provide the dataset that caused this error. To be honest, it was 4 months ago, so it's hard to remember how to reproduce it in detail. I apologize for not being able to help you.
Hello, I encountered the same problem when using SCNeRF, which borrows heavily from this repository, to train on custom data.
Data
The data can be accessed through this google drive link: https://drive.google.com/drive/folders/1SUzKMn6oD4inzN-m7RmHVl7gGEnq-Iv4?usp=sharing
Logs
[TRAIN] Iter: 209100 Loss: 0.006338230334222317 PSNR: 25.197158813476562
[TRAIN] Iter: 209200 Loss: 0.007395393215119839 PSNR: 24.48368263244629
[TRAIN] Iter: 209300 Loss: 0.007888318039476871 PSNR: 24.342876434326172
[TRAIN] Iter: 209400 Loss: 0.00826267059892416 PSNR: 24.05372428894043
[TRAIN] Iter: 209500 Loss: 0.0067442795261740685 PSNR: 24.944828033447266
Starts Validation Rendering
VAL PSNR 144: 22.382625579833984
Validation PRD : 0.4792793095111847
File "run_nerf.py", line 1052, in <module>
train()
File "run_nerf.py", line 506, in train
train_loss_0 = img2mse(extras['rgb0'], target_s)
File "/home/julius_m/code/SCNeRF/NeRF/run_nerf_helpers.py", line 10, in <lambda>
img2mse = lambda x, y : torch.mean((x - y) ** 2)
(function _print_stack)
26%|██████████████████████████████████████████▋ | 209573/800000 [8:00:36<22:34:01, 7.27it/s]
Traceback (most recent call last):
File "run_nerf.py", line 1052, in <module>
train()
File "run_nerf.py", line 606, in train
train_loss.backward()
File "/home/julius_m/miniconda3/envs/icn/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/julius_m/miniconda3/envs/icn/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
Variable._execution_engine.run_backward(
RuntimeError: Function 'PowBackward0' returned nan values in its 0th output.
! [Numerical Error] rgb_map contains nan or inf.
! [Numerical Error] disp_map contains nan or inf.
! [Numerical Error] acc_map contains nan or inf.
! [Numerical Error] raw contains nan or inf.
! [Numerical Error] rgb0 contains nan or inf.
! [Numerical Error] disp0 contains nan or inf.
! [Numerical Error] acc0 contains nan or inf.
! [Numerical Error] z_std contains nan or inf.
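Those "! [Numerical Error] ..." lines come from a debug check that scans the rendering outputs for non-finite values; roughly, it looks like the sketch below (the actual code in run_nerf.py may differ in detail):

import torch

DEBUG = True

def report_nonfinite(ret):
    # Print one warning per returned tensor (rgb_map, disp_map, ...) that
    # contains NaN or inf, matching the messages in the log above.
    for k in ret:
        if DEBUG and (torch.isnan(ret[k]).any() or torch.isinf(ret[k]).any()):
            print(f"! [Numerical Error] {k} contains nan or inf.")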
Launch script
cd NeRF
python run_nerf.py \
--config configs/llff_data/lamp.txt \
--expname lamp \
--chunk 8192 \
--N_rand 1024 \
--camera_model pinhole_rot_noise_10k_rayo_rayd \
--ray_loss_type proj_ray_dist \
--multiplicative_noise True \
--i_ray_dist_loss 10 \
--grid_size 10 \
--ray_dist_loss_weight 0.0001 \
--N_iters 800001 \
--use_custom_optim True \
--ray_o_noise_scale 1e-3 \
--ray_d_noise_scale 1e-3 \
--non_linear_weight_decay 0.1 \
--add_ie 200000 \
--add_od 400000 \
--add_prd 600000
Config
Note: make sure to change the datadir to where you downloaded the above data.
configs/llff_data/lamp.txt
expname = lamp
basedir = ./logs
datadir = <path_to_lamp_dir>/lamp
dataset_type = llff
factor = 8
llffhold = 8
N_rand = 1024
N_samples = 64
N_importance = 64
use_viewdirs = True
raw_noise_std = 1e0
I can confirm this problem is happening to me on https://github.com/apchenstu/mvsnerf, trying out with either the lego synthetic dataset, or the orchid llff dataset.
I'll try to see how to make this reproducible.
Hello, I encountered the same problem when using SCNeRF, which borrows heavily from this repository, to train on custom data.
Data
The data can be accessed through this google drive link: https://drive.google.com/drive/folders/1SUzKMn6oD4inzN-m7RmHVl7gGEnq-Iv4?usp=sharing
Hi @AugustasMacijauskas did you have any success training with your custom dataset?
@davodogster No, I lost my patience and moved on to other things. I was also having a hard time figuring out how to debug this efficiently, since training for a few hours before it crashes and then changing one line of code and seeing if that helps is not going to work.
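One way to make this cheaper to investigate (just a sketch, not code from this repo; the argument names only roughly mirror the variables in run_nerf.py's train loop): detect the NaN the moment it first appears and dump the offending batch plus the current weights, so the failing step can be replayed in seconds instead of re-training for hours.

import torch

def check_and_snapshot(loss, i, batch_rays, target_s, model, path='nan_debug_snapshot.pt'):
    # If the loss is no longer finite, save the batch that triggered it and the
    # current weights so the step can be replayed offline, then stop training.
    if not torch.isfinite(loss):
        torch.save({
            'iteration': i,
            'batch_rays': batch_rays,
            'target_s': target_s,
            'model_state': model.state_dict(),
        }, path)
        raise RuntimeError(f'NaN/inf loss at iteration {i}; snapshot saved to {path}')

Called right after the loss is computed, this turns a multi-hour reproduction into loading a single .pt file.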
If the error is in the 0th output, that means your weights are not yet fully trained, so some values in some batch's predictions during your first epochs are NaNs. So it is not your inputs but your model's predictions that are NaNs; it could be an overflow or underflow error. This will make any loss function give you tensor(nan).
What you can do is add a check for when the loss is NaN and let the weights keep adjusting themselves:
criterion = SomeLossFunc()   # whichever loss you are using
eps = 1e-6

loss = criterion(preds, targets)
# If the loss came out as NaN, swap in a tiny constant so backward() does not
# push NaN gradients into the weights. Note: do NOT call loss.item() here,
# since converting to a Python float would detach the loss from the graph.
if torch.isnan(loss):
    loss = torch.full_like(loss, eps, requires_grad=True)
loss = loss + L1_loss  # + ... any other loss terms you have