CompletionFormer RuntimeError: Function 'PowBackward0' returned nan values in its 0th output.

During training, the following error occurs and training is interrupted.

The only change I made was changing gpus from [0,1,2,3] to [0]. The full log follows. Thank you!

Train | 231111@15:22:36 | Loss = 14.8048 | Lr Warm Up : [0.000405]: 41%|████ | 34797/85896 [5:09:27<7:34:25, 1.87it/s]
Traceback (most recent call last):
  File "/home/user/download/pycharm-community-2023.1.1/plugins/python-ce/helpers/pydev/pydevd.py", line 1496, in _exec
    pydev_imports.execfile(file, globals, locals) # execute the script
  File "/home/user/download/pycharm-community-2023.1.1/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/data/xyy/envs/CompletionFormer-main/src/main.py", line 446, in <module>
    main(args_main)
  File "/data/xyy/envs/CompletionFormer-main/src/main.py", line 421, in main
    while not spawn_context.join():
  File "/data/conda/envs/user/envs/completionformer/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/data/conda/envs/user/envs/completionformer/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/data/xyy/envs/CompletionFormer-main/src/main.py", line 221, in train
    scaled_loss.backward()
  File "/data/conda/envs/user/envs/completionformer/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/data/conda/envs/user/envs/completionformer/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Function 'PowBackward0' returned nan values in its 0th output.
python-BaseException
Backend TkAgg is interactive backend. Turning interactive mode on.
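
For anyone debugging the same message: one common way a power op can yield NaN in the backward pass is a fractional exponent evaluated at zero, where the local derivative p * x**(p-1) is infinite and an upstream gradient of zero turns it into NaN. The log above does not establish that this is the cause here. The following is only a plain-Python sketch of that arithmetic (no PyTorch involved); the function name pow_backward and the restriction to x >= 0 and p > 0 are assumptions made for illustration.

```python
import math


def pow_backward(x, p, grad_out):
    """Chain rule for y = x**p: grad_in = grad_out * p * x**(p - 1).

    Illustrative only; assumes x >= 0 and p > 0.
    """
    if x == 0.0 and 0.0 < p < 1.0:
        # x**(p - 1) diverges at x = 0. Plain Python would raise
        # ZeroDivisionError, so the IEEE-754 result (infinity) is written out.
        local = math.inf
    else:
        local = p * x ** (p - 1.0)
    return grad_out * local


well_behaved = pow_backward(4.0, 0.5, 1.0)   # finite gradient away from zero
exploding = pow_backward(0.0, 0.5, 1.0)      # infinite gradient at zero
not_a_number = pow_backward(0.0, 0.5, 0.0)   # 0 * inf gives NaN

print(well_behaved, exploding, not_a_number)
```

If this pattern turns out to be involved, a usual mitigation is to add a small epsilon to the base before taking the power, or to clamp it away from zero. Whether that applies to the loss used here is something to check against the actual code.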

Nov 12 '23 03:11 xueking1

In addition, the following error occurred during testing. Did you accidentally write the parameter --save_image as --save-image?

The error message is:

main.py: error: unrecognized arguments: --save-image
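
This "unrecognized arguments" message is what Python's argparse prints when an option string does not match any registered flag, and argparse treats the hyphen and underscore spellings as two different strings. A minimal sketch of that behaviour follows. The parser is a hypothetical stand-in, not the real one in main.py, and declaring --save_image as a store_true flag is an assumption.

```python
import argparse


def build_parser():
    # Hypothetical stand-in for the parser in main.py; only the flag
    # spelling matters for this illustration.
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--save_image", action="store_true")
    return parser


parser = build_parser()

# The underscore spelling matches the registered option.
ok_args, ok_unknown = parser.parse_known_args(["--save_image"])

# The hyphen spelling is a different option string, so it is left over as
# unrecognized (parse_args would exit with an error at this point).
bad_args, bad_unknown = parser.parse_known_args(["--save-image"])

print(ok_args.save_image, ok_unknown)
print(bad_args.save_image, bad_unknown)
```

parse_known_args is used here only so that the sketch can show the leftover argument instead of exiting.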

Nov 12 '23 03:11 xueking1

Hi,

If you are using only one GPU, can you try this command:

python main.py --dir_data PATH_TO_NYUv2 --data_name NYU --split_json ../data_json/nyu.json \
    --gpus 0 --loss 1.0*L1+1.0*L2 --batch_size 12 --milestones 36 48 56 64 72 --epochs 72 \
    --log_dir ../experiments/ --save NAME_TO_SAVE

You're right, there is a typo; it should be --save_image. Thanks!

Best, Youmin

Nov 16 '23 09:11 youmi-zym

Okay, thank you! Is there any way to fix the problem of KITTI DC?

Nov 16 '23 10:11 xueking1

Indeed, as you said, the problem was solved when I changed gpus from [0] to [0,1].

Nov 17 '23 01:11 xueking1