
gradient error in Joint Optimization

Open · hongsiyu opened this issue 2 years ago • 6 comments

I trained successfully in shape pre-training but am stuck in joint optimization:

2022-09-27 02:30:25.358618: E tensorflow/core/kernels/check_numerics_op.cc:289] abnormal_detected_host @0x7f43f6808a00 = {1, 0} Not a number (NaN) or infinity (Inf) values detected in gradient. b'Albedo'
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Not a number (NaN) or infinity (Inf) values detected in gradient. b'Albedo' : Tensor had NaN values
     [[node gradient_tape/model/CheckNumerics (defined at tmp/tmp398ckawp.py:22) ]]
     [[Identity_6/_372]]
  (1) Invalid argument: Not a number (NaN) or infinity (Inf) values detected in gradient. b'Albedo' : Tensor had NaN values
     [[node gradient_tape/model/CheckNumerics (defined at tmp/tmp398ckawp.py:22) ]]
0 successful operations.
0 derived errors ignored. [Op:__inference_distributed_train_step_45946]

hongsiyu · Sep 26 '22 10:09
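The error itself comes from TensorFlow's check_numerics guard, which aborts training as soon as a NaN or Inf shows up in the 'Albedo' gradients. Besides lowering the learning rate, a generic TF2 workaround is to clip the gradients by global norm and skip any step whose gradients are still non-finite. The sketch below uses a placeholder model and loss for illustration, not NeRFactor's actual training loop:

import tensorflow as tf

# Hypothetical stand-ins for the real model and data; the clipping /
# NaN-skip pattern is the point here.
model = tf.keras.Sequential([tf.keras.layers.Dense(3)])
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-4)  # lowered lr

def train_step(x, y):
    with tf.GradientTape() as tape:
        pred = model(x)
        loss = tf.reduce_mean(tf.square(pred - y))  # placeholder loss
    grads = tape.gradient(loss, model.trainable_variables)
    # Clip by global norm to tame exploding gradients.
    grads, _ = tf.clip_by_global_norm(grads, 1.0)
    # Skip the update entirely if any gradient is non-finite (NaN/Inf).
    finite = tf.reduce_all(tf.stack(
        [tf.reduce_all(tf.math.is_finite(g)) for g in grads]))
    if finite:
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# Example usage with random data:
x = tf.random.normal([4, 8])
y = tf.random.normal([4, 3])
print(float(train_step(x, y)))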

I use my own data, whose cameras are computed by COLMAP.

hongsiyu · Sep 26 '22 10:09

Turn down the learning rate. Same as you, I trained on my own data created with Blender: with the default learning rate (5e-3) I got the same error, but after lowering it to 5e-4 everything was fine.

Jiangyu1181 · Sep 28 '22 07:09

I have set the lr to 5e-4 and 5e-5, and I still get the same error.

hongsiyu · Sep 29 '22 03:09

Did you override lr via config_override in the Joint Optimization (training and validation) step? e.g. --config_override="lr=$lr".

Jiangyu1181 · Sep 29 '22 08:09

Yep, I directly changed lr in the config shape_mvs.ini

hongsiyu · Sep 29 '22 12:09

And in nerfactor_mvs.ini as well.

hongsiyu · Sep 30 '22 02:09
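If it is unclear which lr value training actually picks up, one quick standalone check is to read the configs back with Python's configparser. This assumes the configs are plain INI files with an lr key, as the --config_override="lr=$lr" example above suggests; the file paths and the DEFAULT-section lookup are illustrative:

# Sanity check: print the lr value found in each config file.
import configparser

for path in ["shape_mvs.ini", "nerfactor_mvs.ini"]:  # paths are illustrative
    cfg = configparser.ConfigParser()
    cfg.read(path)
    # Look for `lr` in the DEFAULT section first, then in any other section.
    lr = cfg.defaults().get("lr")
    if lr is None:
        for section in cfg.sections():
            if cfg.has_option(section, "lr"):
                lr = cfg.get(section, "lr")
                break
    print(f"{path}: lr = {lr}")

If this still prints the default 5e-3, the edited ini files are not the ones being read, and passing the value explicitly via --config_override would be the safer route.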