Traceback (most recent call last):
File "train.py", line 202, in <module>
lr=args.lr, device=device, img_scale=args.scale, val_percent=args.val / 100)
File "train.py", line 95, in train_net
scaled_loss.backward()
File "D:\ProgramData\Anaconda3\envs\UNet3plus\lib\contextlib.py", line 119, in __exit__
next(self.gen)
File "D:\ProgramData\Anaconda3\envs\UNet3plus\lib\site-packages\apex\amp\handle.py", line 123, in scale_loss
optimizer._post_amp_backward(loss_scaler)
File "D:\ProgramData\Anaconda3\envs\UNet3plus\lib\site-packages\apex\amp\_process_optimizer.py", line 249, in post_backward_no_master_weights
post_backward_models_are_masters(scaler, params, stashed_grads)
File "D:\ProgramData\Anaconda3\envs\UNet3plus\lib\site-packages\apex\amp\_process_optimizer.py", line 135, in post_backward_models_are_masters
scale_override=(grads_have_scale, stashed_have_scale, out_scale))
File "D:\ProgramData\Anaconda3\envs\UNet3plus\lib\site-packages\apex\amp\scaler.py", line 183, in unscale_with_stashed
out_scale/grads_have_scale,
ZeroDivisionError: float division by zero
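Training fails with this error at epoch 2. I read online that lowering the learning rate by an order of magnitude fixes it, so I changed lr from 0.01 to 0.001, but the same error came back at epoch 26.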
Is there another way to get rid of this error, and what causes it? Thanks.
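
One possible explanation, sketched in plain Python: if the dynamic loss scale is halved on every step whose gradients overflow (inf/NaN) and has no lower bound, enough consecutive overflows would shrink it to exactly 0.0, and the division in `unscale_with_stashed` would then fail as in the traceback. This is only a guess at the mechanism, not apex's actual code; `ToyLossScaler`, `steps_until_zero`, `min_scale`, and the starting value of 2^16 are hypothetical names and numbers chosen for illustration.

```python
class ToyLossScaler:
    """Hypothetical stand-in for a dynamic loss scaler; NOT apex's code."""

    def __init__(self, init_scale=2.0 ** 16, min_scale=None):
        self.scale = init_scale
        self.min_scale = min_scale  # optional floor; None means no floor

    def update(self, overflow):
        # Halve the scale whenever the step produced inf/NaN gradients.
        if overflow:
            self.scale /= 2.0
            if self.min_scale is not None:
                self.scale = max(self.scale, self.min_scale)

    def unscale(self, grad):
        # Raises ZeroDivisionError once the scale has underflowed to 0.0.
        return grad / self.scale


def steps_until_zero(scaler, max_steps=5000):
    """Count consecutive overflowing steps until the scale becomes 0.0.

    Returns None if the scale is still non-zero after max_steps.
    """
    for step in range(1, max_steps + 1):
        scaler.update(overflow=True)
        if scaler.scale == 0.0:
            return step
    return None


if __name__ == "__main__":
    unbounded = ToyLossScaler()
    print("steps until scale hits zero:", steps_until_zero(unbounded))
    try:
        unbounded.unscale(1.0)
    except ZeroDivisionError as exc:
        print("unscale failed:", exc)

    floored = ToyLossScaler(min_scale=1.0)
    print("with a floor:", steps_until_zero(floored), floored.scale)
```

If this guess is right, a smaller lr would only delay the failure, which matches what I see. As far as I can tell from the apex documentation, `amp.initialize` takes a `min_loss_scale` argument that might act as such a floor, but a scale that keeps shrinking probably means the loss or gradients are already inf/NaN on every step, so that would need checking first.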