
NaN

Open cymdhx opened this issue 3 years ago • 28 comments

When I use it, I get NaN (see attached image).

cymdhx avatar Apr 01 '21 11:04 cymdhx

Please specify the experimental details.

d-li14 avatar Apr 02 '21 10:04 d-li14

Please specify the experimental details.

I added it to a YOLO network, replacing the conv in the PANet layers with involution. With conv there is no NaN, but after switching to involution the loss becomes NaN.

cymdhx avatar Apr 02 '21 12:04 cymdhx

Please specify the experimental details.

I added it to a YOLO network, replacing the conv in the PANet layers with involution. With conv there is no NaN, but after switching to involution the loss becomes NaN.

Like this: [image]

cymdhx avatar Apr 02 '21 12:04 cymdhx

Please specify the experimental details.

I added it to a YOLO network, replacing the conv in the PANet layers with involution. With conv there is no NaN, but after switching to involution the loss becomes NaN.

Is there any way to solve this? I tried scaling down the loss, but it didn't help.

cymdhx avatar Apr 02 '21 12:04 cymdhx

You may try the gradient clipping method, which we also use sometimes when training our detection models; for example: https://github.com/d-li14/involution/blob/main/det/configs/involution/retinanet_red50_neck_fpn_1x_coco.py#L8
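For reference, a minimal plain-PyTorch sketch of what the linked mmdetection config switches on; the clipping threshold below is illustrative (check the linked file for the values actually used in the repo):

```python
import torch
import torch.nn as nn

MAX_NORM = 35.0  # illustrative value, not necessarily the one in the config

model = nn.Linear(10, 1)
out = model(torch.randn(4, 10))
loss = (out * 1e6).pow(2).mean()  # artificially huge loss -> exploding gradients
loss.backward()

# Rescale all gradients so their global L2 norm is at most MAX_NORM.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=MAX_NORM)

total_norm = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))
assert total_norm <= MAX_NORM + 1e-3
```

In mmdetection this is configured declaratively via `optimizer_config = dict(grad_clip=dict(max_norm=..., norm_type=2))` rather than called by hand.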

d-li14 avatar Apr 02 '21 13:04 d-li14

You may try the gradient clipping method, which we also use sometimes when training our detection models; for example: https://github.com/d-li14/involution/blob/main/det/configs/involution/retinanet_red50_neck_fpn_1x_coco.py#L8

Thank you so much.

cymdhx avatar Apr 05 '21 06:04 cymdhx

You may try the gradient clipping method, which we also use sometimes when training our detection models; for example: https://github.com/d-li14/involution/blob/main/det/configs/involution/retinanet_red50_neck_fpn_1x_coco.py#L8

Thank you so much.

[image] Even with gradient clipping it still seems to go NaN.

cymdhx avatar Apr 05 '21 06:04 cymdhx

I replaced the conv in the residual blocks of the super-resolution model EDSR with involution, and I used the gradient clipping method, but the loss is still inf.

songyonger avatar Apr 06 '21 09:04 songyonger

I replaced the conv in the residual blocks of the super-resolution model EDSR with involution, and I used the gradient clipping method, but the loss is still inf.

Have you solved it yet?

cymdhx avatar Apr 07 '21 02:04 cymdhx

Not yet.

songyonger avatar Apr 07 '21 03:04 songyonger

Not yet.

Me neither. We could discuss it.

cymdhx avatar Apr 07 '21 03:04 cymdhx

Not yet.

Me neither. We could discuss it.

Have you solved it by now?

545088212 avatar Apr 14 '21 01:04 545088212

My loss on the training set is fine, but on the validation set some batches are NaN. It's definitely not gradient explosion. I don't know how to locate the problem and debug it.

NNPanNPU avatar Apr 14 '21 11:04 NNPanNPU

My loss on the training set is fine, but on the validation set some batches are NaN. It's definitely not gradient explosion. I don't know how to locate the problem and debug it.

Maybe your dataset is not clean?

songwaimai avatar Apr 15 '21 08:04 songwaimai

I also met this problem in a generation task. I replaced the 3x3 conv with involution, and the loss is NaN or inf.

songwaimai avatar Apr 15 '21 08:04 songwaimai

I also met this problem in a generation task. I replaced the 3x3 conv with involution, and the loss is NaN or inf.

I haven't solved it either, so I'm about to give up on using involution.

cymdhx avatar Apr 15 '21 11:04 cymdhx

I also met this problem in a generation task. I replaced the 3x3 conv with involution, and the loss is NaN or inf.

I haven't solved it either, so I'm about to give up on using involution.

I also tried the gradient clipping method, but the NaN problem was not solved. I will try to find some other methods that may work.

songwaimai avatar Apr 15 '21 15:04 songwaimai

I also met this problem in a generation task. I replaced the 3x3 conv with involution, and the loss is NaN or inf.

I haven't solved it either, so I'm about to give up on using involution.

I also tried the gradient clipping method, but the NaN problem was not solved. I will try to find some other methods that may work.

I tried the gradient clipping method too, but it didn't work. If you have any good methods, please share them. Thank you.

cymdhx avatar Apr 15 '21 15:04 cymdhx

https://github.com/d-li14/involution/issues/26#issuecomment-819443734

I also met the same problem when dealing with the pose estimation task.

lygsbw avatar Apr 15 '21 16:04 lygsbw

I also met this problem in a generation task. I replaced the 3x3 conv with involution, and the loss is NaN or inf.

I haven't solved it either, so I'm about to give up on using involution.

I also tried the gradient clipping method, but the NaN problem was not solved. I will try to find some other methods that may work.

I tried the gradient clipping method too, but it didn't work. If you have any good methods, please share them. Thank you.

ok

songwaimai avatar Apr 16 '21 05:04 songwaimai

[image] When I used involution to replace the CA module in RCAN, the loss was also very large.

LJill avatar Apr 27 '21 01:04 LJill

I replaced the standard conv with involution and added BN; then the loss seems normal. But the final result is worse than the EDSR baseline with a BN layer, even though I increased the parameter count of EDSR-involution. Now I have given up. You can have a try and we can talk. @LJill

songyonger avatar Apr 27 '21 01:04 songyonger

I replaced the standard conv with involution and added BN; then the loss seems normal. But the final result is worse than the EDSR baseline with a BN layer, even though I increased the parameter count of EDSR-involution. Now I have given up. You can have a try and we can talk. @LJill

Thanks for your reply. I tried your method on EDSR and RCAN; it works, and the loss is normal now. I will run experiments to observe the final result.

LJill avatar Apr 27 '21 02:04 LJill

I replaced the standard conv with involution and added BN; then the loss seems normal. But the final result is worse than the EDSR baseline with a BN layer, even though I increased the parameter count of EDSR-involution. Now I have given up. You can have a try and we can talk. @LJill

Thanks for your reply. I tried your method on EDSR and RCAN; it works, and the loss is normal now. I will run experiments to observe the final result.

When I replace the conv with involution and add BN, the train loss seems normal, but the val loss is still NaN. Has this happened to your model?

songwaimai avatar Apr 29 '21 15:04 songwaimai

After I switched to involution, the parameters don't seem to be optimized: the train loss keeps decreasing, but the val loss stays at one constant value. Does anyone know whether this is caused by overfitting or by a code error? I don't think it's overfitting, because the train loss decreases while the val loss barely changes. I still haven't solved this problem.

whf9527 avatar May 05 '21 15:05 whf9527

My loss on the training set is fine, but on the validation set some batches are NaN. It's definitely not gradient explosion. I don't know how to locate the problem and debug it.

What causes this problem? I also met this issue: the train loss keeps improving, but the val loss is unchanged.

whf9527 avatar May 06 '21 04:05 whf9527

I implemented a pure PyTorch 2D involution and faced a similar issue of NaNs occurring during training when using the involution as a plug-in replacement for convolutions. In my case this was caused by exploding activations. For me, the issue could be solved by using a higher momentum (0.3) in the batch normalization layer (after the reduction). I guess the distribution of the activations changes so much that batch norm, with track_running_stats=True and momentum=0.1, cannot follow the changing distribution, resulting in exploding activations. This was my conclusion after looking at the PyTorch batch norm implementation, which also uses the running stats for normalization during training (correct me if I'm wrong).
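To make the suggested fix concrete, here is a minimal pure-PyTorch 2D involution sketch with the batch-norm momentum in the kernel-generation branch raised to 0.3 as described. The module and argument names are illustrative, not the reference implementation's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Involution2d(nn.Module):
    """Minimal 2D involution sketch (stride 1): a per-pixel, channel-shared
    kernel is generated from the input and applied to unfolded patches.
    The BN momentum in the kernel-generation branch is raised to 0.3."""
    def __init__(self, channels, kernel_size=7, groups=16, reduction=4,
                 bn_momentum=0.3):
        super().__init__()
        self.k = kernel_size
        self.groups = groups
        self.reduce = nn.Conv2d(channels, channels // reduction, 1)
        self.bn = nn.BatchNorm2d(channels // reduction, momentum=bn_momentum)
        self.span = nn.Conv2d(channels // reduction,
                              kernel_size * kernel_size * groups, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        # Generate a per-pixel kernel: (B, groups, 1, K*K, H, W)
        kernel = self.span(F.relu(self.bn(self.reduce(x))))
        kernel = kernel.view(b, self.groups, self.k * self.k, h, w).unsqueeze(2)
        # Unfold input patches: (B, groups, C//groups, K*K, H, W)
        patches = F.unfold(x, self.k, padding=self.k // 2)
        patches = patches.view(b, self.groups, c // self.groups,
                               self.k * self.k, h, w)
        # Multiply-accumulate over the K*K kernel positions
        out = (kernel * patches).sum(dim=3)
        return out.view(b, c, h, w)

x = torch.randn(2, 32, 16, 16)
inv = Involution2d(32, kernel_size=7, groups=4)
y = inv(x)
assert y.shape == x.shape
```

Whether momentum=0.3 helps will depend on the task; the point is only where the BN sits and which knob was changed.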

ChristophReich1996 avatar May 14 '21 09:05 ChristophReich1996

When I use it, I get NaN (see attached image). @cymdhx @songwaimai @whf9527

I solved the NaN problem I encountered; here is my solution, though I don't know whether it applies to your cases.

Problem description: NaN/inf appeared after changing Unet + resnet to Unet + rednet50.

Solution: remove the following code from the program; do not manually initialize the BN weight and bias.

def set_bn_init(m):
    classname = m.__class__.__name__
    if classname.find('BatchNorm') != -1:
        m.weight.data.fill_(1.0)
        m.bias.data.fill_(0.0)
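One plausible explanation (my assumption, not confirmed in the thread) is that calling such an initializer after loading pretrained weights silently overwrites the learned BN affine parameters. A minimal sketch of that effect:

```python
import torch.nn as nn

def set_bn_init(m):
    # same initializer as quoted above
    if m.__class__.__name__.find('BatchNorm') != -1:
        m.weight.data.fill_(1.0)
        m.bias.data.fill_(0.0)

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8))
# Pretend the BN affine parameters carry pretrained/learned values.
model[1].weight.data.fill_(0.5)
model.apply(set_bn_init)  # resets them back to 1.0
assert model[1].weight.data.mean().item() == 1.0
```

Note that `nn.BatchNorm2d` already initializes weight to 1 and bias to 0 by default, so the function is redundant at construction time and only destructive afterwards.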

weiguangzhao avatar May 24 '21 08:05 weiguangzhao