
NaN loss in RCAN model

Open Ken1256 opened this issue 5 years ago • 9 comments

https://github.com/wayne391/Image-Super-Resolution/blob/master/src/models/RCAN.py

Just change optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, amsgrad=False) to optimizer = adabound.AdaBound(model.parameters(), lr=1e-4, final_lr=0.1).
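Concretely, the change looks like this (a minimal sketch; the tiny Conv2d and random patches stand in for the RCAN network and data from the linked script):

```python
import torch
import adabound

# Stand-in for the RCAN network built by the linked script.
model = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)

# Before: optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, amsgrad=False)
optimizer = adabound.AdaBound(model.parameters(), lr=1e-4, final_lr=0.1)

criterion = torch.nn.L1Loss(reduction='sum')
lr_patch, hr_patch = torch.randn(16, 3, 48, 48), torch.randn(16, 3, 48, 48)

optimizer.zero_grad()
loss = criterion(model(lr_patch), hr_patch)
loss.backward()
optimizer.step()
```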

The loss becomes NaN in the RCAN model, but Adam works fine.

Ken1256 avatar Mar 02 '19 14:03 Ken1256

Hi! Thanks for sharing the failure case! I will try to reproduce the result using your code. Do you know roughly how many resources it needs for training?

Luolc avatar Mar 03 '19 01:03 Luolc

I found out that AdaBound works fine with torch.nn.L1Loss(reduction='mean'), but with torch.nn.L1Loss(reduction='sum') the loss becomes NaN. (Sorry, after double-checking the code I realized I had changed reduction='mean' to reduction='sum'. Adam works fine with both. Normally 'mean' and 'sum' should behave the same.)
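To illustrate the difference between the two reductions (a toy sketch with random tensors, not the RCAN training code):

```python
import torch

pred = torch.randn(8, 3, 32, 32)    # e.g. a batch of 8 predicted patches
target = torch.randn(8, 3, 32, 32)

loss_mean = torch.nn.L1Loss(reduction='mean')(pred, target)
loss_sum = torch.nn.L1Loss(reduction='sum')(pred, target)

# 'sum' is exactly 'mean' scaled by the number of elements,
# so its gradients are larger by the same factor.
print(loss_sum / loss_mean, pred.numel())  # both are 8*3*32*32 = 24576
```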

The resource usage depends on the image patch_size; setting args.n_resgroups = 3 and args.n_resblocks = 2 will be much faster and use less VRAM.

Ken1256 avatar Mar 03 '19 14:03 Ken1256

Thanks for more details.

In this case, I guess AdaBound is a little sensitive on the RCAN model, and a final_lr of 0.1 is too large. You may try smaller values of final_lr such as 0.03, 0.01, 0.003, etc. But I am not familiar with this model and can't be sure it will work.
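A rough sketch of what such a sweep could look like (make_model is just a placeholder for building the RCAN model from the linked script, and the real training loop goes where the comment is):

```python
import torch
import adabound

def make_model():
    # placeholder for constructing the RCAN model from the linked repo
    return torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)

for final_lr in (0.1, 0.03, 0.01, 0.003):
    model = make_model()
    optimizer = adabound.AdaBound(model.parameters(), lr=1e-4, final_lr=final_lr)
    # ... run the usual training loop here and watch whether the loss stays finite
    print('final_lr =', final_lr)
```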

Luolc avatar Mar 03 '19 14:03 Luolc

I tried optimizer = adabound.AdaBound(model.parameters(), lr=1e-4, final_lr=1e-4) and still get a NaN loss.

Ken1256 avatar Mar 03 '19 14:03 Ken1256

1e-4 might be too small ...

If I understand correctly, the only difference between mean and sum is a scale of N (the number of samples in a step). If AdaBound works with mean, then reducing the learning rate by a factor of N should make it work with sum too. But I am not sure whether to scale lr, final_lr, or both. I just had a discussion with my schoolmates at a seminar today about which matters more in training, the early stage or the final stage. However, we haven't come to a clear answer yet, so for now we have to find out through experiments.
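One way to test that idea (a sketch; the Conv2d is a stand-in for the real network and N is just an example element count; whether to scale lr, final_lr, or both is exactly the open question):

```python
import torch
import adabound

model = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)  # stand-in for the real network
N = 16 * 3 * 48 * 48  # elements that reduction='sum' adds up in one step (example shape)

# worked with reduction='mean':
# optimizer = adabound.AdaBound(model.parameters(), lr=1e-4, final_lr=0.1)

# with reduction='sum' the gradients are N times larger, so scale the rates down
optimizer = adabound.AdaBound(model.parameters(), lr=1e-4 / N, final_lr=0.1 / N)
```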

Luolc avatar Mar 03 '19 15:03 Luolc

> If I understand correctly, the only difference between mean and sum is a scale of N (the count of samples in a step). If AdaBound can work with mean, then reducing the learning rate with a scale of N should work too.

Not exactly correct. Suppose dataset A has 101 samples and the batch size is set to 10. With reduction set to mean there is no problem, but with sum the last batch contains only one sample, and that affects the effective learning rate.
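A toy illustration of that corner case (made-up per-sample losses, just to show the scale difference of the last batch):

```python
# 101 samples with batch size 10 -> the last batch has a single sample.
# With reduction='sum' the summed loss (and its gradient) of that batch is
# roughly 1/10 the size of a full batch, while with reduction='mean' every
# batch produces a comparably scaled loss.
batch_sizes = [10] * 10 + [1]
per_sample_loss = 0.5  # pretend each sample contributes the same loss

sum_losses = [b * per_sample_loss for b in batch_sizes]   # [5.0, ..., 5.0, 0.5]
mean_losses = [per_sample_loss for b in batch_sizes]      # [0.5, ..., 0.5, 0.5]
print(sum_losses[-1], mean_losses[-1])
```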

GreatGBL avatar Mar 04 '19 00:03 GreatGBL

I believe that's a very extreme case. Generally, a single step won't affect the whole training process, in expectation.

In this case, we would encounter a much smaller gradient once per epoch when using sum. If that does affect the training, I think the dataset is too small and SGD would fail as well.

Luolc avatar Mar 04 '19 03:03 Luolc

Hi, I use torch version 0.3.1, and I just modified optimizer = optim.Adam(params, weight_decay=conf.l2, lr=lr, eps=1e-3) to optimizer = adabound.AdaBound(params, weight_decay=conf.l2, lr=lr, final_lr=0.1, eps=1e-3).

When I ran it, it raised ImportError("torch.utils.ffi is deprecated").

Would you help? Thanks

MitraTj avatar Apr 24 '19 10:04 MitraTj

hi, I‘m a beginner, and I have a small question about it: The adabound was inspired by gradient_clip while clipping happens on the lr rather than the gradient. So does it mean that I still need to clip the gradient before feeding it into optimizer to prevent the gradient becoming Nan?

Michael-J98 avatar Jul 04 '20 01:07 Michael-J98