face.evoLVe

Training loss becomes NaN after a few thousand iterations

Status: Open · cenalzw opened this issue 4 years ago · 4 comments

Thanks for your impressive contribution. I ran the 'train.py' script to directly train a ResNet50 model on the 'msceleb_align_112' training set, using ArcFace with Focal loss. However, after about 3000 iterations, the loss value suddenly turned to NaN. I have no idea where the bug might be. Hoping for your help!

Here is my software and hardware environment information:
Python version: 3.6.5
PyTorch version: 1.3.0
CUDA version: 10

GPU: TITAN X (Pascal)

cenalzw · May 07 '20 03:05

I met the same problem. I used MobileNet as the backbone, and the training loss becomes NaN after two epochs, even after reducing the initial learning rate to 0.001.

ChaoLi977 · Jul 06 '20 02:07

Did you solve this problem? I meet the same problem when using Focal loss.

tuoniaoren · Jul 13 '20 06:07

@tuoniaoren @ChaoLi977 I have solved this problem. If you hit it when using the ArcFace loss, the only thing you need to do is change the line 'sine = torch.sqrt(1.0 - torch.pow(cosine, 2))' to 'sine = torch.sqrt(torch.clamp((1.0 - torch.pow(cosine, 2)), 1e-9, 1))'. This keeps the argument of the square root from reaching zero (or dipping slightly below it through rounding error when cosine is close to ±1), which is what makes the gradient blow up to NaN.

cenalzw · Jul 16 '20 01:07
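For anyone who finds this later: here is a minimal sketch of where that one-line change sits inside a typical ArcFace margin head. The class name, parameters, and structure below are illustrative, assumed for the sake of the example rather than copied from the repo's code. The clamp keeps the argument of torch.sqrt inside [1e-9, 1], so neither a negative rounding error nor the infinite gradient of sqrt at zero can push the loss to NaN.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """Hypothetical minimal ArcFace margin head (sketch, not the repo's exact class)."""
    def __init__(self, in_features, out_features, s=64.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)
        self.s = s
        self.cos_m = math.cos(m)
        self.sin_m = math.sin(m)

    def forward(self, embeddings, labels):
        # cos(theta) between L2-normalized embeddings and class weights
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        # The fix: clamp before sqrt. cosine**2 can reach (or numerically exceed) 1,
        # and sqrt has an infinite gradient at 0, either of which yields NaN.
        sine = torch.sqrt(torch.clamp(1.0 - torch.pow(cosine, 2), 1e-9, 1))
        phi = cosine * self.cos_m - sine * self.sin_m  # cos(theta + m)
        one_hot = F.one_hot(labels, num_classes=cosine.size(1)).float()
        # Apply the angular margin only to the ground-truth class, then scale
        logits = self.s * (one_hot * phi + (1.0 - one_hot) * cosine)
        return logits

# Usage sketch: feed the margin logits to the classification loss as usual, e.g.
#   logits = head(embeddings, labels)
#   loss = F.cross_entropy(logits, labels)   # or a focal loss on the same logits
```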

Thanks.

tuoniaoren · Jul 16 '20 01:07