face.evoLVe
Training loss becomes nan after a few thousand iterations
Thanks for your impressive contribution. I ran the 'train.py' script to directly train a ResNet50 model on the 'msceleb_align_112' training set, using ArcFace with Focal loss. However, after about 3000 iterations, the loss suddenly became nan. I have no idea where the bug might be. I hope you can help!
Here is my software and hardware environment information:
- Python version: 3.6.5
- PyTorch version: 1.3.0
- CUDA version: 10
- GPU: TITAN X (Pascal)
I met the same problem. I used MobileNet as the backbone, and the training loss became nan after two epochs, even though I had already reduced the initial learning rate to 0.001.
Did you solve this problem? I am hitting the same issue using Focal loss.
@tuoniaoren @ChaoLi977 I have solved this problem. If you hit it when using ArcFace loss, the only thing you need to do is change the line 'sine = torch.sqrt(1.0 - torch.pow(cosine, 2))' to 'sine = torch.sqrt(torch.clamp((1.0 - torch.pow(cosine, 2)), 1e-9, 1))'. When cosine reaches exactly ±1, floating-point error can make '1.0 - torch.pow(cosine, 2)' zero or slightly negative, so the square root produces nan; clamping keeps its argument strictly positive.
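For anyone who wants to see where the fix lands, here is a minimal, self-contained sketch of the relevant part of an ArcFace-style head. The tensor names and shapes ('features', 'weight', batch of 8, 512-dim embeddings, 1000 classes) are illustrative assumptions, not the repo's exact code:

```python
import torch
import torch.nn.functional as F

# Illustrative stand-ins for the ArcFace head's inputs: 'features' are
# embeddings from the backbone, 'weight' is the classifier weight matrix.
features = torch.randn(8, 512)
weight = torch.randn(1000, 512)

# cos(theta) between each normalized embedding and each normalized class weight.
cosine = F.linear(F.normalize(features), F.normalize(weight))

# Original line: when cosine hits exactly +/-1, floating-point error can make
# 1 - cosine**2 zero or slightly negative, sqrt returns nan, and the loss
# follows a few iterations later.
# sine = torch.sqrt(1.0 - torch.pow(cosine, 2))

# Fixed line: clamp the argument to [1e-9, 1] so sqrt never sees a
# non-positive input.
sine = torch.sqrt(torch.clamp(1.0 - torch.pow(cosine, 2), 1e-9, 1))
```

An equivalent alternative is to clamp 'cosine' itself to [-1 + eps, 1 - eps] before computing 'sine'; either way, the point is that the square root never receives a non-positive argument.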
Thanks.