NaN loss halfway through training the model.
I am using your framework on an image translation task.
The loss was fine at the very beginning, but turned into NaN in Epoch 006. Have you ever run into this problem?
The log information:
Epoch: 005 - 025
Epoch: [5][0/1250] Loss: 0.5269870162010193, LR: 0.00020489077162409578
Epoch: [5][800/1250] Loss: 0.44976134161080017, LR: 0.00022987952270145035
Epoch: [5][900/1250] Loss: 0.44960858015453115, LR: 0.00023297791072928454
Epoch: [5][1000/1250] Loss: 0.4489031859657743, LR: 0.00023607105232487043
Epoch: [5][1100/1250] Loss: 0.45127786015186605, LR: 0.00023915904358565203
Epoch: [5][1200/1250] Loss: 0.45215647128549447, LR: 0.00024224197730360856
Epoch: 005 - 025
====================================================================================================
d1 d2 d3 abs_rel sq_rel rmse rmse_log log10 silog
0.2776 0.6083 0.7993 0.8979 77.8871 60.1445 0.6755 0.2110 0.6634
====================================================================================================
Epoch: 006 - 025
Epoch: [6][0/1250] Loss: 0.4362526535987854, LR: 0.00024378157571854745
Epoch: [6][100/1250] Loss: nan, LR: 0.0002468570902802666
Epoch: [6][200/1250] Loss: nan, LR: 0.0002499277658752283
Epoch: [6][900/1250] Loss: nan, LR: 0.00027129361225600624
Epoch: [6][1000/1250] Loss: nan, LR: 0.0002743283372183186
Epoch: [6][1100/1250] Loss: nan, LR: 0.00027735887967590467
Epoch: [6][1200/1250] Loss: nan, LR: 0.0002803853021768897
The output then shows:
NaN or Inf found in input tensor.
====================================================================================================
d1 d2 d3 abs_rel sq_rel rmse rmse_log log10 silog
0.0000 0.0000 0.0000 0.9861 131.8554 146.8084 11.4800 4.9767 8.1322
====================================================================================================
Epoch: 009 - 025
Epoch: [9][0/1250] Loss: nan, LR: 0.0003563283532129068
Epoch: [9][1200/1250] Loss: nan, LR: 0.00039136562872899835
NaN or Inf found in input tensor.
nan
nan
nan
nan
nan
Something seems to be wrong, but the inputs are fine at the beginning.
I've just double-checked the dataset, and it's correct:
from torch.utils.data import DataLoader
import tqdm

dataloader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=2)
for i, x in enumerate(tqdm.tqdm(dataloader)):
    print(f'Batch {i}:')
    print(x['image'].shape, x['depth'].shape)
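Shapes alone won't catch bad values, though. A quick scan with `torch.isfinite` can rule out NaN/Inf samples in the data itself; here is a minimal sketch, using a small synthetic dataset as a stand-in for the real one:

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDepthDataset(Dataset):
    """Stand-in for the real dataset: returns dicts with 'image' and 'depth'."""

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return {'image': torch.randn(3, 32, 32), 'depth': torch.rand(1, 32, 32)}


dataloader = DataLoader(ToyDepthDataset(), batch_size=4, shuffle=True)
bad_batches = []
for i, x in enumerate(dataloader):
    # torch.isfinite is False for NaN as well as +/-Inf
    if not (torch.isfinite(x['image']).all() and torch.isfinite(x['depth']).all()):
        bad_batches.append(i)
print('bad batches:', bad_batches)
```

If `bad_batches` stays empty on the real dataset too, the NaN most likely originates in the loss or the gradients rather than the inputs.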
Reducing the learning rate or increasing the weight decay might solve this problem, but either could also hurt performance.