PSMNet

training loss = nan on custom dataset

Open passion3394 opened this issue 7 years ago • 10 comments

A slice of the training log looks like the following:

```
Iter 496 training loss = 65.174 , time = 2.39 epoch 41 total training loss = nan
Iter 497 training loss = 124.104 , time = 2.38 epoch 41 total training loss = nan
Iter 498 training loss = 143.243 , time = 2.34 epoch 41 total training loss = nan
Iter 499 training loss = 102.472 , time = 2.36 epoch 41 total training loss = nan
Iter 500 training loss = 54.147 , time = 2.34 epoch 41 total training loss = nan
Iter 501 training loss = 76.837 , time = 2.38 epoch 41 total training loss = nan
Iter 502 training loss = 67.174 , time = 2.36 epoch 41 total training loss = nan
Iter 503 training loss = 58.369 , time = 2.31 epoch 41 total training loss = nan
Iter 504 training loss = 76.735 , time = 2.37 epoch 41 total training loss = nan
Iter 505 training loss = 150.376 , time = 2.33 epoch 41 total training loss = nan
Iter 506 training loss = 76.206 , time = 2.27 epoch 41 total training loss = nan
Iter 0 3-px error in val = 94.037
```

Why does this happen? Could you give some advice?
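One detail worth noting in the log above: every per-iteration loss is finite, yet the running total is already nan. That is just IEEE-754 nan propagation; once any earlier iteration in the epoch produced a nan loss, the accumulated sum stays nan for the rest of the epoch. A minimal sketch (the loss values are taken from the log; the nan iteration is an assumed earlier one):

```python
import math

# Once a single iteration contributes nan, the running total is nan forever,
# even though every later per-iteration loss prints as a finite number.
total = 0.0
for loss in [65.174, float('nan'), 124.104, 143.243]:
    total += loss

print(math.isnan(total))  # True
```

This is why it helps to log `math.isnan(loss)` (or call `torch.isnan` on the loss tensor) per iteration, to find the first batch that actually produced the nan.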

passion3394 avatar Nov 22 '18 07:11 passion3394

@passion3394 Could you give more information about this training, such as the dataset, learning rate, etc.?

JiaRenChang avatar Dec 02 '18 04:12 JiaRenChang

@JiaRenChang Hi, I use the Apollo depth dataset to train the model, and the learning rate is calculated by the formula 'lr = 0.01 * 0.1 ** (epoch // 30)'. Recently I found that the Apollo depth dataset is not epipolar-rectified between the left and right images, and I think that is the main reason the nan appears. Is that right?
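For reference, the quoted schedule is a plain step decay: the learning rate starts at 0.01 and is divided by 10 every 30 epochs. A minimal sketch (the function name is illustrative, not from the repo):

```python
# Step-decay schedule quoted in the comment above:
# lr = 0.01 * 0.1 ** (epoch // 30)
def adjust_lr(epoch):
    """Return the learning rate for a given epoch (divided by 10 every 30 epochs)."""
    return 0.01 * 0.1 ** (epoch // 30)

print(adjust_lr(0))   # 0.01
print(adjust_lr(41))  # second stage: 0.01 / 10
print(adjust_lr(60))  # third stage: 0.01 / 100
```

So at epoch 41, where the log above shows the nan, training is already in the second decay stage (lr ≈ 0.001), which makes a too-large learning rate an unlikely sole cause.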

passion3394 avatar Dec 02 '18 11:12 passion3394

@passion3394 Yes, I thought about it too. It seems that there are only background depth maps in the Apollo dataset.

JiaRenChang avatar Dec 02 '18 14:12 JiaRenChang

@JiaRenChang Yes, the depth of movable objects has been eliminated. Dear author, I have three questions about PSMNet and hope to get answers from you:

(1) If we train with depth maps that contain only background and then apply the trained model to images with movable objects, will the precision be terrible?

(2) If the left and right images have different contrast ratios, will the testing result be worse than with two images of the same contrast ratio?

(3) After epipolar rectification, my left and right images still differ by two pixels in the vertical direction; will that be a very bad factor for testing?

passion3394 avatar Dec 03 '18 07:12 passion3394

@passion3394 (1) I think the precision will be pretty bad, because movable objects usually have large disparities while background depth maps usually have small disparities. That strong imbalance may cause a generalization problem.

(2) and (3) We actually tried testing PSMNet on "real world" image pairs (acquired with web cameras, with weak camera calibration, outdoors), and we could still achieve pretty good results.

JiaRenChang avatar Dec 04 '18 07:12 JiaRenChang

@JiaRenChang Thanks for your reply. I have communicated with the maintainers of the Apollo dataset. They will release a disparity dataset similar to KITTI on top of the Apollo dataset, which will contain more image pairs and much denser disparity maps.

passion3394 avatar Dec 04 '18 09:12 passion3394

@passion3394 @JiaRenChang Hello, I have also used the Apollo depth dataset to train, but I get the error "IndexError: too many indices for tensor of dimension 3". Can you help me?

thank you very much!

hnsywangxin avatar Dec 11 '18 06:12 hnsywangxin

@hnsywangxin Sorry, could you post more of the error info?

passion3394 avatar Dec 11 '18 09:12 passion3394

@passion3394 Thank you for your reply. My error looks like this:

```
Traceback (most recent call last):
  File "finetune.py", line 266, in <module>
    main()
  File "finetune.py", line 231, in main
    loss = train(imgL_crop, imgR_crop, disp_crop_L)
  File "finetune.py", line 169, in train
    loss = 0.5*F.smooth_l1_loss(output1[mask], disp_true[mask], size_average=True) + 0.7*F.smooth_l1_loss(output2[mask], disp_true[mask], size_average=True) + F.smooth_l1_loss(output3[mask], disp_true[mask], size_average=True)
IndexError: too many indices for tensor of dimension 3
```

My input is the depth map of the Apollo dataset, my batch_size = 4, and the other parameters are the same as in the original program.
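A likely reading of this error: the failing line indexes the 3-D outputs with a boolean mask built from `disp_true`, and boolean-mask indexing only works when the mask has no more dimensions than the tensor it indexes. If the Apollo depth PNGs are loaded with an extra channel axis, the mask gains a fourth dimension and the indexing fails. A numpy sketch of that shape mismatch (all shapes are illustrative assumptions, not values from the repo):

```python
import numpy as np

# finetune.py builds a validity mask from the ground truth, roughly:
#   mask = disp_true < maxdisp
#   loss = F.smooth_l1_loss(output1[mask], disp_true[mask], ...)
maxdisp = 192
output = np.random.rand(4, 64, 128) * maxdisp       # model output: (batch, H, W)

# Expected case: single-channel ground truth -> mask matches the output shape.
disp_ok = np.random.rand(4, 64, 128) * 300
mask_ok = disp_ok < maxdisp
valid = output[mask_ok]                             # 1-D vector of valid pixels

# Suspected failure mode: the depth PNG is read with a channel axis, so the
# ground truth is (batch, H, W, 3) and the 4-D mask cannot index a 3-D tensor.
disp_bad = np.random.rand(4, 64, 128, 3) * 300
mask_bad = disp_bad < maxdisp
try:
    output[mask_bad]
except IndexError as err:
    print(err)  # "too many indices ..."
```

If that is the cause, converting the depth image to a single-channel disparity map before building the mask should remove the extra dimension.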

hnsywangxin avatar Dec 14 '18 09:12 hnsywangxin

> @passion3394 thank you for your reply, my error looks like this: [...] IndexError: too many indices for tensor of dimension 3 [...]

Hi @hnsywangxin, have you found a solution? I have the same error.

AddASecond avatar May 12 '21 02:05 AddASecond