GPV_Pose RuntimeError: Function 'DotBackward0' returned nan values in its 0th output.

Open Bingo-1996 opened this issue 2 years ago • 9 comments

Hi~ Thank you for releasing the code. When I run the training code, the loss becomes NaN after several epochs. I have tried three times and hit the same problem each time. I did not modify any parameters. Can you give me some advice?

Screenshot from 2022-06-27 12-00-23

Jun 27 '22 04:06 Bingo-1996

Which version of code are you using? The main branch or the shape prior one?

Jun 27 '22 07:06 lolrudy

I encountered this error before. There are two potential reasons:

1. The sampled point cloud contains no points because the ground-truth mask is erroneous.
2. The bounding box voting process computes a matrix inverse, and a rank-deficient matrix triggers the error. You might need to check the loss value before calling backward on it and mask out NaN.
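
The checks suggested above could look roughly like the sketch below. It is not code from the GPV_Pose repository: the helper names (`has_points`, `is_full_rank`, `backward_if_finite`) are hypothetical, and the N x 3 point cloud layout is an assumption. The loss guard only catches a loss that is already non-finite in the forward pass. A finite loss can still produce NaN gradients during backward.

```python
import torch


def has_points(points: torch.Tensor) -> bool:
    """True if the sampled point cloud (assumed N x 3) holds at least one point."""
    return points.numel() > 0


def is_full_rank(matrix: torch.Tensor) -> bool:
    """True if a square matrix has full rank, so its inverse is well defined."""
    return int(torch.linalg.matrix_rank(matrix)) == matrix.shape[-1]


def backward_if_finite(loss: torch.Tensor, optimizer: torch.optim.Optimizer) -> bool:
    """Run backward and an optimizer step only for a finite loss.

    Returns True if a step was taken, False if the batch was skipped.
    """
    optimizer.zero_grad()
    if not torch.isfinite(loss).all():
        return False  # skip this batch instead of propagating NaN/inf
    loss.backward()
    optimizer.step()
    return True


# Toy usage with a two-element parameter (illustrative values only).
w = torch.tensor([1.0, 2.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

good_loss = (w * w).sum()
stepped_good = backward_if_finite(good_loss, opt)

bad_loss = (w * torch.tensor([float("nan"), 1.0])).sum()
stepped_bad = backward_if_finite(bad_loss, opt)
```

Skipping the whole batch is the most conservative choice. If NaN batches turn out to be frequent, it is probably worth fixing the bad masks or matrices at the source instead.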

Jun 27 '22 07:06 lolrudy

> Which version of code are you using? The main branch or the shape prior one?

I use the main branch

Jun 27 '22 07:06 Bingo-1996

Thank you for your advice. I will try it

Jun 27 '22 07:06 Bingo-1996

@Bingo-1996 can you post how you solved this? or have you solved it? :) thanks in advance

Jul 19 '22 06:07 HannahHaensen

@lolrudy and @Bingo-1996 like that?

# backward
shape = total_loss.shape
total_loss = total_loss.reshape(shape[0], -1)
# Drop all rows containing any nan:
total_loss = total_loss[~torch.any(total_loss.isnan(), dim=1)]
# Reshape back:
total_loss = total_loss.reshape(total_loss.shape[0], *shape[1:])

Jul 21 '22 06:07 HannahHaensen

@HannahHaensen I haven't solved this problem, and there are other problems when I use the shape prior branch. Did you solve the problem?

Jul 21 '22 08:07 Bingo-1996

Yes, this should work. Or you can set all nan values to 0.
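
For comparison, here is a small sketch of the two options, zeroing NaN entries versus dropping them. It assumes the loss is available as a per-sample vector, which may not match how `total_loss` is built in the repository, and the values are made up for illustration. `torch.nan_to_num` only cleans the forward value. Gradients flowing back through the operation that produced the NaN can still be NaN.

```python
import torch

# Hypothetical per-sample losses with one invalid entry.
per_sample_loss = torch.tensor([0.5, float("nan"), 1.5])

# Option 1: replace NaN with 0. The entry still counts in the mean,
# so the averaged loss is pulled toward zero.
zeroed = torch.nan_to_num(per_sample_loss, nan=0.0)
mean_zeroed = zeroed.mean()

# Option 2: drop NaN entries and average only over the valid ones.
valid = ~torch.isnan(per_sample_loss)
mean_over_valid = per_sample_loss[valid].mean()
```

With these values, averaging over valid entries gives 1.0, while the zeroed version averages over all three entries and comes out lower. If every entry is NaN, the second option averages an empty tensor and yields NaN again, so a guard for that case is still needed.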

Jul 21 '22 08:07 lolrudy

@Bingo-1996 Not sure yet. This error occurred for me after ~30 epochs and has not shown up again so far. If the training passes, I can confirm or decline :) I am on the main branch, not the shape prior one.

@lolrudy thanks!

Jul 21 '22 08:07 HannahHaensen
