mmcv icon indicating copy to clipboard operation
mmcv copied to clipboard

Potential bug in calc_square_dist()

Open chyohoo opened this issue 3 years ago • 8 comments

https://github.com/open-mmlab/mmcv/blob/f527e43c1a1e52e5903410255aea1f71f005a161/mmcv/ops/points_sampler.py#L12 This function could return NaN, since the square dist could be negative in some cases.

for example:

a = torch.tensor([[[0.0000, 0.0000, 0.2188, 0.0000, 0.0000]]])
b = torch.tensor([[[0.0000, 0.0000, 0.2189, 0.0000, 0.0000]]])
calc_square_dist(a,b)

chyohoo avatar Aug 05 '22 05:08 chyohoo

Please @ZCMax have a look.

zhouzaida avatar Aug 05 '22 07:08 zhouzaida

I can not reproduce the NaN result using your provided example. It calculates the square dist correctly.

ZCMax avatar Aug 05 '22 08:08 ZCMax

I can not reproduce the NaN result using your provided example. It calculates the square dist correctly.

this is very strange. I can reproduce neither. But I did encounter NaN while using it. image

chyohoo avatar Aug 05 '22 08:08 chyohoo

Hi, what is your mmcv version? I can not reproduce the NaN result either.

image

zhouzaida avatar Aug 09 '22 03:08 zhouzaida

nan_tensor.zip I save two tensors that will casues nan in the zip file. My mmcv version is

MMCV: 1.5.0 MMCV Compiler: GCC 7.3 MMCV CUDA Compiler: 11.0

chyohoo avatar Aug 10 '22 06:08 chyohoo

This seems to be caused by a subtle numerical problem. Because there can be negative values around zero in the results of dist = a_square + b_square - 2 * corr_matrix, i.e., in dist, nan can be produced by computing sqrt(dist). A simple workaround is to add an small number epsilon to dist when computing its square root. Please @ZCMax have a look at whether this modification has other influence or has any effect on current related models.

Tai-Wang avatar Aug 17 '22 03:08 Tai-Wang

This seems to be caused by a subtle numerical problem. Because there can be negative values around zero in the results of dist = a_square + b_square - 2 * corr_matrix, i.e., in dist, nan can be produced by computing sqrt(dist). A simple workaround is to add an small number epsilon to dist when computing its square root. Please @ZCMax have a look at whether this modification has other influence or has any effect on current related models.

yep. using torch.cdist instead not seen nan so far

chyohoo avatar Aug 17 '22 05:08 chyohoo

Great, torch.cdist would be a better solution for this situation, contributions ( PR) are welcome if you have time after checking the performance influence of this modification.

ZCMax avatar Aug 17 '22 07:08 ZCMax