mmcv Potential bug in calc_square

https://github.com/open-mmlab/mmcv/blob/f527e43c1a1e52e5903410255aea1f71f005a161/mmcv/ops/points_sampler.py#L12 This function could return NaN, since the square dist could be negative in some cases.

for example:

a = torch.tensor([[[0.0000, 0.0000, 0.2188, 0.0000, 0.0000]]])
b = torch.tensor([[[0.0000, 0.0000, 0.2189, 0.0000, 0.0000]]])
calc_square_dist(a,b)

Aug 05 '22 05:08 chyohoo

Please @ZCMax have a look.

Aug 05 '22 07:08 zhouzaida

I can not reproduce the NaN result using your provided example. It calculates the square dist correctly.

Aug 05 '22 08:08 ZCMax

I can not reproduce the NaN result using your provided example. It calculates the square dist correctly.

this is very strange. I can reproduce neither. But I did encounter NaN while using it.

Aug 05 '22 08:08 chyohoo

Hi, what is your mmcv version? I can not reproduce the NaN result either.

Aug 09 '22 03:08 zhouzaida

nan_tensor.zip I save two tensors that will casues nan in the zip file. My mmcv version is

MMCV: 1.5.0 MMCV Compiler: GCC 7.3 MMCV CUDA Compiler: 11.0

Aug 10 '22 06:08 chyohoo

This seems to be caused by a subtle numerical problem. Because there can be negative values around zero in the results of dist = a_square + b_square - 2 * corr_matrix, i.e., in dist, nan can be produced by computing sqrt(dist). A simple workaround is to add an small number epsilon to dist when computing its square root. Please @ZCMax have a look at whether this modification has other influence or has any effect on current related models.

Aug 17 '22 03:08 Tai-Wang

This seems to be caused by a subtle numerical problem. Because there can be negative values around zero in the results of dist = a_square + b_square - 2 * corr_matrix, i.e., in dist, nan can be produced by computing sqrt(dist). A simple workaround is to add an small number epsilon to dist when computing its square root. Please @ZCMax have a look at whether this modification has other influence or has any effect on current related models.

yep. using torch.cdist instead not seen nan so far

Aug 17 '22 05:08 chyohoo

Great, torch.cdist would be a better solution for this situation, contributions ( PR) are welcome if you have time after checking the performance influence of this modification.

Aug 17 '22 07:08 ZCMax

mmcv
mmcv copied to clipboard

Potential bug in calc_square_dist()

mmcv mmcv copied to clipboard

Potential bug in calc_square_dist()

mmcv
mmcv copied to clipboard