Lanczos init introducing NaNs into DNDarray before torch.eig call

Open coquelin77 opened this issue 4 years ago • 5 comments

To Reproduce Steps to reproduce the behavior:

  1. Which module/class/function is affected?
    • spectral
  2. What are the circumstances under which the bug appears?
    • fitting the iris dataset. Only happens sometimes, when the tests run on 7 processes. Fails on Travis, but not on a local machine
  3. What is the exact error message / erroneous behavior?
        V, T = ht.lanczos(L, self.n_lanczos, v0)

        # 4. Calculate and Sort Eigenvalues and Eigenvectors of tridiagonal matrix T
>       eval, evec = torch.eig(T._DNDarray__array, eigenvectors=True)

E       RuntimeError: invalid argument 1: A should not contain infs or NaNs at /pytorch/aten/src/TH/generic/THTensorLapack.cpp:208

Version Info Possibly occurring due to torch 1.6.0 release

coquelin77 avatar Aug 13 '20 11:08 coquelin77

I have hit exactly this issue multiple times. The bug does not seem to be in torch.eig itself; it occurs somewhere in the Lanczos iterations.

Since this only happens with specific configurations and parts of my dataset, I suspect numerical instabilities to be the cause. A good way to communicate the problem to the user would be a check for inf/NaN after the Lanczos iterations that raises an error or warning telling the user that numerical instabilities were encountered.
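
A minimal sketch of such a check, under stated assumptions: the helper names `count_non_finite` and `check_lanczos_output` are hypothetical and not part of Heat, and the matrix is passed as a plain nested list (for example the result of calling `.tolist()` on the local tensor) so the snippet runs without torch. Inside Heat, a check based on `torch.isfinite` on the local tensor would probably be the more natural choice.

```python
import math
import warnings


def count_non_finite(matrix):
    """Return the number of inf/NaN entries in a nested list of floats."""
    return sum(1 for row in matrix for value in row if not math.isfinite(value))


def check_lanczos_output(T, strict=True):
    """Raise (strict=True) or warn (strict=False) if T contains inf or NaN.

    Hypothetical helper: T stands for the tridiagonal matrix returned by the
    Lanczos iterations, converted to a nested list of floats.
    Returns True if every entry is finite, False if a warning was issued.
    """
    bad = count_non_finite(T)
    if bad == 0:
        return True
    message = (
        f"Lanczos produced {bad} non-finite entries in T; "
        "numerical instabilities were likely encountered"
    )
    if strict:
        raise ArithmeticError(message)
    warnings.warn(message, RuntimeWarning)
    return False
```

Called right after the Lanczos step and before the eigendecomposition, this would turn the opaque LAPACK error into a message that points at the Lanczos iterations. Whether an error or a warning fits better is a design choice; the `strict` flag leaves both options open.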

Possible fix: changing the gamma of the RBF kernel helped in my case.
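
To illustrate why gamma can matter numerically, here is a small sketch that assumes the common RBF convention exp(-gamma * ||x - y||^2); the function name `rbf_affinity` is hypothetical, and the link to the NaNs in this issue is a guess that was not confirmed in this thread. With a large gamma, the affinity between distant points underflows to exactly zero, which can leave the similarity graph (and hence the Laplacian passed to Lanczos) degenerate.

```python
import math


def rbf_affinity(x, y, gamma):
    """RBF similarity exp(-gamma * ||x - y||^2) for two equal-length points."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


p, q = (0.0, 0.0), (3.0, 4.0)   # squared distance is 25

small = rbf_affinity(p, q, gamma=0.1)    # exp(-2.5), small but clearly nonzero
large = rbf_affinity(p, q, gamma=100.0)  # exp(-2500), underflows to 0.0
print(small, large)
```

A gamma that keeps most affinities away from both 0 and 1 is less likely to produce such degenerate inputs.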

sebimarkgraf avatar Feb 19 '21 11:02 sebimarkgraf

Is this still a problem @coquelin77 ?

ClaudiaComito avatar Apr 04 '22 09:04 ClaudiaComito

I cannot reproduce the error in

mpirun -np 7 python -m unittest -vf heat/cluster/tests/test_spectral.py

after removing the restriction to MPI.COMM_WORLD.size < 7...

mrfh92 avatar Aug 21 '23 10:08 mrfh92

Since I cannot reproduce the error anymore, I opened a PR to remove the restriction of the tests to <7 processes.

Independent of whether this works, reviewed within #1109

mrfh92 avatar Aug 21 '23 10:08 mrfh92