Lanczos init introducing NaNs into DNDarray before torch.eig call
To Reproduce Steps to reproduce the behavior:
- Which module/class/function is affected?
- spectral
- What are the circumstances under which the bug appears?
- Fitting the iris dataset. The failure is intermittent and only shows up when the tests run on 7 processes. Travis fails; the local machine does not.
- What is the exact error message / erroneous behavior?
V, T = ht.lanczos(L, self.n_lanczos, v0)
# 4. Calculate and Sort Eigenvalues and Eigenvectors of tridiagonal matrix T
> eval, evec = torch.eig(T._DNDarray__array, eigenvectors=True)
E RuntimeError: invalid argument 1: A should not contain infs or NaNs at /pytorch/aten/src/TH/generic/THTensorLapack.cpp:208
Version Info: possibly occurring due to the torch 1.6.0 release
I had exactly this issue multiple times.
The bug does not seem to be in torch.eig itself;
it occurs somewhere in the Lanczos iterations.
Since this only happens with specific configurations and parts of my dataset, I suspect numerical instabilities to be the cause.
A good way to communicate the problem to the user would be to check for inf/NaN
after the Lanczos iterations and raise an error or warning stating that numerical instabilities were encountered.
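The check suggested above could look roughly like the following. This is only a minimal sketch, not heat's actual code: the helper names `all_finite` and `check_lanczos_output` are hypothetical, and plain nested lists stand in for the tridiagonal matrix so the snippet runs without torch. On the real tensor, the same test could be written with `torch.isfinite(...).all()`.

```python
import math
import warnings


def all_finite(matrix):
    """Return True if every entry of a nested list of floats is finite."""
    return all(math.isfinite(x) for row in matrix for x in row)


def check_lanczos_output(T, name="T"):
    """Warn and return False if `T` contains inf/NaN, otherwise return True.

    Hypothetical helper: `T` stands in for the tridiagonal matrix
    produced by the Lanczos iterations.
    """
    if all_finite(T):
        return True
    warnings.warn(
        f"{name} contains inf/NaN after the Lanczos iterations; "
        "numerical instabilities were encountered.",
        RuntimeWarning,
    )
    return False


# Made-up example matrices, not taken from the failing test.
good = [[2.0, 1.0], [1.0, 2.0]]
bad = [[2.0, float("nan")], [float("nan"), 2.0]]

ok_good = check_lanczos_output(good)
ok_bad = check_lanczos_output(bad)  # emits a RuntimeWarning
```

Running such a check before the eigendecomposition would replace the opaque LAPACK message with one that points at the Lanczos step.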
Possible fixes: Changes to the gamma of the RBF helped in my case.
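One possible reason gamma matters, offered as an assumption rather than a confirmed diagnosis of this bug: an RBF affinity exp(-gamma * d^2) underflows to exactly 0.0 when gamma * d^2 is large, so a sample can end up with zero total affinity to all others. Any later normalization by that sum would then divide by zero. The distances and gamma values below are made up for illustration.

```python
import math


def rbf_row_sum(sq_dists, gamma):
    """Sum of RBF affinities exp(-gamma * d^2) over one row of squared distances."""
    return sum(math.exp(-gamma * d2) for d2 in sq_dists)


# Made-up squared distances from one sample to three others.
sq_dists = [400.0, 900.0, 1600.0]

moderate = rbf_row_sum(sq_dists, gamma=0.01)  # small but positive
extreme = rbf_row_sum(sq_dists, gamma=10.0)   # every term underflows to 0.0

print(moderate > 0.0, extreme == 0.0)
```

In a tensor library, dividing by such a zero sum typically yields inf, and inf multiplied by 0 yields NaN. That would be consistent with the failure depending on specific parts of the dataset.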
Is this still a problem, @coquelin77?
I cannot reproduce the error in
mpirun -np 7 python -m unittest -vf heat/cluster/tests/test_spectral.py
after removing the restriction to MPI.COMM_WORLD.size < 7...
Since I cannot reproduce the error anymore, I opened a PR to remove the restriction of the tests to <7 processes.
Independent of whether this works, reviewed within #1109