statsample-glm
NotRegularMatrix exception for certain dataframes
Statsample::GLM.compute is failing for certain dataframes.
> try = Daru::DataFrame.from_csv 'try.csv'
> Statsample::GLM.compute try, 'y', :logistic
ExceptionForMatrix::ErrNotRegular: Not Regular Matrix
from /home/ubuntu/.rvm/gems/ruby-2.2.3/gems/backports-3.6.8/lib/backports/1.9.2/stdlib/matrix.rb:933:in `block in inverse_from'
Get dataframe used in the above code here
Weird bug.
I think it's failing because a matrix inverse is being computed and the determinant is possibly very close to zero, which would explain the ErrNotRegular exception. If I'm right, changing the matrix inverse computation algorithm should make it work.
Here's some info I found.
I printed all the matrices whose inverse the algorithm was computing. Here's the result:
...
Matrix[[-8.459899447643453e-14, -5.75239855749016e-12], [-5.75239855749016e-12, -10927.800950741155]]
Matrix[[-3.1308289294429086e-14, -2.128675014034775e-12], [-2.128675014034775e-12, -10927.800950740906]]
Matrix[[-1.1546319456101584e-14, -7.842171356742226e-13], [-7.842171356742226e-13, -10927.800950740813]]
Matrix[[-4.218847493575589e-15, -2.865041537347675e-13], [-2.865041537347675e-13, -10927.800950740779]]
Matrix[[-1.3322676295501873e-15, -8.997247391562266e-14], [-8.997247391562266e-14, -10927.800950740766]]
Matrix[[-6.661338147750937e-16, -4.4986236957811335e-14], [-4.4986236957811335e-14, -10927.800950740762]]
Matrix[[-0.0, -0.0], [-0.0, -10927.80095074076]]
ExceptionForMatrix::ErrNotRegular: Not Regular Matrix
from /home/ubuntu/.rvm/gems/ruby-2.2.3/gems/backports-3.6.8/lib/backports/1.9.2/stdlib/matrix.rb:933:in `block in inverse_from'
In the end it tries to compute the inverse of Matrix[[-0.0, -0.0], [-0.0, -10927.80095074076]], which is singular (its determinant is zero), so no inverse exists.
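The failure can be reproduced in isolation with Ruby's standard Matrix library, without statsample-glm or the CSV file. This is only a minimal sketch: it takes the last matrix from the log above and assumes nothing about how the GLM code arrives at it.

```ruby
require 'matrix'

# The last matrix printed before the exception in the log above.
singular = Matrix[[-0.0, -0.0], [-0.0, -10927.80095074076]]

# The first row is all zeros, so the determinant is zero.
det = singular.determinant

# Matrix#inverse raises when it cannot find a non-zero pivot.
error = begin
  singular.inverse
  nil
rescue ExceptionForMatrix::ErrNotRegular => e
  e
end

puts "determinant: #{det}"
puts "inverse raised: #{error.class}" if error
```

The earlier matrices in the log still have a tiny non-zero top-left entry, so they can be inverted. Only the final iterate, where that entry has reached exactly zero, triggers the exception.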
@agisga might this be an issue with the algorithm or is it loss of precision in some of the calculations?
It seems to me that the algorithm is theoretically okay, because it gives correct results most of the time. Maybe it fails because it accumulates numerical error quickly, when the input matrix is not well conditioned.
In particular, since you mention matrix inverses, it sounds to me like the algorithm is not well optimized. It should be changed so that linear systems are solved instead of computing matrix inverses (here is a very concise summary why). Solving a linear system is faster and numerically more stable than finding a matrix inverse.
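To illustrate the difference between the two approaches, here is a small sketch using the LUP decomposition from Ruby's standard Matrix library. The matrix and vector are made up for illustration, and this is not the statsample-glm code. How such a solve would fit into the GLM fitting loop is an assumption on my part.

```ruby
require 'matrix'

# Hypothetical, well-conditioned 2x2 system a * x = b, for illustration only.
a = Matrix[[4.0, 3.0], [6.0, 3.0]]
b = Vector[10.0, 12.0]

# Approach 1: form the inverse explicitly, then multiply.
x_via_inverse = a.inverse * b

# Approach 2: factor a once (LU with partial pivoting) and solve directly.
x_via_solve = a.lup.solve(b)

puts "via inverse: #{x_via_inverse}"
puts "via solve:   #{x_via_solve}"
```

Note that LUPDecomposition#solve also raises for an exactly singular matrix, so switching to a solver would not by itself rescue the all-zero case above. It would only help with the loss of accuracy on nearly singular matrices.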
Unfortunately, I don't have the time to look at the algorithm in detail right now, but I hope to get to it eventually. It would probably be best to rewrite it so that it uses the matrix decompositions and linear solvers provided by nmatrix-lapacke.
Thanks for the explanation. I'm getting the same error, in case another example is helpful. Data is available here: https://dl.dropboxusercontent.com/u/97188721/recruitment_failures.csv
data = Daru::DataFrame.from_csv 'recruitment_failures.csv'
glm = Statsample::GLM.compute data, 'failed_recruitment', :logistic