sparsesvd Invalid shape in axis

Hello!

I was wondering about the following behaviour, which made one of my tests break in a recent sparsesvd update.

>>> C = numpy.array([[ 0. ,  0.7,  0. ,  0. ],
                                 [ 1. ,  0. ,  0.7,  0. ],
                                 [ 0. ,  0.7,  0. ,  1. ],
                                 [ 0. ,  0. ,  0.7,  0. ]])
>>> K = scipy.sparse.csc_matrix(C)
>>> sparsesvd(K, 2)
ValueError: Invalid shape in axis 0: 0.

Not sure what is going on here - changing any value in this matrix seems to make it work.

Best, y

Mar 12 '13 10:03 moustaki

I can reproduce this in 0.1.9, thanks for the report y.

@alepulver this error doesn't happen in 0.1.8 -- any idea where the regression's coming from?

Mar 12 '13 18:03 piskvorky

This seems to happen with any matrix equivalent to numpy.diag(a,b,b,a) when a and b are lower than 20. I believe svdLAS2SA could be returning something different in these cases. What did it return in 0.1.8 where it worked?

Edit: found it. In these cases, srec.d is 0 (the length of "s". as returned by SVDLIBC). What should we do with the result?

Mar 12 '13 19:03 alepulver

0.1.8 returns

(array([[ 0.32372233,  0.4780341 ,  0.79283656,  0.19518564],
       [-0.19518564,  0.79283656, -0.4780341 ,  0.32372233]]),
 array([ 1.30002747,  1.30002747]),
 array([[ 0.36771077,  0.60121131,  0.36249528,  0.60986139],
       [ 0.60986139, -0.36249528,  0.60121131, -0.36771077]]))

(same machine)

Mar 12 '13 19:03 piskvorky

Could you please confirm that svdResult->d is 2, as it should be?

Edit: don't worry, I don't think that's the case. I'm getting d=0 and S=[0,0] from SVDLIBC. Either I'm calling it incorrectly, or something very wrong is going on.

Were there any special compilation flags I removed when switching to Cython?

Mar 12 '13 19:03 alepulver

When enabling SVDVerbose in svdlib.c, I get the following output when it works:

In [2]: sparsesvd.sparsesvd(scipy.sparse.csc_matrix(np.diag([1,15,15.1,1])), 2)
SOLVING THE [A^TA] EIGENPROBLEM
NO. OF ROWS               =      4
NO. OF COLUMNS            =      4
NO. OF NON-ZERO VALUES    =      4
MATRIX DENSITY            =  25.00%
MAX. NO. OF LANCZOS STEPS =      4
MAX. NO. OF EIGENPAIRS    =      2
LEFT  END OF THE INTERVAL = -1.00E-30
RIGHT END OF THE INTERVAL =  1.00E-30
KAPPA                     =  1.00E-06

NUMBER OF LANCZOS STEPS   =      4
RITZ VALUES STABILIZED    =      4
SINGULAR VALUES FOUND     =      2

And this when it doesn't.

In [3]: sparsesvd.sparsesvd(scipy.sparse.csc_matrix(np.diag([1,15,15,1])), 2)                                                         
SOLVING THE [A^TA] EIGENPROBLEM
NO. OF ROWS               =      4
NO. OF COLUMNS            =      4
NO. OF NON-ZERO VALUES    =      4
MATRIX DENSITY            =  25.00%
MAX. NO. OF LANCZOS STEPS =      4
MAX. NO. OF EIGENPAIRS    =      2
LEFT  END OF THE INTERVAL = -1.00E-30
RIGHT END OF THE INTERVAL =  1.00E-30
KAPPA                     =  1.00E-06

NUMBER OF LANCZOS STEPS   =      4
RITZ VALUES STABILIZED    =      0
SINGULAR VALUES FOUND     =      0

Do you have any ideas, @piskvorky ?

Mar 12 '13 20:03 alepulver

Heh, might well be connected to the "fewer factors than requested" discussed elsewhere...

@alepulver Is the output deterministic for you (0.1.8 works, 0.1.9 doesn't)? Or does it fail intermittently even on the same release?

Mar 12 '13 21:03 piskvorky

Sorry for the delay. I don't know if it's related, but at least 0.1.8 behaves the same in my machine. The only difference is that it doesn't throw an exception, but returns an array with invalid shape (so it's SVDLIBC returning no singular values).

This seems a SVDLIBC problem.

In [1]: import numpy as np, sparsesvd, scipy.sparse

In [2]: sparsesvd.sparsesvd(scipy.sparse.csc_matrix(np.diag([1,2,2,1])), 2)
Out[2]: 
(array([], shape=(0, 4), dtype=float64),
 array([], dtype=float64),
 array([], shape=(0, 4), dtype=float64))

In [3]: sparsesvd.sparsesvd(scipy.sparse.csc_matrix(np.diag([1,2,2.1,1])), 2)
Out[3]: 
(array([[ -2.58143374e-20,   8.98751972e-16,  -1.00000000e+00,
         -1.32169408e-17],
       [ -6.77626358e-21,  -1.00000000e+00,  -8.16013923e-16,
         -2.77555756e-17]]),
 array([ 2.1,  2. ]),
 array([[ -5.42101086e-20,   9.43689571e-16,  -1.00000000e+00,
         -2.77555756e-17],
       [ -1.35525272e-20,  -1.00000000e+00,  -7.77156117e-16,
         -5.55111512e-17]]))

Mar 20 '13 20:03 alepulver

Aha, ok, thanks. I'll have to investigate more thoroughly then :)

I'll check to see if vanilla SVDLIBC behaves the same way too (no Python), and then it's debugging time...

Mar 21 '13 21:03 piskvorky

I debugged, and the problem seems to be somewhere deep inside LAS2. The error bounds (bnd array coming out of lanso) are rubbish in face of algebraic multiplicity, although the ritz values are fine. The large bounds lead to ritz values being classified as "unstabilized" => errors later on.

This appears to be the root of all the other reported issues with sparsesvd as well: #1, #3 .

I tried comparing SVDLIBC (=Doug Rohde's version) against SVDPACKC (the original Michael Berry's version), but didn't notice any relevant difference in the code. So this possibly happens with SVDPACKC as well, which is really strange.

Apr 07 '13 18:04 piskvorky

I sent an email to Doug Rohde, hopefully he'll shed some light.

Apr 07 '13 18:04 piskvorky

Doug's reply:

"I haven't worked on SVDLIBC in years. And it was a port of someone else's code, so I never fully understood the majority of it. Mostly I was fixing bugs and improving memory management. So I probably can't offer that much about the algorithm, other than trying to read and understand the code."

Apr 08 '13 19:04 piskvorky

Thank you for spending so much time on this issue. I'm not really into numerical methods, so I may not be able to help. But maybe there is a paper about the algorithm, or newer ones. As for related Python modules I've found:

Divisi2 (which doesn't support Python 3.x) http://csc.media.mit.edu/docs/divisi2/

sklearn (which also includes recent references to papers) http://scikit-learn.org/stable/modules/decomposition.html

mlpy (it's not clear if it supports sparse matrices efficiently) http://mlpy.sourceforge.net/docs/3.5/dim_red.html#principal-component-analysis-pca

They mention PCA, but AFAIK it's related to the SVD of the normalized and rescaled matrix (or the covariance matrix), so I thought it might be useful.

Apr 09 '13 17:04 alepulver

Divisi2 uses SVDLIBC too, and therefore suffers from the same problems.

I already created and maintain a package for large-scale sparse SVD (gensim), so this is not a big issue for me. But to tell the truth, I am not thrilled about maintaining sparsesvd if it contains bugs that nobody understands and nobody can fix...

Would you like to become the principal maintainer of sparsesvd, @alepulver ?

Apr 09 '13 21:04 piskvorky

Or let's put it differently -- the person who can successfully resolve this issue will be crowned the king of sparsesvd :)

Apr 09 '13 21:04 piskvorky

I don't think I can debug it. Maybe an exception could be raised in those cases, or adding a tiny random noise to make it work (at least close to the real result). What do you think of this "hacky" solution?

Apr 10 '13 00:04 alepulver

I don't think that's a good idea @alepulver . Partly because we don't even know what "those cases" are.

I'll try to delve into the details and fix this, but I can't promise a date. For now (and the past 10 years?), it looks like the library will remain broken.

Apr 10 '13 19:04 piskvorky

Is this behaviour related ? I'm getting the very same (weird) result (no eigenvalues, empty U) with vanilla svdlibc (now I'm looking around for patches):

import numpy, scipy.sparse from sparsesvd import sparsesvd smat = scipy.sparse.csc_matrix((2, 2)) smat[1,1]=1.0 u, s, vt = sparsesvd(smat, 2) Traceback (most recent call last): File "", line 1, in File "sparsesvd.pyx", line 39, in sparsesvd.sparsesvd (sparsesvd.c:2239) File "stringsource", line 247, in View.MemoryView.array_cwrapper (sparsesvd.c:5951) File "stringsource", line 147, in View.MemoryView.array.cinit (sparsesvd.c:4887) ValueError: Invalid shape in axis 0: 0. u, s, vt = sparsesvd(smat, 1) Traceback (most recent call last): File "", line 1, in File "sparsesvd.pyx", line 39, in sparsesvd.sparsesvd (sparsesvd.c:2239) File "stringsource", line 247, in View.MemoryView.array_cwrapper (sparsesvd.c:5951) File "stringsource", line 147, in View.MemoryView.array.cinit (sparsesvd.c:4887) ValueError: Invalid shape in axis 0: 0. u, s, vt = sparsesvd(smat, 0) Traceback (most recent call last): File "", line 1, in File "sparsesvd.pyx", line 39, in sparsesvd.sparsesvd (sparsesvd.c:2239) File "stringsource", line 247, in View.MemoryView.array_cwrapper (sparsesvd.c:5951) File "stringsource", line 147, in View.MemoryView.array.cinit (sparsesvd.c:4887) ValueError: Invalid shape in axis 0: 0. smat[0,0]=1.0 u, s, vt = sparsesvd(smat, 2) u array([[ 3.12553483e-04, 9.99999951e-01]])

Jul 08 '13 13:07 pminervini

@pminervini almost certainly; let us know if you find something 8)

Jul 14 '13 21:07 piskvorky

@piskvorky Maybe you can be intersted in GraphLab (for distributed environments) and GraphChi (for multicore machines): http://graphlab.org/graphchi/ they both include an highly efficient implementation of SVD (and other algorithms); AFAIK, GraphChi allows you to do the same things as SVDLIBC (possibly more efficiently), without the need to dive deep into ancient C code

Jul 14 '13 21:07 pminervini

@pminervini GraphiChi sounds great, thanks for the link! Also check out gensim for fast SVD (gensim targets topic modelling though, not collab filtering).

Jul 16 '13 08:07 piskvorky

I had the "ValueError: Invalid shape in axis 0: 0." . It happens when I blend two matrixes and the one has vector weights as 0. I added a slight offset 0.0000001 and it works now.

May 15 '15 19:05 panamantis

sparsesvd sparsesvd copied to clipboard

Invalid shape in axis

sparsesvd
sparsesvd copied to clipboard