
Inaccuracy in randomized SVD at moderate ranks

Open stanleyjs opened this issue 2 years ago • 7 comments

Hi,

Randomized SVD is not accurate

Currently most (if not all) of the PCA / linear dimensionality reduction / SVD in graphtools is routed through either TruncatedSVD or PCA(svd_solver='randomized'). It turns out that this solver can be quite inaccurate even at moderate ranks. Consider this pathological example, in which we create a 1000 x 500 matrix whose spectrum is np.hstack([np.zeros(249,),np.arange(250,501)]). The matrix is approximately rank 250 (strictly, it has 251 nonzero singular values). We will also consider its rank-50 reconstruction and its rank-1 approximation.

import numpy as np
from sklearn.decomposition import TruncatedSVD

L2_sv_error = np.zeros(10,)
F_lowrank_error = np.zeros(10,)
k = 50
r = 250
lowrank_r = 50
m = 1000
n = 500
for ix in range(10):
    x = np.random.randn(m,n)
    ux, _ , vhx = np.linalg.svd(x,full_matrices=False) # ux is 1000 x 500 with orthonormal columns, vhx is 500 x 500 orthogonal
    sx = np.hstack([np.zeros(r-1,),np.arange(r,n+1)])[::-1] # the ground truth singular values we will use
    sx_lowrank = np.hstack([np.zeros(n-lowrank_r-1,),np.arange(n-lowrank_r,n+1)])[::-1] # the spectrum of the best rank-50 approximation
    y = (ux*sx)@vhx # the ground truth matrix we want to approximate with SVD/PCA
    y_lowrank = (ux*sx_lowrank) @ vhx # the best rank-50 approximation if we had the exact spectrum
    pca_operator = TruncatedSVD(k) 
    pca_operator.fit(y)
    sy = pca_operator.singular_values_ # our estimate of the singular values
    y_transformed = pca_operator.transform(y)
    y_reconstructed = y_transformed@pca_operator.components_ # the previous two operations are equivalent to pca_operator.inverse_transform(pca_operator.transform(y)), this is our estimated rank 50 approximation
    L2_sv_error[ix] = np.linalg.norm(sx[:lowrank_r]-sy) # how close did we get to estimating the top 50 singular values in L2?
    F_lowrank_error[ix] = np.linalg.norm(y_lowrank-y_reconstructed) # how close is our rank 50 approximation?
>>> print(np.mean(L2_sv_error))
55.77929710543534
>>> print(np.mean(F_lowrank_error))
1930.877814905553

It is clear that there is a problem.

It turns out that if we increase k, the estimate gets better:

L2_sv_error = np.zeros(10,)
F_lowrank_error = np.zeros(10,)
k = 100
r = 250
lowrank_r = 50
m = 1000
n = 500
for ix in range(10):
    x = np.random.randn(m,n)
    ux, _ , vhx = np.linalg.svd(x,full_matrices=False) # ux is 1000 x 500 with orthonormal columns, vhx is 500 x 500 orthogonal
    sx = np.hstack([np.zeros(r-1,),np.arange(r,n+1)])[::-1] # the ground truth singular values we will use
    sx_lowrank = np.hstack([np.zeros(n-lowrank_r-1,),np.arange(n-lowrank_r,n+1)])[::-1] # the spectrum of the best rank-50 approximation
    y = (ux*sx)@vhx # the ground truth matrix we want to approximate with SVD/PCA
    y_lowrank = (ux*sx_lowrank) @ vhx # the best rank-50 approximation if we had the exact spectrum
    pca_operator = TruncatedSVD(k) 
    pca_operator.fit(y)
    sy = pca_operator.singular_values_ # our estimate of the singular values
    y_transformed = pca_operator.transform(y)
    y_reconstructed = y_transformed[:,:lowrank_r]@pca_operator.components_[:lowrank_r,:] # keep only the top lowrank_r of the k fitted components; this is our estimated rank 50 approximation
    L2_sv_error[ix] = np.linalg.norm(sx[:lowrank_r]-sy[:lowrank_r]) # how close did we get to estimating the top 50 singular values in L2?
    F_lowrank_error[ix] = np.linalg.norm(y_lowrank-y_reconstructed) # how close is our rank 50 approximation?
>>> print(np.mean(L2_sv_error))
7.755560034045646
>>> print(np.mean(F_lowrank_error))
729.3025096061267

We can also decrease the rank of the underlying approximation to get better accuracy. What is happening here is that randomized_svd becomes more accurate as the number of requested singular vectors grows relative to the rank of the matrix: as n_components approaches (and exceeds) the rank, the algorithm gets more accurate. Finally, let's look at the extreme case and compare rank-1 approximations. The task here is to estimate only a single singular pair.

L2_sv_error = np.zeros(10,)
F_lowrank_error = np.zeros(10,)
k = 1
r = 250
lowrank_r = 1
m = 1000
n = 500
for ix in range(10):
    x = np.random.randn(m,n)
    ux, _ , vhx = np.linalg.svd(x,full_matrices=False) # ux is 1000 x 500 with orthonormal columns, vhx is 500 x 500 orthogonal
    sx = np.hstack([np.zeros(r-1,),np.arange(r,n+1)])[::-1] # the ground truth singular values we will use
    sx_lowrank = np.hstack([np.zeros(n-lowrank_r-1,),np.arange(n-lowrank_r,n+1)])[::-1] # the spectrum of the best rank-1 approximation
    y = (ux*sx)@vhx # the ground truth matrix we want to approximate with SVD/PCA
    y_lowrank = (ux*sx_lowrank) @ vhx # the best rank-1 approximation if we had the exact spectrum
    pca_operator = TruncatedSVD(k) 
    pca_operator.fit(y)
    sy = pca_operator.singular_values_ # our estimate of the singular values
    y_transformed = pca_operator.transform(y)
    y_reconstructed = y_transformed[:,:lowrank_r]@pca_operator.components_[:lowrank_r,:] # this is our estimated rank 1 approximation
    L2_sv_error[ix] = np.linalg.norm(sx[:lowrank_r]-sy[:lowrank_r]) # how close did we get to estimating the top singular value in L2?
    F_lowrank_error[ix] = np.linalg.norm(y_lowrank-y_reconstructed) # how close is our rank 1 approximation?
>>> print(np.mean(L2_sv_error))
10.74597935589037
>>> print(np.mean(F_lowrank_error))
772.1048200820816

We can make the algorithm more accurate

It turns out that there are many edge cases in which randomized SVD will fail: the matrix is too large, ill-conditioned, its rank is too high, and so on. However, a few parameters of the underlying routine, sklearn.utils.extmath.randomized_svd, can be tweaked to make things more accurate. The most important one is n_oversamples, followed by n_iter.
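For example, calling randomized_svd directly on the same pathological matrix with a larger n_oversamples and more power iterations recovers the top 50 singular values far more accurately. This is a minimal sketch; the settings n_oversamples=200 and n_iter=10 are just an illustration, not a tuned recommendation.

import numpy as np
from sklearn.utils.extmath import randomized_svd

# same setup as the first experiment: 1000 x 500 matrix with a slowly decaying spectrum
m, n, r, k = 1000, 500, 250, 50
x = np.random.randn(m,n)
ux, _ , vhx = np.linalg.svd(x,full_matrices=False)
sx = np.hstack([np.zeros(r-1,),np.arange(r,n+1)])[::-1]
y = (ux*sx)@vhx

# near-default parameters, comparable to what TruncatedSVD uses internally
_, s_default, _ = randomized_svd(y, n_components=k, n_oversamples=10, n_iter=5, random_state=0)
# more oversampling and more power iterations
_, s_tuned, _ = randomized_svd(y, n_components=k, n_oversamples=200, n_iter=10, random_state=0)

print(np.linalg.norm(sx[:k]-s_default)) # large error, as in the experiments above
print(np.linalg.norm(sx[:k]-s_tuned))   # should be much smaller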

How to change graphtools

I propose that we replace all calls to PCA and TruncatedSVD with explicit calls to randomized_svd, and that we set a sensible n_oversamples as a multiple of the requested n_pca; the default is not very good. The sklearn documentation suggests that for a rank-k matrix, n_oversamples should be 2*k - n_components, or simply n_components when n_components >= k, but I have found that for hard problems this is not enough. We could also add an svd_kwargs keyword argument to the graph constructors to allow passing kwargs through to randomized SVD, to increase accuracy or trade accuracy for performance. A rough sketch of what this could look like is below.
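As a sketch only (names like _randomized_pca, the oversampling factor, and the svd_kwargs parameter are hypothetical, not existing graphtools API), the idea is a thin wrapper around randomized_svd with a more aggressive default for n_oversamples:

import numpy as np
from sklearn.utils.extmath import randomized_svd

def _randomized_pca(data, n_pca, svd_kwargs=None):
    # hypothetical helper to replace the TruncatedSVD / PCA(svd_solver='randomized') calls
    svd_kwargs = {} if svd_kwargs is None else dict(svd_kwargs)
    # illustrative defaults: oversample proportionally to the requested rank
    svd_kwargs.setdefault('n_oversamples', 10*n_pca)
    svd_kwargs.setdefault('n_iter', 10)
    u, s, vt = randomized_svd(data, n_components=n_pca, **svd_kwargs)
    # u*s is the n_pca-dimensional embedding, matching TruncatedSVD.transform up to sign
    return u*s

# the graph constructors would then pass svd_kwargs through, e.g. (proposed, not yet implemented):
# graphtools.Graph(data, n_pca=100, svd_kwargs={'n_oversamples': 500, 'n_iter': 20})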

@scottgigante @dburkhardt

stanleyjs · Nov 17 '21 03:11