graphtools
Inaccuracy in randomized SVD at moderate ranks
Hi,
Randomized SVD is not accurate
Currently most (if not all) of the PCA / linear dimensionality reduction / SVD in graphtools is routed through either TruncatedSVD or PCA(svd_solver='randomized'). It turns out that this solver can be pretty bad at computing even moderate-rank SVDs. Consider this pathological example, in which we create a 1000 x 500 matrix with np.hstack([np.zeros(249,), np.arange(250, 501)]) as its spectrum; note that np.arange(250, 501) contains 251 values, so the matrix has rank 251. We will also consider its best rank-50 approximation and its best rank-1 approximation.
import numpy as np
from sklearn.decomposition import TruncatedSVD

L2_sv_error = np.zeros(10)
F_lowrank_error = np.zeros(10)
k = 50          # number of components requested from TruncatedSVD
r = 250         # smallest nonzero singular value (the matrix has rank n - (r - 1) = 251)
lowrank_r = 50  # rank of the reference low-rank approximation
m = 1000
n = 500
for ix in range(10):
    x = np.random.randn(m, n)
    ux, _, vhx = np.linalg.svd(x, full_matrices=False)  # ux is 1000 x 500 and vhx is 500 x 500, with orthonormal columns/rows
    sx = np.hstack([np.zeros(r - 1), np.arange(r, n + 1)])[::-1]  # ground truth singular values: 500, 499, ..., 250, then zeros
    sx_lowrank = np.hstack([np.zeros(n - lowrank_r), np.arange(n - lowrank_r + 1, n + 1)])[::-1]  # spectrum of the best rank-50 approximation
    y = (ux * sx) @ vhx  # the ground truth matrix we want to approximate with SVD/PCA
    y_lowrank = (ux * sx_lowrank) @ vhx  # the best rank-50 approximation, built from the exact spectrum
    pca_operator = TruncatedSVD(k)
    pca_operator.fit(y)
    sy = pca_operator.singular_values_  # our estimate of the singular values
    y_transformed = pca_operator.transform(y)
    y_reconstructed = y_transformed @ pca_operator.components_  # equivalent to pca_operator.inverse_transform(pca_operator.transform(y)); this is our estimated rank-50 approximation
    L2_sv_error[ix] = np.linalg.norm(sx[:lowrank_r] - sy)  # L2 error in the top 50 singular values
    F_lowrank_error[ix] = np.linalg.norm(y_lowrank - y_reconstructed)  # Frobenius error of our rank-50 approximation
>>> print(np.mean(L2_sv_error))
55.77929710543534
>>> print(np.mean(F_lowrank_error))
1930.877814905553
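For comparison, here is a quick sanity check (my addition, reusing the same construction) showing that an exact SVD has no trouble with this matrix; the inaccuracy above is specific to the randomized solver:

```python
import numpy as np

m, n, r, lowrank_r = 1000, 500, 250, 50
rng = np.random.default_rng(0)
ux, _, vhx = np.linalg.svd(rng.standard_normal((m, n)), full_matrices=False)
sx = np.hstack([np.zeros(r - 1), np.arange(r, n + 1)])[::-1]  # same spectrum as above
y = (ux * sx) @ vhx

sy_exact = np.linalg.svd(y, compute_uv=False)  # full, exact (LAPACK) SVD
err = np.linalg.norm(sx[:lowrank_r] - sy_exact[:lowrank_r])
print(err)  # machine-precision level, versus ~55 for TruncatedSVD above
```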
It is clear that there is a problem. It turns out that if we increase k, our estimate gets better:
L2_sv_error = np.zeros(10)
F_lowrank_error = np.zeros(10)
k = 100         # request more components than the rank-50 approximation we actually want
r = 250
lowrank_r = 50
m = 1000
n = 500
for ix in range(10):
    x = np.random.randn(m, n)
    ux, _, vhx = np.linalg.svd(x, full_matrices=False)  # ux is 1000 x 500 and vhx is 500 x 500
    sx = np.hstack([np.zeros(r - 1), np.arange(r, n + 1)])[::-1]  # the ground truth singular values
    sx_lowrank = np.hstack([np.zeros(n - lowrank_r), np.arange(n - lowrank_r + 1, n + 1)])[::-1]  # spectrum of the best rank-50 approximation
    y = (ux * sx) @ vhx  # the ground truth matrix we want to approximate with SVD/PCA
    y_lowrank = (ux * sx_lowrank) @ vhx  # the best rank-50 approximation, built from the exact spectrum
    pca_operator = TruncatedSVD(k)
    pca_operator.fit(y)
    sy = pca_operator.singular_values_  # our estimate of the singular values
    y_transformed = pca_operator.transform(y)
    y_reconstructed = y_transformed[:, :lowrank_r] @ pca_operator.components_[:lowrank_r, :]  # keep only the top 50 of the 100 computed components; this is our estimated rank-50 approximation
    L2_sv_error[ix] = np.linalg.norm(sx[:lowrank_r] - sy[:lowrank_r])  # L2 error in the top 50 singular values
    F_lowrank_error[ix] = np.linalg.norm(y_lowrank - y_reconstructed)  # Frobenius error of our rank-50 approximation
>>> print(np.mean(L2_sv_error))
7.755560034045646
>>> print(np.mean(F_lowrank_error))
729.3025096061267
We can also decrease the rank of the approximation we ask for to get better accuracy. What is happening here is that randomized_svd gets more accurate as more singular vectors are requested relative to the rank of the matrix: as n_components gets closer to (and exceeds) the rank, the algorithm gets more accurate. Finally, let's look at the extreme case and compare rank-1 approximations, where the task is to estimate only a single singular pair.
L2_sv_error = np.zeros(10)
F_lowrank_error = np.zeros(10)
k = 1
r = 250
lowrank_r = 1
m = 1000
n = 500
for ix in range(10):
    x = np.random.randn(m, n)
    ux, _, vhx = np.linalg.svd(x, full_matrices=False)  # ux is 1000 x 500 and vhx is 500 x 500
    sx = np.hstack([np.zeros(r - 1), np.arange(r, n + 1)])[::-1]  # the ground truth singular values
    sx_lowrank = np.hstack([np.zeros(n - lowrank_r), np.arange(n - lowrank_r + 1, n + 1)])[::-1]  # spectrum of the best rank-1 approximation: a single singular value of 500
    y = (ux * sx) @ vhx  # the ground truth matrix we want to approximate with SVD/PCA
    y_lowrank = (ux * sx_lowrank) @ vhx  # the best rank-1 approximation, built from the exact spectrum
    pca_operator = TruncatedSVD(k)
    pca_operator.fit(y)
    sy = pca_operator.singular_values_  # our estimate of the singular values
    y_transformed = pca_operator.transform(y)
    y_reconstructed = y_transformed[:, :lowrank_r] @ pca_operator.components_[:lowrank_r, :]  # our estimated rank-1 approximation
    L2_sv_error[ix] = np.linalg.norm(sx[:lowrank_r] - sy[:lowrank_r])  # error in the top singular value
    F_lowrank_error[ix] = np.linalg.norm(y_lowrank - y_reconstructed)  # Frobenius error of our rank-1 approximation
>>> print(np.mean(L2_sv_error))
10.74597935589037
>>> print(np.mean(F_lowrank_error))
772.1048200820816
We can make the algorithm more accurate
It turns out that there are a lot of edge cases where randomized SVD will fail: the matrix is too large, it is ill-conditioned, the rank is too high, and so on. However, a few parameters of the inner function, sklearn.utils.extmath.randomized_svd, can be tweaked to make things more accurate. The biggest one is n_oversamples, followed by n_iter.
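To illustrate (a sketch I put together, not graphtools code): calling randomized_svd directly on the rank-1 task from the last experiment, heavier oversampling plus extra power iterations essentially recovers the exact top singular value, while the default settings are less accurate:

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

m, n, r = 1000, 500, 250
rng = np.random.default_rng(0)
ux, _, vhx = np.linalg.svd(rng.standard_normal((m, n)), full_matrices=False)
sx = np.hstack([np.zeros(r - 1), np.arange(r, n + 1)])[::-1]  # top singular value is 500
y = (ux * sx) @ vhx

# default oversampling (n_oversamples=10)
_, s_default, _ = randomized_svd(y, n_components=1, random_state=0)
# aggressive oversampling plus extra power iterations
_, s_tuned, _ = randomized_svd(y, n_components=1, n_oversamples=n, n_iter=10,
                               random_state=0)
print(abs(500 - s_default[0]), abs(500 - s_tuned[0]))
```

Setting n_oversamples as large as n is overkill here (it makes the sketch span the whole row space, so the result is essentially exact), but it shows the knob works; smaller values trade accuracy for speed.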
How to change graphtools
I propose that we replace all calls to PCA and TruncatedSVD with explicit calls to randomized_svd, and set a sensible n_oversamples as a factor of the requested n_pca. The default is not very good. The sklearn documentation suggests that for a rank-k matrix, n_oversamples should be 2*k - n_components, or simply n_components when n_components >= k, but I have found that for hard problems this is not enough. We can also add an svd_kwargs keyword argument to the graph constructors to allow passing kwargs through to randomized_svd, so users can increase accuracy or trade accuracy for performance.
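As a rough sketch of what I have in mind (compute_pca and the svd_kwargs plumbing below are hypothetical names for illustration, not existing graphtools API):

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

def compute_pca(data, n_pca, svd_kwargs=None):
    # Hypothetical replacement for the internal TruncatedSVD/PCA step.
    # Default n_oversamples scales with n_pca instead of sklearn's fixed 10;
    # callers can override anything via svd_kwargs.
    kwargs = dict(n_oversamples=10 * n_pca, n_iter=7)
    kwargs.update(svd_kwargs or {})
    U, S, Vt = randomized_svd(data, n_components=n_pca, **kwargs)
    return U * S  # same embedding TruncatedSVD.transform would return (up to sign)

# a caller trading some accuracy for speed on an easy problem:
emb = compute_pca(np.random.randn(200, 100), n_pca=10,
                  svd_kwargs={"n_oversamples": 20, "n_iter": 4})
print(emb.shape)  # (200, 10)
```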
@scottgigante @dburkhardt