pca-magic

Total variance explained > 1

Open tsgouvea opened this issue 5 years ago • 3 comments

Forgive me if this is a known counterintuitive point deemed irrelevant, but I noticed total variance explained by all components is greater than one. That's true in my dataset with missing values, but also in the complete example below.

import numpy as np
from ppca import PPCA

x = np.random.randn(50,20)
m = PPCA()
m.fit(data=x)

print(m.var_exp)
[0.11460246 0.21691676 0.30977113 0.40169889 0.4885789  0.56032857
 0.62697946 0.68458968 0.73693932 0.78439966 0.82519526 0.86416853
 0.89399395 0.9215888  0.94654215 0.9696493  0.98877802 1.00211534
 1.01186025 1.02040816]

It seems to be related to the fact that the sum of all eigenvalues is greater than the number of dimensions in the original dataset. Since the sum of the eigenvalues should equal the trace of the correlation matrix (20 here), I would not expect that to be the case.

print(np.cumsum(m.eig_vals)/20.)
[0.11460246 0.21691676 0.30977113 0.40169889 0.4885789  0.56032857
 0.62697946 0.68458968 0.73693932 0.78439966 0.82519526 0.86416853
 0.89399395 0.9215888  0.94654215 0.9696493  0.98877802 1.00211534
 1.01186025 1.02040816]

print(m.eig_vals.sum())
20.40816326530614
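
For reference, the identity invoked above (the eigenvalues of a correlation matrix sum to its trace, i.e. the number of columns) can be checked with plain NumPy. This is a minimal sketch that does not use pca-magic; the seed and variable names are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
x = rng.standard_normal((50, 20))

# Correlation matrix of the 20 columns; its diagonal is all ones.
corr = np.corrcoef(x, rowvar=False)

# eigvalsh is meant for symmetric matrices and returns ascending eigenvalues.
eig_vals = np.linalg.eigvalsh(corr)

trace = np.trace(corr)
total = eig_vals.sum()

# Cumulative share of variance, largest component first.
cum_ratio = np.cumsum(eig_vals[::-1]) / trace

print(trace, total)   # both equal 20 up to floating-point error
print(cum_ratio[-1])  # equals 1 up to floating-point error
```

With a consistent normalization the cumulative ratio ends at 1, so a final value of about 1.02 points to the numerator and denominator being scaled differently.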

tsgouvea, Nov 05 '18 16:11

Correct me if this is wrong, but is the above because var_exp is the variance of each component as opposed to the percentage explained variance?

burtonrj, Aug 02 '20 16:08

This is what I see in the README.md:

variance_explained = ppca.var_exp

That makes me think var_exp should be the variance explained.

ichitaka, Aug 02 '20 20:08

Not really. According to the code, the reason is an inconsistent use of degrees of freedom in the calculation of covariance (as in the numerator) and variance (as in the denominator).

When calculating the variance explained by each PC, np.cov is used with default parameters. According to the documentation of np.cov, the default divisor is the sample size minus 1 (N - 1). https://github.com/allentran/pca-magic/blob/7eef1c02194c718c10fb7e33aa31eafca3b8fa74/ppca/_ppca.py#L98

However, when calculating the total variance, np.nanvar is used with default parameters. Based on numpy's documentation, in this case, the divisor is the sample size, i.e. no degrees of freedom adjustment. https://github.com/allentran/pca-magic/blob/7eef1c02194c718c10fb7e33aa31eafca3b8fa74/ppca/_ppca.py#L127-L129

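As a result, the numerator is larger than the denominator by a factor of N/(N-1), which causes the >1 explained variance ratio. With N = 50, as in the example above, the final cumulative value is 50/49 ≈ 1.0204, which matches the reported output. A simple fix is to multiply var_exp by (N-1)/N, where N is the sample size.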
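
The mismatch can be reproduced outside the library. The sketch below is not pca-magic's code; it only mimics the two defaults described above, and it assumes the data are standardized with np.nanstd (divisor N) before the covariance is taken. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
n, d = 50, 20
x = rng.standard_normal((n, d))

# Assumption: columns are standardized with the default divisor N (ddof=0).
z = (x - np.nanmean(x, axis=0)) / np.nanstd(x, axis=0)

# Numerator: eigenvalues of np.cov, whose default divisor is N - 1.
eig_vals = np.linalg.eigvalsh(np.cov(z, rowvar=False))[::-1]

# Denominator: total variance from np.nanvar, whose default divisor is N.
total_var = np.nanvar(z, axis=0).sum()

var_exp = np.cumsum(eig_vals) / total_var
print(eig_vals.sum())  # d * N / (N - 1) = 20 * 50 / 49
print(var_exp[-1])     # N / (N - 1) = 50 / 49

# Rescaling by (N - 1) / N brings the cumulative ratio back to 1.
var_exp_fixed = var_exp * (n - 1) / n
print(var_exp_fixed[-1])  # equals 1 up to floating-point error
```

An equivalent option would be to make the two divisors agree at the source, for example by passing ddof=1 to np.nanvar or bias=True to np.cov, rather than rescaling afterwards.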

flcong, Nov 27 '23 03:11