cca_zoo
Add method to compute explained covariance
I know this is easy to compute, but it's still nice to have out of the box: wouldn't it make sense to also add a compute_covariance method to _CCA_Base? Something like this:
import numpy as np
from scipy.stats import zscore

def explained_variance(X, y, X_weights, y_weights):
    '''Compute the covariance explained by each pair of canonical variates'''
    # standardize X and y
    X_std = zscore(X, axis=0, ddof=1)
    y_std = zscore(y, axis=0, ddof=1)
    # covariance matrix between the canonical variates
    covmat = X_weights.T @ X_std.T @ y_std @ y_weights
    # squared covariance of each variate pair as a fraction of the total
    explained_var = np.diag(covmat)**2 / np.sum(np.diag(covmat)**2)
    return explained_var
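For context, a minimal usage sketch on random data (the shapes and variable names below are illustrative only):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.normal(size=(100, 4))
X_weights = rng.normal(size=(5, 2))  # stand-ins for weights from a fitted model
y_weights = rng.normal(size=(4, 2))
ratios = explained_variance(X, y, X_weights, y_weights)
print(ratios, ratios.sum())  # per-component ratios, summing to 1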
Yeah cool idea - would be delighted to accept a PR with essentially what you have described here. Otherwise will take a look in the future when I have some spare time :)
Have you got a citation for the calculation?
Yes, I implemented it from here. The corresponding citation would be:
DOI: 10.1038/s41467-018-05317-y
I currently do not have the time for a PR, but I will try to do it as soon as I have more time!
I'm trying to understand: is the explained variance calculated here something like how much variance CCX1 explains in X, or how much variance CCX1 explains in CCY1? (CCX1 = the constructed first component for X.)
Thanks in advance!
AFAIK this is the explained covariance, so it describes how much of the total covariance between X and y is captured by the covariance of the respective pairs of variates.
I have actually moved the explained variance out of the BaseModel class (it is available to models with 'PLS'-style constraints).
This is because the CCA models in this package are normalized so that the transformed/low-dimensional representations (the canonical variates) have unit variance in expectation, whereas in PLS-style models it is the weights that are normalized.
So to compute explained variance for a CCA model you would first need to normalize the weights to unit norm.
I could add the explained variance method back in, with something like a normalized_weights attribute that rescales the weights. Something like the below:
def _normalized_weights(self):
    """
    Returns the weights rescaled to unit norm

    Returns
    -------
    normalized_weights : list of numpy arrays
    """
    # divide each column of every weight matrix by its Euclidean norm
    return [w / np.linalg.norm(w, axis=0) for w in self.weights]
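As a quick sanity check (a sketch assuming a weights attribute holding one weight matrix per view, with random stand-ins here):

import numpy as np

weights = [np.random.rand(5, 2), np.random.rand(4, 2)]  # stand-in for self.weights
normalized = [w / np.linalg.norm(w, axis=0) for w in weights]
# every column should now have unit Euclidean norm
assert all(np.allclose(np.linalg.norm(w, axis=0), 1) for w in normalized)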
In fact this strikes me as a good idea.
And I agree with @JohannesWiesner's description of explained variance, having been through a similar struggle with the literature a while back, @WantongLi123!
In fact, I'm not sure the implementation quoted above, from my previous understanding, is right!
The gist is that the total variance in each view of our data can be computed as, e.g., the sum of squared singular values divided by n (or by n-1 with Bessel's correction).
If we project our data onto a subspace (and, importantly, don't scale it, i.e. we transform with unit-norm vectors), then we can compute the variance of the subspace by the same method, and the explained variance is the ratio of the variance in the subspace to the variance in the original data.
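To make that concrete, here is a small numerical sketch (all names are illustrative): the sum of squared singular values of the centered data divided by n-1 matches the total variance, and a unit-norm projection gives an explained variance ratio:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Xc = X - X.mean(axis=0)  # center the data first

# total variance via singular values, with Bessel's correction
s = np.linalg.svd(Xc, compute_uv=False)
total_var = np.sum(s**2) / (Xc.shape[0] - 1)
assert np.isclose(total_var, np.var(Xc, axis=0, ddof=1).sum())

# project onto a unit-norm direction so the subspace is not rescaled
w = rng.normal(size=(5, 1))
w /= np.linalg.norm(w)
z = Xc @ w
explained_ratio = np.var(z, axis=0, ddof=1) / total_var
print(explained_ratio)  # fraction of the total variance captured by this direction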
Hey James, happy to catch you online. I've been trying to calculate the explained variance ratio for MCCA with multiple views, struggled for a day, and failed. Any chance you could incorporate that function into the code?
This will be tacked onto BaseModel in the current version of the package:
from typing import Iterable

import numpy as np
from numpy.linalg import svd
from sklearn.utils.validation import check_is_fitted

def explained_variance(self, views: Iterable[np.ndarray]):
    """
    Calculates the explained variance ratio for each latent dimension.

    Parameters
    ----------
    views : list/tuple of numpy arrays or array likes with the same number of rows (samples)

    Returns
    -------
    explained_variance_ratios : list of numpy arrays
    """
    # Ensure that the model is already fitted
    check_is_fitted(self, attributes=["weights"])
    # Normalize the weight vectors to unit norm so the projections are not rescaled
    normalized_weights = [
        weights / np.linalg.norm(weights, axis=0) for weights in self.weights
    ]
    # Transform the views using the normalized weights
    transformed_views = [
        view @ weights for view, weights in zip(views, normalized_weights)
    ]
    # Total variance of each (centered) view from its singular values
    total_vars = [
        np.sum(svd(view, compute_uv=False) ** 2) / (view.shape[0] - 1)
        for view in views
    ]
    # Variance of each latent dimension in the transformed views
    # (ddof=1 for consistency with the Bessel-corrected total variance)
    transformed_vars = [
        np.var(transformed, axis=0, ddof=1) for transformed in transformed_views
    ]
    # Explained variance ratio for each latent dimension of each view
    explained_variance_ratios = [
        transformed_var / total_var
        for transformed_var, total_var in zip(transformed_vars, total_vars)
    ]
    return explained_variance_ratios
And should work.
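A quick usage sketch (DummyModel below is just an illustrative stand-in for a fitted estimator exposing a weights attribute, not part of the package):

import numpy as np
from sklearn.base import BaseEstimator

class DummyModel(BaseEstimator):
    # illustrative stand-in: "fitting" just draws random per-view weights
    def fit(self, views):
        self.weights = [np.random.rand(view.shape[1], 2) for view in views]
        return self

rng = np.random.default_rng(0)
views = [rng.normal(size=(100, 5)), rng.normal(size=(100, 4))]
views = [view - view.mean(axis=0) for view in views]  # center each view
model = DummyModel().fit(views)
ratios = explained_variance(model, views)  # call the method as a plain function
print([r.round(3) for r in ratios])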
I have now added explained_variance, explained_variance_ratio, and explained_variance_cumulative, as well as explained_covariance, explained_covariance_ratio, and explained_covariance_cumulative, in the latest release.
This has been achieved by adding a new class attribute 'loadings', which holds the normalized weights and is equivalent to PCA loadings, i.e. ||w|| = 1 and w = loadings.
I have renamed factor_loadings to canonical loadings following a Stack Overflow discussion, while recognising the confusion that exists in the literature.
Thanks for the encouragement to fix this up @WantongLi123! It's a nice change that I think has a lot of value in post-hoc analysis of these kind of models.