Add method to compute explained covariance

Open JohannesWiesner opened this issue 3 years ago • 4 comments

I know this is easy to compute, but it would still be nice to have out of the box: wouldn't it make sense to also add a compute_covariance method to _CCA_Base? Something like this:

import numpy as np
from scipy.stats import zscore

def explained_variance(X, y, X_weights, y_weights):
    '''Compute the fraction of covariance explained by each canonical component'''

    # standardize X and y
    X_std = zscore(X, axis=0, ddof=1)
    y_std = zscore(y, axis=0, ddof=1)

    # cross-covariance matrix of the canonical variates (up to a 1/(n-1) factor)
    covmat = X_weights.T @ X_std.T @ y_std @ y_weights

    # squared covariance of each component as a fraction of the total
    explained_var = np.diag(covmat)**2 / np.sum(np.diag(covmat)**2)

    return explained_var
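
For illustration, a minimal usage sketch with synthetic data (my own example, not part of the package); it assumes the function above is in scope, and in practice X_weights and y_weights would come from a fitted CCA/PLS model rather than np.random:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # 100 samples, 10 features in view X
y = rng.normal(size=(100, 8))         # 100 samples, 8 features in view y
X_weights = rng.normal(size=(10, 3))  # stand-in weights for 3 components
y_weights = rng.normal(size=(8, 3))

ratios = explained_variance(X, y, X_weights, y_weights)
print(ratios)        # one value per component
print(ratios.sum())  # the ratios sum to 1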

JohannesWiesner avatar Feb 10 '22 16:02 JohannesWiesner

Yeah cool idea - would be delighted to accept a PR with essentially what you have described here. Otherwise will take a look in the future when I have some spare time :)

jameschapman19 avatar Feb 13 '22 16:02 jameschapman19

Have you got a citation for the calculation?

jameschapman19 avatar Mar 04 '22 13:03 jameschapman19

Yes, I implemented it from here. Therefore the corresponding citation would be:

DOI: 10.1038/s41467-018-05317-y

JohannesWiesner avatar Mar 08 '22 15:03 JohannesWiesner

I currently do not have the time for a PR, but I will try to do it as soon as I have more time!

JohannesWiesner avatar Mar 08 '22 15:03 JohannesWiesner

I'm trying to understand: is the explained variance calculated in the snippet from the original post something like CCX1's variance in explaining X, or CCX1's variance in explaining CCY1? (CCX1 = the 1st constructed component for X.)

Thanks in advance!

WantongLi123 avatar Aug 16 '23 15:08 WantongLi123

AFAIK this is the explained covariance, so it describes how much of the total covariance between X and y is explained by the covariance of the respective canonical variates.

JohannesWiesner avatar Aug 17 '23 12:08 JohannesWiesner

I have actually moved the explained variance out of the BaseModel class (it is available to models with 'PLS'-style constraints).

This is because the CCA models in this package are normalized so that the transformed/low-dimensional representations/canonical variates have unit variance in expectation, whereas in PLS-style models it is the weights that are normalized, so the variates retain their variance.

So to compute explained variance for a CCA model you would first need to renormalize the weights to unit norm.

I could add back in the explained variance method with something like a normalized_weights attribute that scales the weights. Something like the below

    def _normalized_weights(self):
        """
        Returns the normalized weights

        Returns
        -------
        normalized_weights : list of numpy arrays

        """
        return [w / np.linalg.norm(w, axis=0) for w in self.weights]

jameschapman19 avatar Aug 17 '23 12:08 jameschapman19

In fact this strikes me as a good idea.

And I agree with @JohannesWiesner's description of explained variance, having been through a similar struggle with the literature a while back, @WantongLi123!

In fact, I'm not sure about the implementation you quoted, which reflects my previous understanding!

The gist is that the total variance in each view of our data can be computed as, e.g., the sum of the squared singular values divided by n (or n-1 with Bessel's correction).

If we find a subspace of our data (and importantly don't scale it, i.e. we transform with unit-norm vectors), then we can measure the variance of that subspace by the same method, and the explained variance is the ratio of the variance in the subspace to the variance in the original data.
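
As a rough numerical sketch of that reasoning (my own illustration, not package code): for a mean-centred view, the sum of squared singular values over n-1 equals the total variance, and projecting onto unit-norm directions gives a subspace variance that can be divided by that total:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Xc = X - X.mean(axis=0)                  # centre the view

# total variance: sum of squared singular values / (n - 1)
s = np.linalg.svd(Xc, compute_uv=False)
total_var = np.sum(s**2) / (Xc.shape[0] - 1)
# ... which equals the sum of the per-feature variances
assert np.isclose(total_var, np.var(Xc, axis=0, ddof=1).sum())

# project onto unit-norm directions (random here, purely illustrative)
W = rng.normal(size=(5, 2))
W = W / np.linalg.norm(W, axis=0)        # unit-norm columns, so no rescaling
Z = Xc @ W

# explained variance ratio of this subspace
subspace_var = np.var(Z, axis=0, ddof=1)
print(subspace_var / total_var)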

jameschapman19 avatar Aug 17 '23 12:08 jameschapman19

Hey James, happy to catch you online. I've been trying to calculate the explained variance ratio for MCCA with multiple views, struggled for a day, and failed. Any chance that you could incorporate that function into the code?

WantongLi123 avatar Aug 17 '23 12:08 WantongLi123

This will be tacked onto BaseModel in the current version of the package:

    def explained_variance(self, views: Iterable[np.ndarray]):
        """
        Calculates the explained variance for each latent dimension.

        Parameters
        ----------
        views : list/tuple of numpy arrays or array likes with the same number of rows (samples)

        Returns
        -------
        explained_variance_ratios : list of numpy arrays

        """
        # Ensure that the model is already fitted
        check_is_fitted(self, attributes=["weights"])

        # Normalize the weight vectors to unit norm so the projection does not rescale the data
        normalized_weights = [
            weights / np.linalg.norm(weights, axis=0) for weights in self.weights
        ]

        # Transform the views using the normalized weights
        transformed_views = [
            view @ weights for view, weights in zip(views, normalized_weights)
        ]

        # Calculate the total variance of the original data from the squared singular values
        # (equals the sum of per-feature variances when the views are mean-centered)
        total_vars = [
            np.sum(s**2) / (view.shape[0] - 1)
            for view in views
            for _, s, _ in [svd(view)]
        ]

        # Calculate the variance of each latent dimension in the transformed views
        # (ddof=1 to match the n-1 denominator used for the total variance)
        transformed_vars = [
            np.var(transformed, axis=0, ddof=1) for transformed in transformed_views
        ]

        # Calculate the explained variance ratio for each latent dimension for each view
        explained_variance_ratios = [
            transformed_var / total_var for transformed_var, total_var in zip(transformed_vars, total_vars)
        ]

        return explained_variance_ratios

And should work.
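
For the multi-view MCCA case @WantongLi123 asked about, usage would look roughly like the sketch below; the import path and constructor argument names are from memory and vary between cca_zoo releases, so treat them as assumptions rather than the exact API:

import numpy as np
from cca_zoo.models import MCCA  # newer releases may expose this as cca_zoo.linear.MCCA

rng = np.random.default_rng(0)
views = [
    rng.normal(size=(100, 10)),
    rng.normal(size=(100, 8)),
    rng.normal(size=(100, 6)),
]

model = MCCA(latent_dims=3)  # may be named latent_dimensions in newer versions
model.fit(views)

# one array per view, one explained variance ratio per latent dimension
for ratios in model.explained_variance(views):
    print(ratios)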

jameschapman19 avatar Aug 17 '23 13:08 jameschapman19

I have now added explained_variance, explained_variance_ratio, and explained_variance_cumulative, as well as explained_covariance, explained_covariance_ratio, and explained_covariance_cumulative, in the latest release.

This has been achieved by adding a new class attribute 'loadings', which holds the normalized weights and is equivalent to PCA loadings, i.e. ||w|| = 1 and w = loadings.

I have renamed factor_loadings to canonical loadings, following a Stack Overflow discussion, while recognising the confusion that exists in the literature.

jameschapman19 avatar Aug 18 '23 13:08 jameschapman19

Thanks for the encouragement to fix this up @WantongLi123! It's a nice change that I think has a lot of value in the post-hoc analysis of these kinds of models.

jameschapman19 avatar Aug 18 '23 13:08 jameschapman19