cca_zoo GRCCA gives quite different results with original implementation

Hi James, similar to #107, I implemented a test to check if cca_zoo.models.GRCCA gives similar results as the original implementation. But the feature weights are quite dissimilar especially for the much smaller y input array:

Here's the testing code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Sanity check if cca-zoo's implementation of GRCCA gives the same result
as the original implementation'

Requires an environment with:
- cca-zoo 
- r-base (conda install -c conda-forge r-base)
- r-devtools (conda install -c conda-forge r-devtools)
- RCCA (installed via: devtools::install_github('https://github.com/ElenaTuzhilina/RCCA')
- rpy2 (conda install -c conda-forge rpy2)

Useful links:
- https://github.com/murraylab/gemmr/blob/master/gemmr/estimators/r_estimators.py
- https://github.com/Teekuningas/sparsecca/blob/master/tests/test_compare_pmd_to_r.py
    

@author: johannes.wiesner

"""


import numpy as np
rng = np.random.RandomState(42)
from cca_zoo.models import GRCCA
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()
import matplotlib.pyplot as plt

###############################################################################
# Get original data  and parameters as in original example
###############################################################################

robjects.r(
'''
library(RCCA) 
data(X)
data(Y)
n.groups = 5
group1 = rep(1:n.groups,rep(ncol(X)/n.groups,n.groups))
group2 = rep(1,ncol(Y))
lambda1 = 1000
lambda2 = 0
mu1 = 100
mu2 = 0
''')

X = np.array(robjects.globalenv['X'])
y = np.array(robjects.globalenv['Y'])
group1 = np.array(robjects.globalenv['group1'])
group2 = np.array(robjects.globalenv['group2'])
lambda1 = robjects.globalenv['lambda1'][0]
lambda2 = robjects.globalenv['lambda2'][0]
mu1 = robjects.globalenv['mu1'][0]
mu2 = robjects.globalenv['mu2'][0]

###############################################################################
# Run original and cca-zoo implementation and return coefficients for X
###############################################################################

def test(X,y,group1,group2,lambda1,lambda2,mu1,mu2,rng):

    # run original implementation
    RCCA = importr('RCCA')
    results_R = RCCA.GRCCA(X=X,
                           Y=y,
                           group1=group1,
                           group2=group2,
                           lambda1=lambda1,
                           lambda2=lambda2,
                           mu1=mu1,
                           mu2=mu2)
    
    # save original coefficients
    xcoefs = results_R.rx2('x.coefs')
    ycoefs = results_R.rx2('y.coefs')
    
    # save number of components because in the orignal implementation number
    # of components is always set to maximum using n.comp = min(ncol(X), ncol(Y), nrow(X))
    n_comp = results_R.rx2('n.comp')[0]
    
    # re-run with cca-zoo implementation
    # TODO: I think original implementation does not standardize?
    estimator = GRCCA(latent_dims=n_comp,
                      c=[lambda1,lambda2],
                      mu=[mu1,mu2],
                      random_state=rng,
                      scale=False,
                      centre=False)
    
    # in python, things start with 0, not with 1
    feature_groups = [np.int64(group1)-1,np.int64(group2)-1]
    estimator.fit([X,y],feature_groups=feature_groups)
    
    # save replicated coefficients
    X_weights,y_weights = estimator.weights
    
    return xcoefs,X_weights,ycoefs,y_weights

xcoefs,X_weights,ycoefs,y_weights = test(X,y,group1,group2,lambda1,lambda2,mu1,mu2,rng)

###############################################################################
# Plot deviations
###############################################################################

for org,rep in [[xcoefs,X_weights],[ycoefs,y_weights]]:

    # plot distribution of deviations (ignore zeros for plotting purposes)
    plt.figure()
    deviations = np.abs(org)-np.abs(rep)
    deviations = deviations.flatten()
    deviations = deviations[deviations != 0]
    plt.hist(deviations)
    

###############################################################################
# Throw error if not close
###############################################################################

assert(np.allclose(np.abs(xcoefs),np.abs(X_weights)))
assert(np.allclose(np.abs(ycoefs),np.abs(y_weights)))

Feb 10 '23 15:02 JohannesWiesner

Looks like a scaling thing a bit like #162 - the correlations of the weights are >0.90 or higher. I'm pushing an update that will ensure that for cca_zoo models wXXw=n where n is the number of samples.

Feb 10 '23 15:02 jameschapman19

putting weights on the same scale

Feb 10 '23 15:02 jameschapman19

Would be good to get a bit closer though I'll take another look at the implementation.

Feb 10 '23 15:02 jameschapman19

ok think I've found the source of the error. Will let you know the results soon.

Feb 10 '23 16:02 jameschapman19

OK the source of this is basically a misalignment between RCCA lambda and cca_zoo c.

lambda gets added to the covariance matrices but c slides between 0 regularisation and 1 maximal regularisation (all ones on the diagonal).

Feb 10 '23 16:02 jameschapman19

Would it make sense to include a testing folder somewhere in cca_zoo that checks original R-implementations against cca_zoos doppelgängers? I noticed that it's actually quite handy to just be able to import R-packages with RCCA = importr('RCCA'). Let me know if you're interested. I don't know anything about CI, but at least some folders for semi-automatic tests maybe?

Feb 10 '23 16:02 JohannesWiesner

Yeah definitely! Will almost certainly add this one in somewhere. Obviously there's some sharp edges (e.g. here where the parameterizations don't quite match) but no harm in having them

Feb 10 '23 16:02 jameschapman19

I've used variants of your script to check that the source of the difference is lambda/c and convinced myself.

PRCCA with no regularisation (and fewer features than samples) everything matches

GRCCA with very high regularisation also matches

GRCCA with no regularisation doesn't match but it is because of line 91 in https://github.com/ElenaTuzhilina/RCCA/blob/master/R/GRCCA.R where PRCCA is called with lambda=1 regardless of what the user inputs. This means it's not really possible to match cca_zoo and RCCA up in this instance. It seems a bit odd that they do this I think it's possible its a bug in RCCA.

Feb 10 '23 16:02 jameschapman19

Yeah definitely! Will almost certainly add this one in somewhere. Obviously, there's some sharp edges (e.g. here where the parameterizations don't quite match) but no harm in having them

Let me know if I can help with a PR!

GRCCA with no regularisation doesn't match but it is because of line 91 in https://github.com/ElenaTuzhilina/RCCA/blob/master/R/GRCCA.R where PRCCA is called with lambda=1 regardless of what the user inputs. This means it's not really possible to match cca_zoo and RCCA up in this instance. It seems a bit odd that they do this I think it's possible its a bug in RCCA.

Ah okay. Do you still have to update cca_zoo.models.GRCCA then or is the deviation purely stemming from this bug?

Feb 10 '23 16:02 JohannesWiesner

cca_zoo cca_zoo copied to clipboard

GRCCA gives quite different results with original implementation

cca_zoo
cca_zoo copied to clipboard