cca_zoo
                                
                                 cca_zoo copied to clipboard
                                
                                    cca_zoo copied to clipboard
                            
                            
                            
                        GRCCA gives quite different results with original implementation
Hi James, similar to #107, I implemented a test to check if cca_zoo.models.GRCCA gives similar results as the original implementation. But the feature weights are quite dissimilar especially for the much smaller y input array:

Here's the testing code:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Sanity check if cca-zoo's implementation of GRCCA gives the same result
as the original implementation'
Requires an environment with:
- cca-zoo 
- r-base (conda install -c conda-forge r-base)
- r-devtools (conda install -c conda-forge r-devtools)
- RCCA (installed via: devtools::install_github('https://github.com/ElenaTuzhilina/RCCA')
- rpy2 (conda install -c conda-forge rpy2)
Useful links:
- https://github.com/murraylab/gemmr/blob/master/gemmr/estimators/r_estimators.py
- https://github.com/Teekuningas/sparsecca/blob/master/tests/test_compare_pmd_to_r.py
    
@author: johannes.wiesner
"""
import numpy as np
rng = np.random.RandomState(42)
from cca_zoo.models import GRCCA
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()
import matplotlib.pyplot as plt
###############################################################################
# Get original data  and parameters as in original example
###############################################################################
robjects.r(
'''
library(RCCA) 
data(X)
data(Y)
n.groups = 5
group1 = rep(1:n.groups,rep(ncol(X)/n.groups,n.groups))
group2 = rep(1,ncol(Y))
lambda1 = 1000
lambda2 = 0
mu1 = 100
mu2 = 0
''')
X = np.array(robjects.globalenv['X'])
y = np.array(robjects.globalenv['Y'])
group1 = np.array(robjects.globalenv['group1'])
group2 = np.array(robjects.globalenv['group2'])
lambda1 = robjects.globalenv['lambda1'][0]
lambda2 = robjects.globalenv['lambda2'][0]
mu1 = robjects.globalenv['mu1'][0]
mu2 = robjects.globalenv['mu2'][0]
###############################################################################
# Run original and cca-zoo implementation and return coefficients for X
###############################################################################
def test(X,y,group1,group2,lambda1,lambda2,mu1,mu2,rng):
    # run original implementation
    RCCA = importr('RCCA')
    results_R = RCCA.GRCCA(X=X,
                           Y=y,
                           group1=group1,
                           group2=group2,
                           lambda1=lambda1,
                           lambda2=lambda2,
                           mu1=mu1,
                           mu2=mu2)
    
    # save original coefficients
    xcoefs = results_R.rx2('x.coefs')
    ycoefs = results_R.rx2('y.coefs')
    
    # save number of components because in the orignal implementation number
    # of components is always set to maximum using n.comp = min(ncol(X), ncol(Y), nrow(X))
    n_comp = results_R.rx2('n.comp')[0]
    
    # re-run with cca-zoo implementation
    # TODO: I think original implementation does not standardize?
    estimator = GRCCA(latent_dims=n_comp,
                      c=[lambda1,lambda2],
                      mu=[mu1,mu2],
                      random_state=rng,
                      scale=False,
                      centre=False)
    
    # in python, things start with 0, not with 1
    feature_groups = [np.int64(group1)-1,np.int64(group2)-1]
    estimator.fit([X,y],feature_groups=feature_groups)
    
    # save replicated coefficients
    X_weights,y_weights = estimator.weights
    
    return xcoefs,X_weights,ycoefs,y_weights
xcoefs,X_weights,ycoefs,y_weights = test(X,y,group1,group2,lambda1,lambda2,mu1,mu2,rng)
###############################################################################
# Plot deviations
###############################################################################
for org,rep in [[xcoefs,X_weights],[ycoefs,y_weights]]:
    # plot distribution of deviations (ignore zeros for plotting purposes)
    plt.figure()
    deviations = np.abs(org)-np.abs(rep)
    deviations = deviations.flatten()
    deviations = deviations[deviations != 0]
    plt.hist(deviations)
    
###############################################################################
# Throw error if not close
###############################################################################
assert(np.allclose(np.abs(xcoefs),np.abs(X_weights)))
assert(np.allclose(np.abs(ycoefs),np.abs(y_weights)))
Looks like a scaling thing a bit like #162 - the correlations of the weights are >0.90 or higher. I'm pushing an update that will ensure that for cca_zoo models wXXw=n where n is the number of samples.
 

putting weights on the same scale
Would be good to get a bit closer though I'll take another look at the implementation.
ok think I've found the source of the error. Will let you know the results soon.
OK the source of this is basically a misalignment between RCCA lambda and cca_zoo c.
lambda gets added to the covariance matrices but c slides between 0 regularisation and 1 maximal regularisation (all ones on the diagonal).
Would it make sense to include a testing folder somewhere in cca_zoo that checks original R-implementations against cca_zoos doppelgängers? I noticed that it's actually quite handy to just be able to import R-packages with RCCA = importr('RCCA'). Let me know if you're interested. I don't know anything about CI, but at least some folders for semi-automatic tests maybe?
Yeah definitely! Will almost certainly add this one in somewhere. Obviously there's some sharp edges (e.g. here where the parameterizations don't quite match) but no harm in having them
I've used variants of your script to check that the source of the difference is lambda/c and convinced myself.
PRCCA with no regularisation (and fewer features than samples) everything matches
GRCCA with very high regularisation also matches
GRCCA with no regularisation doesn't match but it is because of line 91 in https://github.com/ElenaTuzhilina/RCCA/blob/master/R/GRCCA.R where PRCCA is called with lambda=1 regardless of what the user inputs. This means it's not really possible to match cca_zoo and RCCA up in this instance. It seems a bit odd that they do this I think it's possible its a bug in RCCA.
Yeah definitely! Will almost certainly add this one in somewhere. Obviously, there's some sharp edges (e.g. here where the parameterizations don't quite match) but no harm in having them
Let me know if I can help with a PR!
GRCCA with no regularisation doesn't match but it is because of line 91 in https://github.com/ElenaTuzhilina/RCCA/blob/master/R/GRCCA.R where PRCCA is called with lambda=1 regardless of what the user inputs. This means it's not really possible to match cca_zoo and RCCA up in this instance. It seems a bit odd that they do this I think it's possible its a bug in RCCA.
Ah okay. Do you still have to update cca_zoo.models.GRCCA then or is the deviation purely stemming from this bug?