Comparison between standard Pearson/Spearman correlation and propr
Hi, I'm running a correlation/proportionality analysis on a multiomic dataset (metagenomic, metatranscriptomic and metaproteomic). I'm trying to correlate features in one dataset to the same features in one of the other datasets, so for example the abundance of a gene in the metagenomes to the same gene in the metatranscriptome.
I've done this with propr, following the 'column join' strategy suggested in Quinn et al. 2019, where the counts in each dataset are CLR-transformed separately and then joined into one table before analysing with propr:
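library(propr)

# df: joined table of the separately CLR-transformed datasets
pr <- propr(
  counts = df,
  metric = "rho",
  ivar = NA,    # NA skips propr's built-in log-ratio transform (data are already CLR)
  alpha = NA,   # no Box-Cox transformation
  p = 100       # number of permutations
)

# Estimate the FDR for a grid of rho cutoffs
pr <- updateCutoffs(
  object = pr,
  custom_cutoffs = seq(.01, 1, .01),
  tails = "right",
  ncores = 8
)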
This gives me an all-against-all comparison of the features, but because I'm primarily interested in how a single feature relates to the same feature in another dataset, I've then filtered the results down to only those same-feature pairs.
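To make the preprocessing and the filtering concrete, here is a rough Python sketch of the column join and of which pairs I keep. It is illustrative only: the function names, the `_mg`/`_mt` suffixes and the pseudocount of 1 are assumptions made for the example, not the exact code or zero handling from my pipeline.

```python
import numpy as np
import pandas as pd


def clr(counts: pd.DataFrame, pseudocount: float = 1.0) -> pd.DataFrame:
    """Centered log-ratio transform; rows are samples, columns are features."""
    # The pseudocount is an assumed way of handling zeros before the log.
    logged = np.log(counts + pseudocount)
    # Subtract each sample's mean log abundance (log of its geometric mean).
    return logged.sub(logged.mean(axis=1), axis=0)


def column_join(a: pd.DataFrame, b: pd.DataFrame,
                suffixes=("_mg", "_mt")) -> pd.DataFrame:
    """CLR-transform each dataset separately, then join on shared samples."""
    clr_a = clr(a).add_suffix(suffixes[0])
    clr_b = clr(b).add_suffix(suffixes[1])
    return clr_a.join(clr_b, how="inner")


def same_feature_pairs(a: pd.DataFrame, b: pd.DataFrame,
                       suffixes=("_mg", "_mt")):
    """Column-name pairs for features that occur in both datasets."""
    shared = [f for f in a.columns if f in b.columns]
    return [(f + suffixes[0], f + suffixes[1]) for f in shared]
```

The joined table is what goes into propr as `df`, and the list from `same_feature_pairs` is what I use to pick the same-feature entries out of the all-against-all results.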
In parallel, I've also run a "standard" correlation analysis of the CLR-transformed counts. For this I run Pearson correlation on features that are normally distributed in both compared datasets and Spearman correlation on the remaining features, followed by p-value adjustment (all in Python).
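Roughly, the standard analysis follows the logic sketched below. This is a simplified illustration: the Shapiro-Wilk test, the 0.05 normality threshold and the Benjamini-Hochberg adjustment are stand-ins chosen for the example, and the function names are made up, so treat it as a sketch of the approach rather than the exact calls.

```python
import numpy as np
from scipy import stats


def correlate_pair(x, y, alpha_normal=0.05):
    """Pearson if both vectors pass a normality test, otherwise Spearman."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Assumption: Shapiro-Wilk is used as the normality check.
    normal = (stats.shapiro(x)[1] > alpha_normal
              and stats.shapiro(y)[1] > alpha_normal)
    if normal:
        r, p = stats.pearsonr(x, y)
        return "pearson", r, p
    r, p = stats.spearmanr(x, y)
    return "spearman", r, p


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working from the largest p-value downwards.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out
```

Each same-feature pair goes through `correlate_pair`, the p-values from all pairs are adjusted together, and a pair counts as significant when its adjusted p-value is below 0.05.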
I've used an FDR of 0.05 to get rho cutoffs for propr, and have used the same value as the cutoff for adjusted p-values in the "standard" analysis. A comparison of the results shows that the methods are largely in agreement, at least in the comparison where the number of samples is large (n=19). In that case most features are marked as significantly proportional/correlated by both methods. The scatter plot below shows the coefficient from propr (y-axis) vs Pearson/Spearman (x-axis) for genome features (counts summed to the assigned genome). Colors indicate whether a genome is significant in neither, both or only one method.
While it's reassuring that the methods agree, I'm of course interested in the cases where they differ, and it'd be very interesting to hear your thoughts on this. Firstly, the number of samples seems to be an important factor, as propr finds almost no significant features in the comparisons with only 11 or 12 samples. Secondly, it seems that features identified only by propr are less abundant than features identified only by Pearson/Spearman.
I've tried to follow the propr source code to see how the rho matrix is calculated, but I can't tell whether it's a standard Spearman correlation of the transformed input or whether something else is also being done.
Thanks in advance!