propr icon indicating copy to clipboard operation
propr copied to clipboard

Vertical integration of datasets

Open johnne opened this issue 11 months ago • 3 comments

Thanks for a great package and very useful publications!

I'm working with a time-series of samples for which metagenomic (mg) and metatranscriptomic (mt) data has been generated using Illumina sequencing. Reads from all samples have been mapped to the same features (a collection of genes predicted in Metagenome Assembled Genomes (MAGs)) and these features have been annotated functionally with PROKKA and eggnog-mapper.

There are a total of 84 samples taken in the same location at different dates over three years and for 19 sampling dates there's both mg and mt data.

In one part of the project I'm trying to identify MAGs that are differentially abundant in the mt dataset compared to the mg dataset, i.e. whose transcriptional activity is significantly higher/lower. My first idea was to use ALDEx2 for this purpose which in my understanding is what you refer to in your Quinn et al 2019 paper under Vertical data integration:

This allows us to use ALDEx2 to find features where mRNA abundance changes more than protein abundance, relative to a common reference (and vice versa).

which in my case would translate to something like "find MAGs where mRNA abundance changes more than DNA abundance, relative to a common reference".

What I've done so far is:

  1. subset both datasets to the 19 dates with paired omics data (all samples combined into one counts table)
  2. sum raw counts for each MAG
  3. use the omics type as conditions ('mg' and 'mt')
  4. run a modular ALDEx2 analysis with paired t-test and effect size estimation

To me this seems like a reasonable approach but I'm not sure if I'm missing something? Is this more or less what you intended in the paper? Also, the results will be wrt the geometric mean of the full dataset, but is there a better reference to choose in this case?

I'm also wondering if the 'differential proportionality analysis' could be useful in this context? I've followed along the 1-pipeline.R script from the Supplementary information of your field-guide paper and have tried to understand how (if) it can be applied to my data.

The section of the script that runs:

# Get LPS-treated cells only
rna <- rnaseq.no0[rnaseq.annot$Treatment == "LPS",]
pro <- masshl.no0[masshl.annot$Treatment == "LPS",]

# Join as single matrix
merge <- rbind(rna, pro)
group <- c(rep("RNA", 14), rep("Protein", 14))

# Run propd analysis
pd.ms <- propd(merge, group)

and the downstream processing identifies a reference which the features are compared to. I'm not sure if that makes sense for my data because I would have to choose one MAG as a reference and compare the the other MAGs to that one. Maybe this makes sense, I just have a hard time understanding how to interpret something like that.

As an alternative, could I perhaps compare the gene-level abundances for each MAG in the mg and mt datasets and choose a house-keeping gene as a reference for each comparison? So the procedure would be something like:

  1. for each MAG (or a subset of MAGs) sum the counts in the mg and mt samples to the feature level (e.g. COGs or KEGG orthologs)
  2. run propd analysis on the feature-level data for the MAG, setting my mg and mt groups.
  3. explore the results using a house-keeping gene (such as Rpl30 or similar) as reference

I'm thinking this could be a complementary analysis to the ALDEx2 analysis I'm already doing where I'm looking at the MAGs as a whole.

As you might tell I'm struggling a bit here as I feel I almost grasp the concepts but not quite. I would be very grateful for any input you might have on this.

johnne avatar Jan 29 '25 14:01 johnne

G'day Johne,

Thanks for using propr.

Everything you've written makes sense to me, except I am unsure about this one point "Also, the results will be wrt the geometric mean of the full dataset, but is there a better reference to choose in this case?" I am not very familiar with this kind of experimental design, but I am used to taking a default approach to multi-omics data where I normalise each data set independently from one another. In pseudo-code, something like:

rbind(clr(data_set_1), clr(data_set_2))

as opposed to

clr(rbind(data_set_1, data_set_2))

By the way, my colleague Thomaz has recently written a new article on this topic that does a good job to set CoDA in a broader bioinformatics context. You might find it helpful https://www.nature.com/articles/s44220-023-00148-3 I'd recommend e-mailing him to see if he's written anything specifically on the topic MG/MT time-series, I believe it is (close to) his special interest. Me on the other hand, I've left academia for the public sector 3 years ago and am already getting rusty ^_^

tpq avatar Jan 29 '25 21:01 tpq

Thanks for the quick reply!

Yes your paper and the supplementary script also got me thinking I should run the CLR on each dataset separately first, thanks for confirming this. And thanks for the article tip, I'll look it up.

All the best, John

johnne avatar Jan 30 '25 06:01 johnne

Hi again, After reading up some more I think I have a better understanding of how to proceed to compare my metagenomic (mg) and metatranscriptomic (mt) counts, but wanted to check that I'm on the right track.

Quick recap: My mg and mt datasets have counts obtained by mapping reads to the same features, and for 19 of the sampling dates there's both mg and mt data, so for these 19 samples I want to calculate the proportionality of each feature in mg to the same feature in mt.

To use propr I would follow the "column join" strategy and, as you say, CLR-transform the mg and mt datasets separately, and do cbind(mg.clr, mt.clr) followed by running propr. Since I have the same features in both datasets I would have to suffix feature names in one dataset, e.g. adding .RNA to features in the mt dataset. I could then explore e.g. proportionality of featureA to featureA.RNA.

To calculate differential proportionality with propd I would follow the "row join" strategy (rbind(mg.clr, mt.clr)) and run propd with a group vector which describes the dataset type (group <- c(rep("mg", 19), rep("mt", 19))). I'd then find a reference proportional to CLR, and explore how features relate to this reference in the mg/mt datasets, similar to Figure 5 in "A field guide for the compositional analysis of any-omics data"

johnne avatar May 24 '25 05:05 johnne