
Using intersectRows when different names are used for the same entity

Open llrs opened this issue 6 years ago • 8 comments

I have one dataset of 16S sequencing of intestinal biopsies and another from stools, and the two end up with different OTUs. I can find which taxon each OTU belongs to. In phylogenetic analyses the taxonomy is usually merged into a single object (phyloseq, metagenomeSeq) that extends the rowData (I assume), or it could be stored in rowData, because the OTU names (I have OTU_1, OTU_2, ...) aren't really meaningful. What is meaningful is the taxonomy, which I have as a matrix inside those objects (phylo-class, MRexperiment-class).

See example output:

MR_i  ## And MR_s is a similar object
## MRexperiment (storageMode: environment)
## assayData: 499 features, 103 samples 
##   element names: counts 
## protocolData: none
## phenoData
##   sampleNames: 5.B009 4.B008 ... 103.B104 (103 total)
##   varLabels: Sample_Code Patient_ID ... ID (12 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: OTU_1 OTU_10 ... OTU_998 (499 total)
##   fvarLabels: Domain Phylum ... Species (7 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  

(MAE  <- MultiAssayExperiment(experiments = list("intestinal" = MR_i, "stools" = MR_s), colData = meta))
## A MultiAssayExperiment object of 2 listed
##  experiments with user-defined names and respective classes. 
##  Containing an ExperimentList class object of length 2: 
##  [1] intestinal: MRexperiment with 499 rows and 103 columns 
##  [2] stools: MRexperiment with 535 rows and 103 columns 
## Features: 
##  experiments() - obtain the ExperimentList instance 
##  colData() - the primary/phenotype DataFrame 
##  sampleMap() - the sample availability DataFrame 
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment 
##  *Format() - convert into a long or wide DataFrame 
##  assays() - convert ExperimentList to a SimpleList of matrices

When I build a MAE object with them and use intersectRows, I end up with the rows that share the same name but have different taxonomic classifications.

(b <- intersectRows(MAE))
## A MultiAssayExperiment object of 2 listed
##  experiments with user-defined names and respective classes. 
##  Containing an ExperimentList class object of length 2: 
##  [1] intestinal: MRexperiment with 235 rows and 103 columns 
##  [2] stools: MRexperiment with 235 rows and 103 columns 
## Features: 
##  experiments() - obtain the ExperimentList instance 
##  colData() - the primary/phenotype DataFrame 
##  sampleMap() - the sample availability DataFrame 
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment 
##  *Format() - convert into a long or wide DataFrame 
##  assays() - convert ExperimentList to a SimpleList of matrices
c(head(rownames(b)[[1]]), tail(rownames(b)[[1]]))
## [1] "OTU_1"   "OTU_10"  "OTU_100" "OTU_101" "OTU_102" "OTU_103" "OTU_94"  "OTU_95"  "OTU_96"  "OTU_97"  "OTU_98"  "OTU_99" 

However, OTU_1073 from the intestinal assay and OTU_1037 from the stools assay are the same species.

Could intersectRows use the rowData (or fvarLabels) of each experiment if available to reorder(?) and select the rows of the experiment?
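To make the request concrete, here is a minimal base R sketch of the matching I have in mind. The taxonomy tables, their contents, and the helper `tax_key` are made up for illustration; nothing here is part of MultiAssayExperiment. It assumes every rank is filled in and that each taxon appears at most once per experiment.

```r
# Hypothetical taxonomy tables; row names are the per-experiment OTU labels.
tax_i <- data.frame(
  Genus   = c("Bacteroides", "Escherichia", "Faecalibacterium"),
  Species = c("fragilis", "coli", "prausnitzii"),
  row.names = c("OTU_1", "OTU_2", "OTU_1073"),
  stringsAsFactors = FALSE
)
tax_s <- data.frame(
  Genus   = c("Escherichia", "Faecalibacterium", "Roseburia"),
  Species = c("coli", "prausnitzii", "hominis"),
  row.names = c("OTU_1", "OTU_1037", "OTU_2"),
  stringsAsFactors = FALSE
)

# Build one key per OTU by pasting the taxonomy columns together
# (assumes no missing ranks).
tax_key <- function(tax) {
  setNames(do.call(paste, c(tax, sep = ";")), rownames(tax))
}
key_i <- tax_key(tax_i)
key_s <- tax_key(tax_s)

# Taxa present in both experiments, then the OTU names that carry them,
# listed in the same taxon order for each experiment.
shared <- intersect(key_i, key_s)
rows <- list(
  intestinal = names(key_i)[match(shared, key_i)],
  stools     = names(key_s)[match(shared, key_s)]
)
rows
```

The result is one vector of row names per experiment, aligned by taxon rather than by OTU label, which is the kind of selection I would like intersectRows to be able to build from rowData.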

Also, if I have metagenomics and RNA-seq assays in the same object, I would like to tell intersectRows which experiments to subset by row. For example, I might be interested in just one phylum and want to relate it to the other assays in the experiment.

The package looks great, thanks for the effort!

llrs avatar Dec 04 '17 17:12 llrs

Hi Lluís, @llrs Thank you for the report. The assumption here is that all the objects in the ExperimentList support a rowData method. It would be good to make use of this data; perhaps we could add a byRowData argument. Regards, Marcel

LiNk-NY avatar Dec 07 '17 00:12 LiNk-NY

I tried building another object (SummarizedExperiment) with the same data:

(mae <- MultiAssayExperiment(list("intestinal" = SE_i, "stools" = SE_s)))
## A MultiAssayExperiment object of 2 listed
##  experiments with user-defined names and respective classes. 
##  Containing an ExperimentList class object of length 2: 
##  [1] intestinal: SummarizedExperiment with 532 rows and 178 columns 
##  [2] stools: SummarizedExperiment with 568 rows and 152 columns 
## Features: 
##  experiments() - obtain the ExperimentList instance 
##  colData() - the primary/phenotype DataFrame 
##  sampleMap() - the sample availability DataFrame 
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment 
##  *Format() - convert into a long or wide DataFrame 
##  assays() - convert ExperimentList to a SimpleList of matrices
colData(mae)
## DataFrame with 330 rows and 0 columns

But then my problem is how to encode the colData; see this question on the support site.

It might be for another enhancement but using each SummarizedExperiment's colData to create a common colData would simplify the creation of the MAE objects. It would have many caveats but maybe looking for common columns and creating a column for the row names of each sample in the SummarizedExperiment would work.
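As a rough illustration of that idea, the base R sketch below stacks the columns shared by two per-experiment annotation tables and records where each row came from. The tables and the column names `assay` and `colname` are my own choices for the example (they echo the sampleMap columns); this is not an existing MultiAssayExperiment feature, and it ignores caveats such as columns that share a name but hold different kinds of values.

```r
# Hypothetical per-experiment sample annotations; row names are sample names.
cd_i <- data.frame(Patient_ID = c("P1", "P2"), Age = c(34, 51),
                   Site = c("ileum", "colon"),
                   row.names = c("5.B009", "4.B008"))
cd_s <- data.frame(Patient_ID = c("P1", "P3"), Age = c(34, 47),
                   row.names = c("S.001", "S.002"))
col_data <- list(intestinal = cd_i, stools = cd_s)

# Keep only the columns that every experiment provides.
common <- Reduce(intersect, lapply(col_data, colnames))

# Stack the common columns, recording the experiment and the original
# sample name of each row.
stacked <- do.call(rbind, lapply(names(col_data), function(nm) {
  cd <- col_data[[nm]]
  data.frame(assay = nm, colname = rownames(cd), cd[, common, drop = FALSE],
             row.names = NULL, stringsAsFactors = FALSE)
}))
stacked
```

Here `Site` is dropped because only one experiment has it; whether to drop such columns or pad them with NA is one of the caveats that would need deciding.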

llrs avatar Dec 07 '17 10:12 llrs

@LiNk-NY I wonder if the enhancement should be more general than byRowData - how about function signatures for subsetByRow and subsetByColumn, where the function is something that will be applied to each list element? Something like:

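# Method for an ExperimentList and a function; `y` is applied to each experiment
setMethod("subsetByRow", c("ExperimentList", "function"), function(x, y) {
   # `y` should return, for each experiment, the rows to keep
   sublist <- lapply(x, y)
   # Hand the per-experiment selections back to subsetByRow
   subsetByRow(x, sublist)
})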

This could be used for subsetting by rowData (although with more complicated user syntax than a more specific subsetByrowData), but also for filtering by row means, variance, etc.
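A self-contained sketch of the same pattern on plain matrices, to show the kind of filtering function a user might pass. The helper names `subset_rows_by` and `high_var` are hypothetical, and the matrices only stand in for the elements of an ExperimentList.

```r
set.seed(1)  # only to make the toy data reproducible

# Plain matrices standing in for the experiments of an ExperimentList.
experiments <- list(
  intestinal = matrix(rnorm(20), nrow = 5,
                      dimnames = list(paste0("OTU_", 1:5), paste0("s", 1:4))),
  stools     = matrix(rnorm(12), nrow = 3,
                      dimnames = list(paste0("OTU_", 1:3), paste0("t", 1:4)))
)

# Hypothetical helper: `f` maps one experiment to the row names to keep.
subset_rows_by <- function(x, f) {
  keep <- lapply(x, f)
  Map(function(m, rows) m[rows, , drop = FALSE], x, keep)
}

# Example selector: rows whose variance is above the experiment's median.
high_var <- function(m) {
  v <- apply(m, 1, var)
  rownames(m)[v > median(v)]
}

filtered <- subset_rows_by(experiments, high_var)
vapply(filtered, nrow, integer(1))
```

The selector sees one experiment at a time, so the same mechanism would cover row means, variance, or a lookup in rowData, at the cost of the user writing the function.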

lwaldron avatar Dec 08 '17 17:12 lwaldron

I think Martin @mtmorgan would say, you want to define a method for a class rather than a function. And the desired functionality should either conform to the MultiAssayExperiment API or extend the class.

(Martin, feel free to chime in)

LiNk-NY avatar Apr 18 '18 23:04 LiNk-NY

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jan 02 '19 16:01 stale[bot]

It's been a while, but are there any updates?

I'm commenting to prevent the bot from closing the issue.

llrs avatar Jan 02 '19 16:01 llrs

Hi Lluís, @llrs

What you describe seems to require a row map structure where subsets can be done based on a third variable. We don't have something like that planned in the immediate future, although it is an important problem to tackle. FWIW, we do have helper functions to homogenize rows across experiments in TCGAutils (see symbolsToRanges and mirToRanges). Perhaps you can write a function that will do this for you in terms of matching and re-ordering OTU rows across experiments using a map. You could then use a list or List of row names to subset.

If you are working with a consistent number of samples ('colnames') and rows, it may also be worthwhile to look into data structures that make use of a row graph representation such as LoomExperiment.

Best regards, Marcel

LiNk-NY avatar Jan 02 '19 17:01 LiNk-NY

Just discussed this with @LiNk-NY. This should provide a workable solution with minimal change:

  • the subsetByRow() function should provide an i argument that allows you to specify which experiments will be subset, with the default being all.

Other helper functions subsetByRowData() and intersectByRowData() would also be useful. These would provide an additional argument for the column name of the rowData to use instead of row names. They would silently do nothing for any experiments that either 1) don't have rowData, or 2) don't have the specified colname in their rowData.
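A minimal base R sketch of the behaviour described for intersectByRowData(), including the i argument and the silent no-op. None of this exists yet: `make_exp`, `intersect_by_row_data`, and the toy data are hypothetical, and each experiment is modelled as a plain list rather than a real Bioconductor class.

```r
# Hypothetical stand-in for an experiment: a count matrix plus optional rowData.
make_exp <- function(counts, row_data = NULL) {
  list(counts = counts, rowData = row_data)
}

exps <- list(
  intestinal = make_exp(
    matrix(1:6, nrow = 3,
           dimnames = list(c("OTU_1", "OTU_2", "OTU_1073"), c("a", "b"))),
    data.frame(Species = c("fragilis", "coli", "prausnitzii"),
               row.names = c("OTU_1", "OTU_2", "OTU_1073"))),
  stools = make_exp(
    matrix(1:6, nrow = 3,
           dimnames = list(c("OTU_1", "OTU_1037", "OTU_2"), c("c", "d"))),
    data.frame(Species = c("coli", "prausnitzii", "hominis"),
               row.names = c("OTU_1", "OTU_1037", "OTU_2"))),
  rnaseq = make_exp(
    matrix(1:4, nrow = 2,
           dimnames = list(c("GENE_A", "GENE_B"), c("e", "f"))))
)

intersect_by_row_data <- function(x, column, i = names(x)) {
  # Only experiments named in `i` that carry `column` in their rowData take
  # part; all others are returned untouched.
  has_col <- vapply(x[i], function(e) column %in% colnames(e$rowData),
                    logical(1))
  usable <- i[has_col]
  if (length(usable) < 2) return(x)
  # Values of `column` shared by every participating experiment.
  shared <- Reduce(intersect, lapply(x[usable], function(e) {
    as.character(e$rowData[[column]])
  }))
  # Subset and reorder each participating experiment to the shared values
  # (assumes each value occurs at most once per experiment).
  for (nm in usable) {
    idx <- match(shared, x[[nm]]$rowData[[column]])
    x[[nm]]$counts  <- x[[nm]]$counts[idx, , drop = FALSE]
    x[[nm]]$rowData <- x[[nm]]$rowData[idx, , drop = FALSE]
  }
  x
}

res <- intersect_by_row_data(exps, "Species")
lapply(res, function(e) rownames(e$counts))
```

In this toy case the two OTU experiments are aligned by species, while the RNA-seq stand-in, which has no rowData, passes through unchanged.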

lwaldron avatar Jan 04 '19 15:01 lwaldron