Support SummarizedExperiment as input
Description of feature
SummarizedExperiment objects are a widely used Bioconductor data structure that keeps track of row annotations, column annotations and multiple data matrices. Using SummarizedExperiment avoids common pitfalls such as scrambling the order of the annotation data frames, or subsetting the matrix but not the annotations.
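As a rough illustration (in Python rather than R, with a hypothetical class name), the point of the structure is that subsetting the matrix and the annotations is a single operation, so they cannot drift out of sync:

```python
# Hypothetical, minimal Python analogy of a SummarizedExperiment:
# one object bundles the assay matrix with its row/column annotations,
# so a single subsetting operation keeps all three in sync.

class BundledExperiment:
    def __init__(self, assay, row_anno, col_anno):
        # assay: list of rows; row_anno: one dict per row; col_anno: one dict per column
        assert len(assay) == len(row_anno)
        assert all(len(r) == len(col_anno) for r in assay)
        self.assay, self.row_anno, self.col_anno = assay, row_anno, col_anno

    def subset(self, row_idx, col_idx):
        # Subset matrix and annotations together -- no chance of
        # filtering the counts but forgetting the sample sheet.
        return BundledExperiment(
            [[self.assay[i][j] for j in col_idx] for i in row_idx],
            [self.row_anno[i] for i in row_idx],
            [self.col_anno[j] for j in col_idx],
        )

se = BundledExperiment(
    assay=[[10, 0], [5, 7], [0, 3]],
    row_anno=[{"gene": "A"}, {"gene": "B"}, {"gene": "C"}],
    col_anno=[{"sample": "S1"}, {"sample": "S2"}],
)
sub = se.subset(row_idx=[0, 2], col_idx=[1])
print(sub.assay)                           # [[0], [3]]
print([r["gene"] for r in sub.row_anno])   # ['A', 'C']
```

With three separate TSV files, that same subset would be three independent filtering steps, any one of which can be forgotten.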
I would find it useful to support SummarizedExperiment RDS objects as an alternative to specifying a count matrix and row/column annotations as TSV/CSV files.
nf-core/rnaseq (and potentially other pipelines) already generates such an object. Using it as input would simplify things, since there is only a single file to specify instead of three.
@pinin4fjords, please LMK if you would be open to supporting that.
How would you see that being implemented? Would we have a process early on that splits out the components? I wouldn't want something that mandated R throughout the pipeline in order to allow this.
We just need to agree on one canonical format that's used everywhere. One of the first steps of the pipeline should be to convert whatever input format (SummarizedExperiment, CSV, TSV, SOFT, ...) to that format.
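To sketch that "convert once, up front" idea (hypothetical file names and structure, Python stdlib only): whatever the input format, the first pipeline step would emit one canonical trio of tables that everything downstream consumes:

```python
# Sketch (hypothetical names/paths) of the "normalise inputs first" idea:
# any supported input format is converted once into a canonical trio of
# TSVs (matrix, sample annotations, feature annotations) that all later
# steps consume.
import csv
import os
import tempfile

def write_canonical(outdir, matrix, samples, features):
    """Write matrix.tsv, samplesheet.tsv and features.tsv into outdir."""
    paths = {}
    for name, header, rows in [
        ("matrix.tsv",
         ["feature"] + [s["sample"] for s in samples],
         [[f["feature"]] + row for f, row in zip(features, matrix)]),
        ("samplesheet.tsv", list(samples[0]), [list(s.values()) for s in samples]),
        ("features.tsv", list(features[0]), [list(f.values()) for f in features]),
    ]:
        path = os.path.join(outdir, name)
        with open(path, "w", newline="") as fh:
            w = csv.writer(fh, delimiter="\t")
            w.writerow(header)
            w.writerows(rows)
        paths[name] = path
    return paths

outdir = tempfile.mkdtemp()
paths = write_canonical(
    outdir,
    matrix=[[10, 0], [5, 7]],
    samples=[{"sample": "S1", "condition": "treated"},
             {"sample": "S2", "condition": "control"}],
    features=[{"feature": "geneA"}, {"feature": "geneB"}],
)
print(sorted(paths))  # ['features.tsv', 'matrix.tsv', 'samplesheet.tsv']
```

A SummarizedExperiment-unpacking process would slot in as just another producer of this trio, alongside the existing TSV/CSV paths.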
If everything is R-based (I think right now it is?), I see some advantages to using SummarizedExperiment everywhere. There's, by the way, also interoperability with Python nowadays: https://github.com/BiocPy/rds2py
But if you prefer, we can also settle on tabular formats.
Main advantages of binary format IMO:
- no need to 'guess' the data types
- faster IO
- avoid mistakes with metadata handling when using SummarizedExperiment
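The "guessing data types" point shows up with any plain-text reader: a TSV carries no type information, so every field comes back as a string and each consumer has to re-infer (or hard-code) the types. A quick illustration with Python's stdlib `csv` module:

```python
# A TSV carries no type information: every field is read back as a string,
# so each tool must guess (or be told) that 'replicate' is an integer,
# 'condition' a category, and so on. A binary format stores types with the data.
import csv
import io

tsv = "sample\tcondition\treplicate\nS1\ttreated\t1\nS2\tcontrol\t2\n"
rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))

print(rows[0])                     # {'sample': 'S1', 'condition': 'treated', 'replicate': '1'}
print(type(rows[0]["replicate"]))  # <class 'str'> -- not int
```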
No, I don't want to lock the pipeline to using R everywhere. There are a number of processes not using it (GSEA, matrix filtering come to mind), and I don't want to cut us off from non-R modules in future, or start getting into hacky inter-language things (they never work that well).
At least in the first instance, I think this should be an unpacking of a SummarizedExperiment object into the tabular formats the workflow currently uses.
> start getting into hacky inter-language things (they never work that well).
I used to share that sentiment, but the situation has improved lately, by, well, not using hacks (such as reticulate or rpy2) anymore.
- rds2py is based on Python bindings to a C++ library that can read RDS natively
- anndataR uses native R libraries to read the HDF5 format
That said, I can live with using text formats. Despite their drawbacks, they are at least universally supported.
Would something like Parquet do the job too? It has better support across languages, etc. Not sure it helps a lot here, but it does speed things up: https://parquet.apache.org/
If we were going to use one of these formats within the workflow, I think AnnData would be the one to go with: good cross-language support, adoption etc.
But then we'd have to update all the modules, and force other users of those modules to use those formats. In general the files used with this workflow are not big enough to worry overmuch about I/O speed. I'd rather keep to simple formats until we have a very compelling reason not to.
One motivation would also be to make the interoperability with nf-core/rnaseq (and possibly other pipelines) simpler:
> **Outputs from nf-core/rnaseq and other tximport-processed results**
>
> The nf-core/rnaseq workflow incorporates tximport for producing quantification matrices. From version 3.12.2, it additionally provides transcript length matrices, which can be directly consumed by DESeq2 to model length bias across samples.
>
> To use this approach, include the transcript lengths file with the raw counts:
>
> ```
> --matrix 'salmon.merged.gene_counts.tsv'
> --transcript_length_matrix 'salmon.merged.gene_lengths.tsv'
> ```
IMO it would be much more convenient to specify the RDS object generated by rnaseq and automatically follow best practices.
We built it the way it is to make it somewhat agnostic to upstream pipelines. nf-core/rnaseq outputs RDS; many other things don't.
As I say, I'm happy to take RDS as a way of supplying pre-checked matrices, annotations etc., which we then unpack and pass to the existing input channels.