
Biological data integration: private + GEO/SRA/ArrayExpress data

Open joehand opened this issue 8 years ago • 12 comments

From @olgabot on June 18, 2014 5:58

Hello there, This is Olga Botvinnik and Mike Lovci (@mlovci), PhD students in Bioinformatics-type fields at UC-San Diego working on a biological data analysis package for "large" (for biology) datasets. I've looked through the repo a bit and have been wondering if this would help with our data woes.

We've been having trouble coming up with a way to store the data for our projects, which would hopefully include both unpublished data and publicly available data from databases such as the USA's Gene Expression Omnibus (GEO) and Europe's ArrayExpress. R's Bioconductor has a pretty nice schema for a single experiment, where each experiment has not only data but also:

  • pData (phenotype data), describing the samples in the dataset, e.g. if they're from different celltypes or different timepoints, or one is a replicate of another. This is probably most analogous to metadata as described here.
  • fData (feature data), describing the features, e.g. if it's gene expression, what kind of gene it is (like an enzyme vs structural gene)

However, this is only for one datatype at a time (plus it's in R and we prefer Python), and ultimately, we'd like to ask the killer biological questions which integrate all these data types at once. For example, we'd want to mix together gene expression and DNA mutation data and see what mutations lead to changes in gene expression, and right now the only way to do that is with hella data munging and lookups and crazy queries across data types.
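To make the "mix together gene expression and DNA mutation data" question concrete, here is a minimal sketch in plain Python. It assumes both data types are keyed by a shared sample ID. The sample names, gene names, values, and the helper `expression_by_mutation_status` are all hypothetical, made up for illustration.

```python
from statistics import mean

# Hypothetical toy data: expression[sample][gene] -> expression level
expression = {
    "s1": {"GENE_A": 10.0, "GENE_B": 2.0},
    "s2": {"GENE_A": 12.0, "GENE_B": 8.0},
    "s3": {"GENE_A": 11.0, "GENE_B": 9.0},
}

# Hypothetical toy data: mutations[sample] -> set of genes mutated in that sample
mutations = {"s1": set(), "s2": {"GENE_B"}, "s3": {"GENE_B"}}


def expression_by_mutation_status(gene, expression, mutations):
    """Split one gene's expression values by whether that gene is mutated in each sample."""
    groups = {"mutated": [], "wild_type": []}
    # Only samples present in BOTH data types can be compared.
    for sample in sorted(expression.keys() & mutations.keys()):
        if gene not in expression[sample]:
            continue
        status = "mutated" if gene in mutations[sample] else "wild_type"
        groups[status].append(expression[sample][gene])
    return groups


groups = expression_by_mutation_status("GENE_B", expression, mutations)
# Mean expression per group, skipping groups with no samples
summary = {status: mean(values) for status, values in groups.items() if values}
print(summary)
```

Even in this toy version, the join depends on both tables using the same sample identifiers, and that is the part that takes the munging when the data comes from different sources.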

What we'd like to have: For a single "biological study" which addresses some biological question (e.g. "how does mutation affect gene expression?"), be able to pull down the following, reproducibly:

  • All data produced by all experiments (unique for each study)
    • gene expression data for all samples, possibly across different timepoints
    • mutation data for all samples
    • ... any other relevant data (infinite possibilities) ...
  • All "phenotype data" aka metadata about the experiment
  • All "feature data" (not necessarily unique for each study)
    • metadata about all genes in current human genome build (yes there are builds of the human genome :) )
    • metadata about all currently known mutations in human genome
    • ... metadata on remaining datatypes ...
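The wish list above could be sketched as a single container, roughly like this. It is only an illustration of the shape: the `Study` class, its field names, and its methods are hypothetical and are not an existing dat or Bioconductor API.

```python
from dataclasses import dataclass, field


@dataclass
class Study:
    """Hypothetical container for one biological study (illustrative only)."""

    name: str
    # Phenotype data: sample ID -> attributes (cell type, timepoint, ...)
    phenotype_data: dict = field(default_factory=dict)
    # Experimental data: data type -> {sample ID -> {feature ID -> value}}
    data: dict = field(default_factory=dict)
    # Feature data: data type -> {feature ID -> attributes}; may be shared across studies
    feature_data: dict = field(default_factory=dict)

    def add_data(self, data_type, measurements, feature_data=None):
        """Register one data type, checking every sample has phenotype data."""
        unknown = set(measurements) - set(self.phenotype_data)
        if unknown:
            raise ValueError(f"samples without phenotype data: {sorted(unknown)}")
        self.data[data_type] = measurements
        if feature_data is not None:
            self.feature_data[data_type] = feature_data

    def samples_where(self, **attrs):
        """Return sorted sample IDs whose phenotype data matches all given attributes."""
        return sorted(
            sample
            for sample, meta in self.phenotype_data.items()
            if all(meta.get(key) == value for key, value in attrs.items())
        )


# Made-up example values
study = Study(
    name="mutation_vs_expression",
    phenotype_data={
        "s1": {"celltype": "fibroblast", "timepoint": 0},
        "s2": {"celltype": "neuron", "timepoint": 0},
    },
)
study.add_data(
    "expression",
    {"s1": {"GENE_A": 10.0}, "s2": {"GENE_A": 3.0}},
    feature_data={"GENE_A": {"kind": "enzyme"}},
)
fibroblasts = study.samples_where(celltype="fibroblast")
```

Keeping feature data separate from the per-study measurements reflects the point that it is "not necessarily unique for each study": the same gene annotations could be referenced by many studies.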

An example of how wild and wacky these experiments can get is a similar package written in R (made only for outputs of specific bioinformatics programs, and is not open-source) which has this data schema: http://compbio.mit.edu/cummeRbund/images/plots/cuffData_schema.pdf

Ideally, this would also include hooks for auto-downloading and generating compatible datasets from gene expression data deposited into GEO and ArrayExpress (mentioned above), and across different species, so we could compare our data to published data in human, and also look at studies from mice (in vivo, live mice) vs cells in humans (in vitro, usually samples taken from cutting off a tiny piece of skin from a person).

Additionally, I got some funding to do this kind of thing so email me ([email protected]) if you think our use case is applicable to dat.

Copied from original issue: maxogden/dat#129

joehand, Jun 17 '16 18:06