
Bioinformatics use case (RNA-Seq analysis)

Open joehand opened this issue 8 years ago • 25 comments

From @olgabot on July 6, 2014 22:55

Hi @jbenet and @maxogden! Thank you so much for the time you took to meet with @mlovci and me this weekend. Here's an overview of our current data management situation and what our ideal case would be.

What we have now

Currently, we host a datapackage.json file containing resources named "experiment_design" (metadata on the samples, e.g. celltype and colors to plot them with), "expression" (gene expression data), and "splicing" (scores of which version of a gene was used). At the end of the file is an attribute called "species" (e.g. "hg19" for human genome build 19). It only works with hg19 and mm10 (Mus musculus, aka house mouse, genome build 10) because it points to the URL "http://sauron.ucsd.edu/flotilla_projects/<SPECIES>/datapackage.json", which we hand-curated. So we can only grab the related data if our data comes from one of these two species.
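To make the limitation concrete, the "species" lookup amounts to roughly the following. This is a minimal sketch, not flotilla's actual code: the function name species_datapackage_url and both constants are hypothetical, and only the URL pattern and the two supported builds come from the description above.

```python
# Only these genome builds have hand-curated packages on the server.
SUPPORTED_SPECIES = {"hg19", "mm10"}

# URL pattern from the description above, with <SPECIES> as a format field.
URL_TEMPLATE = "http://sauron.ucsd.edu/flotilla_projects/{species}/datapackage.json"


def species_datapackage_url(species):
    """Return the datapackage.json URL for a supported genome build."""
    if species not in SUPPORTED_SPECIES:
        raise ValueError(
            f"No hand-curated datapackage for {species!r}; "
            f"supported builds: {sorted(SUPPORTED_SPECIES)}"
        )
    return URL_TEMPLATE.format(species=species)
```

Any species outside the hand-curated set fails immediately, which is the gap the rest of this issue is about.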

Try this:

On a command line:

git clone git@github.com:YeoLab/flotilla
cd flotilla
pip install -e .

In Python:

import flotilla
study = flotilla.embark("http://sauron.ucsd.edu/flotilla_projects/neural_diff_chr22/datapackage.json")

This will load the data from our server at sauron.ucsd.edu; since you haven't downloaded anything with that filename yet, it will download it first. This is a test dataset with only information from human chromosome 22, so it is loadable on a regular laptop. Feel free to look through the code and JSON files. flotilla.data_model.Study.from_data_package does most of the heavy lifting in loading the data. Keep in mind that the project is in a pre-alpha stage and has a long way to go :)

What we would like

Two major issues are:

  • Get the data in the neural_diff_chr22 datapackage into a pandas.DataFrame object which can then be imported into flotilla.
    • Currently this is managed by the URL listed for that resource in the datapackage.json file, but it should check locally first, so that the data can be loaded offline if you have already downloaded it.
  • Grab related data, e.g. descriptions of genes and their functions: given an ID like ENSG00000100320, get the "gene symbol" (i.e. the familiar name that we know it by), RBFOX2, and the fact that this gene is an RNA-binding protein involved in alternative splicing and neural development.
    • Currently this is managed by the "species" attribute, but ideally it would be something like ENSEMBL_v75_homo_sapiens, which would link to the human data here: http://uswest.ensembl.org/info/data/ftp/index.html and then grab gene annotation (gtf files) and sequence information (fasta files) as needed by the analysis.
    • Relatedly, there is apparently an "eHive" system on ENSEMBL for data processing. I haven't explored it yet, but it may be good to be aware of.
    • Another major issue is how to merge analyses of different species' data. For example, the ENSEMBL website has mappings of human and mouse versions of genes that we could use to compare gene expression. Plus there's the HAVANA project which categorizes orthologous (evolutionarily related) genes between different vertebrates. But what if I want to compare across non-traditional species? And many of them, not just between two? I would like to be able to easily grab these data, submit a job (either to our local supercomputer or to Amazon AWS) which runs a script that outputs a mapping with some unique keys that you could merge all your different data on.
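The "check locally first" behavior from the first issue could look roughly like this. It is a minimal sketch under stated assumptions, not what flotilla does today: cached_path, the cache directory layout, and the injectable fetch argument are all hypothetical, and it uses only the Python standard library.

```python
import os
import urllib.request
from urllib.parse import urlparse


def cached_path(url, cache_dir, fetch=urllib.request.urlretrieve):
    """Return a local path for `url`, downloading only if it is not cached."""
    parsed = urlparse(url)
    # Mirror the URL's host and path under cache_dir so that files with the
    # same basename from different projects do not collide.
    local = os.path.join(cache_dir, parsed.netloc, parsed.path.lstrip("/"))
    if not os.path.exists(local):
        os.makedirs(os.path.dirname(local), exist_ok=True)
        fetch(url, local)  # network access happens only on a cache miss
    return local
```

Once a file is in the cache, later calls never touch the network, so the data stays loadable offline. The returned path could then be handed to a reader such as pandas.read_csv to get a pandas.DataFrame, assuming the resource is a CSV file.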

Ideally, we could do something like this:

study = flotilla.embark('neurons')

This would fetch our mouse and human neuron data, which would have some kind of link to ENSEMBL, attach all the relevant metadata about mouse and human genes, and give common keys where possible.
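As a rough illustration of what "common keys" could mean, here is a minimal sketch that joins per-gene values from any number of species on a shared key. Everything in it is an assumption for illustration: the function merge_on_common_keys, the shape of the ortholog table, the placeholder mouse gene IDs, and the expression values are all made up. Only the human ID ENSG00000100320 and its symbol RBFOX2 come from the text above.

```python
def merge_on_common_keys(tables, ortholog_groups):
    """Join per-gene values from several species on a shared key.

    tables: {species: {gene_id: value}}
    ortholog_groups: {common_key: {species: gene_id}}
    Returns {common_key: {species: value}}, keeping only the groups
    that have a value in every species.
    """
    merged = {}
    for key, ids in ortholog_groups.items():
        row = {
            species: table[ids[species]]
            for species, table in tables.items()
            if ids.get(species) in table
        }
        if len(row) == len(tables):  # drop groups missing in any species
            merged[key] = row
    return merged


# Toy inputs: the mouse IDs and all expression values are made up.
ortholog_groups = {
    "RBFOX2": {"human": "ENSG00000100320", "mouse": "MOUSE_GENE_ID_1"},
    "UNMATCHED": {"mouse": "MOUSE_GENE_ID_2"},
}
tables = {
    "human": {"ENSG00000100320": 12.5},
    "mouse": {"MOUSE_GENE_ID_1": 9.0, "MOUSE_GENE_ID_2": 3.0},
}
merged = merge_on_common_keys(tables, ortholog_groups)
```

With pandas, the same join could be expressed with DataFrame.merge on the common key column. The hard part, producing the ortholog table for non-traditional species, is the job described in the last bullet above.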

@mlovci - please add on with anything I missed.

Copied from original issue: maxogden/dat#135

joehand, Jun 17 '16 18:06