
feature request from christ

Open tengfei opened this issue 11 years ago • 0 comments

Hi

Thanks for replying, I'll try to explain what I'm doing and why. The group I work in is producing a lot of DNA sequencing data from cancer samples. When the number of samples is large enough, you start to see that certain genes are recurrently mutated: classic oncogenes and tumour suppressors like TP53, BRCA2, KRAS and so on, but also some new things. We're going to have a lot of data, and I want to give the postdocs a nice way of accessing it on their own, without me having to produce their plots for them. We've got an internal website (just a Linux server with Apache installed and some MySQL databases), and in the past we've used it to allow access to our expression-array and survival data, so users can explore the survival of different patients or the expression of their genes.

This is done using a mix of PHP, Perl and R, but I'm going to (hopefully) move things to Shiny this year (http://www.rstudio.com/shiny/). I've tested this with ggplot2 (thanks to the link below) and it seems to work pretty well, so I was hoping to plug ggbio in there and I'll be happy.

http://grrrraphics.blogspot.co.uk/2013/01/shiny-desolve-and-ggplot-play-nicely.html

The diagram I sent you, from a paper I saw recently, is a simple representation of mutations detected in TP53. You can have different mutations at the same site (at different frequencies or in different patients), so multiple labels and/or symbols are fairly important. The diagram has no protein domains listed on it, but this data could be pulled out of UniProt, probably using the biomaRt package from Bioconductor.

http://bioconductor.org/packages/release/bioc/html/biomaRt.html
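The point about several mutations sharing one site can be made concrete with a small grouping step. What follows is only a sketch under stated assumptions: it is written in Python rather than R to keep it short, the records and patient IDs are made up for illustration, and `group_by_site` is a hypothetical helper, not part of ggbio or biomaRt. It shows one way to turn per-patient calls into one entry per site, with a count per distinct change that could drive the labels or symbols.

```python
from collections import defaultdict

# Illustrative records only: (patient_id, protein_position, amino_acid_change).
# The patient IDs and their assignments are made up for this sketch.
calls = [
    ("P01", 175, "R175H"),
    ("P02", 175, "R175H"),
    ("P03", 248, "R248Q"),
    ("P04", 248, "R248W"),
    ("P05", 248, "R248Q"),
    ("P06", 273, "R273H"),
]


def group_by_site(calls):
    """Map each position to {change: number of distinct patients carrying it}."""
    patients = defaultdict(lambda: defaultdict(set))
    for patient, position, change in calls:
        patients[position][change].add(patient)
    # Sort positions and changes so the plotting order is stable.
    return {
        pos: {change: len(ids) for change, ids in sorted(changes.items())}
        for pos, changes in sorted(patients.items())
    }


sites = group_by_site(calls)
for pos, changes in sites.items():
    # One label per site, listing every change seen there with its frequency.
    label = ", ".join(f"{change} (n={n})" for change, n in changes.items())
    print(pos, label)
```

Keeping the counts per change, rather than per site, is what lets a plot draw more than one label or symbol at the same position.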

An example of the domains of TP53 is shown here. I guess I'm interested in something like the "sequence features summary" at the top, but with a more square and angular look:

http://www.ebi.ac.uk/interpro/IProtein?ac=P04637

The fullest description of proteins is given by UniProt, which aggregates various databases:

http://www.uniprot.org/uniprot/P04637

The "regions" subsection of the "sequence annotation" section is particularly useful, as it contains a graphical representation of the regions (shown as green rectangles on the right-hand side). Finally, the "secondary structure" of the protein is also pretty useful, as it shows where the predicted helices, turns and strands are.

Our data is stored in standard VCF files (described below), which contain a wealth of information. You can imagine them as a single row per mutation, with an extra column to show the genotype of each new sample (i.e. an n by m matrix, where n = number of mutations and m = number of samples). As these files can get very big (several GB), I was planning on putting them in a database and pulling out only the data we want to plot, rather than holding GB of data in enormous R data frames, as I do every day when working with the whole data set.

http://en.wikipedia.org/wiki/Variant_Call_Format
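As a rough illustration of the layout and the database idea above, here is a minimal sketch. It makes several assumptions that are not in the original plan: it is written in Python rather than R, it uses the standard-library `sqlite3` module as a stand-in for the MySQL server, the VCF fragment and sample names are made up, and `parse_vcf`, `load` and `carriers` are hypothetical helper names. It reads the n by m genotype matrix, stores it as one row per mutation and sample, and then fetches only the non-reference calls in one region.

```python
import sqlite3

# A tiny made-up VCF fragment; real files have far more header lines and detail.
VCF_TEXT = """\
##fileformat=VCFv4.1
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3
17\t7578406\t.\tC\tT\t50\tPASS\t.\tGT\t0/1\t0/0\t0/0
17\t7577538\t.\tC\tT\t60\tPASS\t.\tGT\t0/0\t0/1\t0/1
12\t25398284\t.\tC\tA\t45\tPASS\t.\tGT\t0/0\t0/0\t0/1
"""


def parse_vcf(text):
    """Return (sample_names, rows); each row is (chrom, pos, ref, alt, genotypes)."""
    samples, rows = [], []
    for line in text.splitlines():
        if line.startswith("##") or not line.strip():
            continue  # skip meta-information and blank lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns follow the 9 fixed columns
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        # Keep only GT, which the VCF spec places first in FORMAT when present.
        genotypes = [g.split(":")[0] for g in fields[9:]]
        rows.append((chrom, int(pos), ref, alt, genotypes))
    return samples, rows


def load(conn, samples, rows):
    """Store the matrix in long form: one row per (mutation, sample)."""
    conn.execute(
        "CREATE TABLE calls (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, "
        "sample TEXT, genotype TEXT)"
    )
    conn.executemany(
        "INSERT INTO calls VALUES (?, ?, ?, ?, ?, ?)",
        [(c, p, r, a, s, g) for c, p, r, a, gts in rows for s, g in zip(samples, gts)],
    )
    conn.execute("CREATE INDEX idx_region ON calls (chrom, pos)")


def carriers(conn, chrom, start, end):
    """Fetch only the non-reference calls in a region, i.e. what a plot needs."""
    cur = conn.execute(
        "SELECT pos, ref, alt, sample FROM calls "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? "
        "AND genotype NOT IN ('0/0', '0|0', './.') "
        "ORDER BY pos, sample",
        (chrom, start, end),
    )
    return cur.fetchall()


samples, rows = parse_vcf(VCF_TEXT)
conn = sqlite3.connect(":memory:")
load(conn, samples, rows)
# An arbitrary window on chromosome 17, chosen to cover the two made-up rows.
region_hits = carriers(conn, "17", 7_570_000, 7_600_000)
print(region_hits)
```

The long table trades some storage for simple queries: a plot for one gene becomes a single indexed range lookup, and the full matrix never has to be loaded into memory.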

I hope this email was useful and not too long or too detailed. To the best of my knowledge there are no good tools to produce these kinds of plots, and they would look particularly good when rendered using ggplot.

Enjoy your weekend,
Chris

tengfei, May 10 '13 18:05