studyGroup
DRY data munging in Genomics : application to Variant Calling
This is a course draft. No dates are fixed yet, but this will change soon. Comments are welcome.
Do-Not-Repeat-Yourself
In this three-part hacky hour series, we will learn how to get from raw data to information. We will develop a standard pipeline in Python that we can run locally as well as on a cluster for more robustness.
Our hacky hour is applied to a specific domain, but the principles are transferable to any other field.
What will we be doing ?
We will be managing raw data: DNA reads from sequencers, in FASTQ format. For simplicity's sake we will be using Illumina MiSeq data, because these datasets are usually smaller and more suitable for a tutorial or hack day.
We will learn how to :
(week) DAY 1
- write a python pipeline using Ruffus
- wrap publicly available tools
- code versioning using git
(week) DAY 2
- developing a pipeline to find somatic mutations in cancer DNA
- scaling the pipeline for cluster use (support high volume of data)
(week) DAY 3
- Plotting the data using matplotlib
- Reproduce plots using R's ggplot2
Bonus
- Using Docker to ship our code/results
- Public repo with the tutorial's code
- Follow up for problem solving using CodersCrowd
What environment should we use ?
The environment should be specific to each analysis. For our study group, we will be using virtualenv, and I will show you how we can even exchange these environments with other developers.
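A common way to exchange an environment is to record pinned package versions (this is what `pip freeze > requirements.txt` produces) so that someone else can reinstall them with `pip install -r requirements.txt`. The sketch below does something similar from inside Python, assuming Python 3.8 or later; `pinned_requirements` is a hypothetical helper name used only for illustration, not part of pip or virtualenv.

```python
from importlib.metadata import distributions


def pinned_requirements():
    """Return sorted 'name==version' lines for the packages visible
    to the current interpreter (roughly what `pip freeze` lists)."""
    pins = set()
    for dist in distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with missing metadata
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    # Redirect this output to requirements.txt to share the environment.
    print("\n".join(pinned_requirements()))
```

Run inside an activated virtualenv, this lists only the packages installed in that environment, which is what makes the file worth sharing.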
What should we know a priori for this hacky hour ?
A little bit of Python, an object-oriented spirit, and some cookies and coffee.
What should we be able to do after this course ?
Something similar to this :
allele frequencies clustering (positions vs samples)
coverage across positions violin plot
target capture CDF
Zygosity matrix
Coverage heatmap across amplicon regions
In target vs Out of targets
Mapped Unmapped Reads
Mapping Qualities
This would be great! I'm really looking forward to learning how to handle DNA sequencing data (all my experience is with RNA-Seq).
Out of curiosity, what are you using to map and process the reads? The reason I ask is that I'm anticipating Windows users having compatibility issues with UNIX-only parts of the pipeline.
@kazi11 I will be using bowtie2, and this will be part of the "wrap publicly available tools" section. That said, it can be bwa as well, or any other short read aligner.
I never used these tools on Windows, but I don't expect issues with it. That will be a nice test as well for reproducibility :)
We can use Docker on Windows too, that way we can all use the same OS !
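As a rough illustration of what wrapping a publicly available tool could look like, here is a minimal sketch using only the Python standard library. It builds a single-end bowtie2 command (`-x` index prefix, `-U` reads, `-S` SAM output, `-p` threads) and runs it with `subprocess`. The function names and file paths are hypothetical, and this is not necessarily how the course pipeline will do it.

```python
import shutil
import subprocess


def bowtie2_command(index_prefix, fastq, sam_out, threads=1):
    """Build the argument list for a single-end bowtie2 alignment."""
    return [
        "bowtie2",
        "-p", str(threads),   # number of alignment threads
        "-x", index_prefix,   # prefix of the bowtie2 index files
        "-U", fastq,          # unpaired reads in FASTQ format
        "-S", sam_out,        # output alignments in SAM format
    ]


def run_tool(cmd):
    """Run a wrapped command, failing loudly if the tool is missing
    or exits with a non-zero status."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical paths, shown only to illustrate the call.
    cmd = bowtie2_command("ref/genome", "sample.fastq", "sample.sam", threads=4)
    print(" ".join(cmd))
```

Keeping command construction separate from execution makes the wrapper easy to test on a machine where the aligner is not installed, and swapping in bwa or another aligner only means writing another command builder.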
Huh, so I just checked the bowtie2 documentation and it totally runs on Windows. Tophat2 (an RNA-seq aligner that uses bowtie2) doesn't, so I assumed that bowtie2 had the same problem. Apparently not (false alarm!).
If you have the right compiler and the source code, I guess it should work on Windows (though compiling from source is sometimes easier said than done).
Hey @radaniba,
This looks awesome! Couple of comments:
- I see you've got version control listed as 1/3 of one lesson; we spent a whole hour on this just a few weeks ago. Is it possible to teach this lesson without git? You know I'm a huge fan of git, but maybe best to tackle one thing at a time, and keep the focus on the main material.
- This looks (to me) like a lot of heavy machinery - what's the plan to help as many people as possible participate? It'd be good to start by identifying all the dependencies we'll need, and figure out what we can do to smooth installation / setup (kind of sounds like you already have a plan though :)
- This is pretty python heavy; I'm a huge python fan, but I think a lot of people come particularly for the R (though I could be wrong! If anyone wants to do some python, say so here!). Would it be possible to explore these same ideas in R?
- This looks really genomics specific (which is great, there are tons of genomics people in the group!); will non-genomics people get much out of this? I'd be delighted to have a series of genomics-targeted events, but I just want to make sure we communicate that to people if that's the case, so that everyone knows what they're getting into.
Hi @BillMills ,
Thanks for your comments
> I see you've got version control listed as 1/3 of one lesson; we spent a whole hour on this just a few weeks ago. Is it possible to teach this lesson without git? You know I'm a huge fan of git, but maybe best to tackle one thing at a time, and keep the focus on the main material.
You're right, I agree. If this saves us precious time, that would be great. There is a lot of material in there, and I am aware it takes a lot of time to get through that kind of program.
> This looks (to me) like a lot of heavy machinery - what's the plan to help as many people as possible participate? It'd be good to start by identifying all the dependencies we'll need, and figure out what we can do to smooth installation / setup (kind of sounds like you already have a plan though :)
Yes, I plan to allocate some time for participants to install the dependencies. It's not a big deal though: it's the usual process with pip and virtualenv, and the whole thing should take 10 minutes or so.
> This is pretty python heavy; I'm a huge python fan, but I think a lot of people come particularly for the R (though I could be wrong! If anyone wants to do some python, say so here!). Would it be possible to explore these same ideas in R?
That would be great (besides, this is the subject of a new issue I will open here). That said, I have always written pipelines in Python, and I am not aware of R frameworks equivalent to the make-like family of utilities that make it possible to 'easily' develop a pipeline (the whole purpose is to make participants familiar with automation). If anyone is aware of R packages for building pipelines, please let me know.
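To make the "make-like" idea concrete: such tools rerun a step only when its output is missing or older than its input. The following is a minimal standard-library sketch of that rule, not Ruffus's actual API; `needs_update`, `run_step`, and the toy `uppercase_copy` step are hypothetical names chosen for illustration.

```python
import os


def needs_update(input_path, output_path):
    """Make-style rule: rebuild if the output is missing or older
    than the input it was derived from."""
    if not os.path.exists(output_path):
        return True
    return os.path.getmtime(output_path) < os.path.getmtime(input_path)


def run_step(step, input_path, output_path):
    """Run `step` only when the output is out of date.
    Returns True if the step ran, False if it was skipped."""
    if needs_update(input_path, output_path):
        step(input_path, output_path)
        return True
    return False


def uppercase_copy(input_path, output_path):
    """Toy step standing in for a real tool such as an aligner."""
    with open(input_path) as src, open(output_path, "w") as dst:
        dst.write(src.read().upper())
```

Calling `run_step(uppercase_copy, "in.txt", "out.txt")` twice in a row does the work only once; a pipeline framework applies the same check across a whole chain of steps so that an interrupted run can resume where it stopped.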
> This looks really genomics specific (which is great, there are tons of genomics people in the group!); will non-genomics people get much out of this? I'd be delighted to have a series of genomics-targeted events, but I just want to make sure we communicate that to people if that's the case, so that everyone knows what they're getting into.
Yes it is (my daily routine). This is applied, but the protocol can fit any other application. I would be more than happy to collaborate with someone from another field and do the same (even though it will be out of my comfort zone). This would demonstrate that the process is reproducible to some extent (any ideas are welcome, I love challenges :) )
I agree with you, this is a bit too specific. Why don't we make this the final step after some shorter intermediate sessions? I am thinking of:
- Writing simple pipelines in Python
- Data analysis with R and Python together
- Data visualization in R and Python
We can break this into smaller units before attacking such heavy processes.
Does this make sense ?
PS : see my next issue
Yeah, I don't think there are any R frameworks for pipeline programming. But hey, I really want to learn how to throw together a pipeline in Python (could be a really nice alternative to the rather finicky shell scripts I've been writing).
And @BillMills, I think there would be a lot of useful information here even for non-genomics peeps. Writing pipelines has a lot of utility for a large number of fields. Genomics is just a really good example to learn from.
Is it possible to throw together a Python pipeline without basic Python knowledge?
Yes, with a python crash course this is totally doable
Cool that sounds great! Really interested in Rython tool kit as well. Let's schedule a session soon?
So how about these sessions - could be a good opportunity to do some events at BC Cancer.
Oh that would be great, maybe @minisciencegirl can check with the VanBug team ( @minisciencegirl are you part of their core team ? ). They have easy access to facilities here, so it can be a good start.
Hi Rad,
I am on their Dev team. I can check with Kirin and Will about booking rooms. Let me know a time that works and I can get the ball rolling. Would it be possible to schedule these for later in the afternoon?
Cheers,
Amy
On Jun 23, 2015, at 9:32 AM, Radhouane Aniba [email protected] wrote:
Oh that would be great, may be @minisciencegirl can check with VanBug team ( @minisciencegirl are you part of their core team ? ) they have easy access to facilities here so it can be a good start