studyGroup DRY data munging in Genomics : application to Variant Calling

This is a course draft. No dates are fixed yet, but this will change soon. Comments are welcome.

Do-Not-Repeat-Yourself

In this three-part hacky hour series we will learn how to get from raw data to information. We will develop a standard pipeline in Python that we can run locally as well as on a cluster for more robustness.

Our hacky hour is applied to a specific domain, but the principles are transferable to any other field.

What will we be doing ?

We will be managing raw data: DNA reads out of sequencers in FASTQ format. For simplicity's sake we will be using Illumina MiSeq data, because these datasets are usually smaller and better suited to a tutorial or hack day.

We will learn how to :

(week) DAY 1

write a Python pipeline using Ruffus
wrap publicly available tools
version code using git
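
To give a feel for what a make-like pipeline stage does, here is a minimal sketch in plain Python rather than Ruffus itself: a stage wraps an external command and runs it only when its output is missing or older than its input. The function names and the line-counting stand-in command are hypothetical and chosen only for illustration; in the course the wrapped command would be a real tool.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def is_up_to_date(source: Path, target: Path) -> bool:
    """Return True when target exists and is at least as new as source."""
    return target.exists() and target.stat().st_mtime >= source.stat().st_mtime


def run_stage(command: List[str], source: Path, target: Path) -> bool:
    """Run command only if target is missing or older than source.

    Returns True if the command ran, False if the stage was skipped.
    """
    if is_up_to_date(source, target):
        return False
    # check=True raises CalledProcessError if the wrapped tool fails.
    subprocess.run(command, check=True)
    return True


def count_lines_command(source: Path, target: Path) -> List[str]:
    """Hypothetical stand-in for an external tool: writes the line count of source to target."""
    script = (
        "import sys; "
        "n = sum(1 for _ in open(sys.argv[1])); "
        "open(sys.argv[2], 'w').write(str(n))"
    )
    return [sys.executable, "-c", script, str(source), str(target)]
```

Calling run_stage twice in a row on the same files runs the command the first time and skips it the second time, which is the "do not repeat yourself" behaviour a pipeline framework automates across many stages.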

(week) DAY 2

develop a pipeline to find somatic mutations in cancer DNA
scale the pipeline for cluster use (to support high volumes of data)
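
As a rough local stand-in for scaling, the sketch below fans the same per-sample step out over several samples at once using the standard library. It is not how jobs are submitted to a cluster scheduler; it only illustrates that independent samples can be processed concurrently. The names run_per_sample and fake_variant_call are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def run_per_sample(
    process: Callable[[str], str],
    samples: Iterable[str],
    max_workers: int = 4,
) -> Dict[str, str]:
    """Apply process to every sample concurrently and return {sample: result}."""
    samples = list(samples)
    # Threads are enough when each job mostly waits on an external program.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(process, samples))  # map preserves input order
    return dict(zip(samples, results))


def fake_variant_call(sample: str) -> str:
    """Hypothetical placeholder for a per-sample step; returns an output file name."""
    return f"{sample}.vcf"
```

For example, run_per_sample(fake_variant_call, ["tumour", "normal"]) returns a dictionary mapping each sample name to its output file name.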

(week) DAY 3

plot the data using matplotlib
reproduce the plots using R's ggplot2

Bonus

Using Docker to ship our code/results
Public repo with the tutorial's code
Follow up for problem solving using CodersCrowd

What environment should we use ?

The environment should be specific to each analysis. For our study group we will be using virtualenv, and I will show you how we can even exchange these environments with other developers.
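
One common way to exchange an environment is a pinned requirements file that someone else can install with pip install -r requirements.txt. The sketch below builds such a listing from inside the active environment using only the standard library (Python 3.8 or later). It is an illustration of the idea, not a replacement for pip freeze, and the function names are hypothetical.

```python
from importlib import metadata
from typing import Iterable, List, Tuple


def installed_packages() -> List[Tuple[str, str]]:
    """List (name, version) for every distribution visible to this interpreter."""
    pairs = []
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip broken installs that lack a name
            pairs.append((name, dist.version))
    return pairs


def format_requirements(pairs: Iterable[Tuple[str, str]]) -> str:
    """Render pinned requirements, one 'name==version' per line, sorted by name."""
    lines = sorted({f"{name}=={version}" for name, version in pairs}, key=str.lower)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the pinned list; redirect it to requirements.txt to share it.
    print(format_requirements(installed_packages()), end="")
```

Run inside the activated virtualenv, this prints only the packages installed for that analysis, which is what keeps the shared file small and specific.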

What should we know a priori for this hacky hour ?

A little bit of Python, an object-oriented spirit, and some cookies and coffee.

What should we be able to do after this course ?

Something similar to this :

[Figure: allele frequencies clustering (positions vs samples)]

[Figure: coverage across positions, violin plot]

[Figure: target capture CDF]

[Figure: zygosity matrix]

[Figure: coverage heatmap across amplicon regions]

[Figure: in-target vs out-of-target reads]

[Figure: mapped vs unmapped reads]

[Figure: mapping qualities]
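
As a small taste of the numbers behind a plot like the target capture CDF, here is a sketch that computes, for a few depth thresholds, the fraction of positions covered at least that deeply. This reading of "capture CDF" is an assumption on my part, and the function name and the idea of passing depths as a plain list are hypothetical simplifications.

```python
from typing import Dict, Iterable, List


def fraction_covered(depths: List[int], thresholds: Iterable[int]) -> Dict[int, float]:
    """For each threshold, return the fraction of positions with depth >= threshold."""
    if not depths:
        raise ValueError("depths must not be empty")
    total = len(depths)
    # sum() over booleans counts how many positions meet the threshold.
    return {t: sum(d >= t for d in depths) / total for t in thresholds}
```

Plotting the thresholds against these fractions gives a curve that starts near 1 and falls as the required depth grows.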

May 11 '15 17:05 radaniba

This would be great! I'm really looking forward to learning how to handle DNA sequencing data (all my experience is with RNA-Seq).

Out of curiosity, what are you using to map and process the reads? The reason I ask is that I'm anticipating Windows users having compatibility issues with UNIX-only parts of the pipeline.

May 11 '15 18:05 jstaf

@kazi11 I will be using bowtie2, and this will be part of the "wrap publicly available tools" section. That said, it could be bwa as well, or any short-read aligner. I have never used these tools on Windows, but I don't expect issues with them. That will be a nice test for reproducibility as well :) We can use Docker on Windows too, so that we can all use the same OS!

May 11 '15 19:05 radaniba

Huh, so I just checked the bowtie2 documentation and it totally runs on Windows. Tophat2 (an RNA-seq aligner that uses bowtie2) doesn't, so I assumed that bowtie2 had the same problem. Apparently not (false alarm!).

May 11 '15 19:05 jstaf

If you have the right compiler and the source code, I guess it should work easily on Windows (sometimes easier said than done when it comes to compiling source code).

May 11 '15 20:05 radaniba

Hey @radaniba,

This looks awesome! Couple of comments:

I see you've got version control listed as 1/3 of one lesson; we spent a whole hour on this just a few weeks ago. Is it possible to teach this lesson without git? You know I'm a huge fan of git, but maybe best to tackle one thing at a time, and keep the focus on the main material.
This looks (to me) like a lot of heavy machinery - what's the plan to help as many people as possible participate? It'd be good to start by identifying all the dependencies we'll need, and figure out what we can do to smooth installation / setup (kind of sounds like you already have a plan though :)
This is pretty python heavy; I'm a huge python fan, but I think a lot of people come particularly for the R (though I could be wrong! If anyone wants to do some python, say so here!). Would it be possible to explore these same ideas in R?
This looks really genomics specific (which is great, there are tons of genomics people in the group!); will non-genomics people get much out of this? I'd be delighted to have a series of genomics-targeted events, but I just want to make sure we communicate that to people if that's the case, so that everyone knows what they're getting into.

May 18 '15 00:05 bkatiemills

Hi @BillMills ,

Thanks for your comments

I see you've got version control listed as 1/3 of one lesson; we spent a whole hour on this just a few weeks ago. Is it possible to teach this lesson without git? You know I'm a huge fan of git, but maybe best to tackle one thing at a time, and keep the focus on the main material.

You're right, I agree. If this saves us precious time, that would be great. There is a lot of material in there, and I am aware it takes a lot of time to finish that kind of program.

This looks (to me) like a lot of heavy machinery - what's the plan to help as many people as possible participate? It'd be good to start by identifying all the dependencies we'll need, and figure out what we can do to smooth installation / setup (kind of sounds like you already have a plan though :)

Yes, I planned to allocate some time for participants to install the dependencies. It's not a big deal, though: it is the usual process with pip and virtualenv, and the whole thing should take 10 minutes or so.

This is pretty python heavy; I'm a huge python fan, but I think a lot of people come particularly for the R (though I could be wrong! If anyone wants to do some python, say so here!). Would it be possible to explore these same ideas in R?

That would be great (besides, this is the subject of a new issue I will open here). That said, I have always written pipelines in Python, and I am not aware of R frameworks equivalent to the make-like family of utilities that make it possible to 'easily' develop a pipeline (the whole purpose is making participants familiar with automation). If anyone is aware of R packages for building pipelines, please let me know.

This looks really genomics specific (which is great, there are tons of genomics people in the group!); will non-genomics people get much out of this? I'd be delighted to have a series of genomics-targeted events, but I just want to make sure we communicate that to people if that's the case, so that everyone knows what they're getting into.

Yes it is (my daily routine). This is applied, but the protocol can fit any other application. I would be more than happy to collaborate with someone from another field and do the same (even though it will be out of my comfort zone). This will demonstrate that the process is reproducible to some extent (any ideas are welcome, I love challenges :) )

I agree with you, this is a bit too specific. Why don't we make this the final step of some shorter intermediate sessions? I am thinking of:

Writing simple pipelines in Python
Data analysis with R and Python together
Data visualization in R and Python

We can break this into smaller units before attacking such heavy processes.

Does this make sense ?

PS : see my next issue

May 18 '15 00:05 radaniba

Yeah, I don't think there are any R frameworks for pipeline programming. But hey, I really want to learn how to throw together a pipeline in Python (could be a really nice alternative to the rather finicky shell scripts I've been writing).

And @BillMills, I think there would be a lot of useful information here even for non-genomics peeps. Writing pipelines has a lot of utility for a large number of fields. Genomics is just a really good example to learn from.

May 18 '15 01:05 jstaf

Is it possible to throw together a Python pipeline without basic Python knowledge?

May 19 '15 04:05 minisciencegirl

Yes, with a python crash course this is totally doable

May 19 '15 04:05 radaniba

Cool that sounds great! Really interested in Rython tool kit as well. Let's schedule a session soon?

May 19 '15 04:05 minisciencegirl

So how about these sessions - could be a good opportunity to do some events at BC Cancer.

Jun 17 '15 17:06 bkatiemills

Oh that would be great, maybe @minisciencegirl can check with the VanBug team (@minisciencegirl are you part of their core team?). They have easy access to facilities here, so it can be a good start.

Jun 23 '15 16:06 radaniba

Hi Rad,

I am on their Dev team. I can check with Kirin and Will about booking rooms. Let me know a time that works and I can get the ball rolling. Would it be possible to schedule these for later in the afternoon?

Cheers,

Amy

Jun 23 '15 19:06 minisciencegirl
