guacamole joint caller: implement filters

trafficstars

Some filters we should probably have (going by artifacts I've observed in real data)

Strand bias
Variants reads mostly start or end at the same place
Variants toward the end of homopolymers (perhaps take into account strand)
SNVs near indels
Reads in normal sample have lots of mismatches or poor alignments in the region

Some of these are implemented here https://github.com/hammerlab/guacamole/tree/master/src/main/scala/org/hammerlab/guacamole/filters and can be adapted for the joint caller.

For things directly involving base qualities, mapping qualities, and read depths I'd like to try integrating these directly into the likelihood computation when possible rather than having them as filters. For example, instead of having a threshold filter on read depth, it may be possible to parameterize our prior likelihood on there being a somatic variant in a more intuitive way involving minimum read depth.

Here are (some of) the filters Mutect supports: http://www.nature.com/nbt/journal/v31/n3/fig_tab/nbt.2514_T1.html

Since we're joint calling, one question is what sample should these filters run on. I think the simplest thing to do is run them on the pooled sequencing data, but another option to experiment with may be to run them on the samples that trigger a call, and then only keep the call if there are any samples that trigger the call and aren't filtered out.

Any filtered calls should be written out to a VCF with the FILTER field set to explain why it was filtered.

@e5c is down to take a stab at this once we have #384 merged and running on the cluster, assigning to her

Feb 16 '16 19:02 timodonnell

One of the most important filters to have: filter somatic calls that have evidence in the normal (i.e. cases where the variant is present in the normal at too low a fraction to result in a germline call but too high a fraction to be due to just chance). A possible extension here would be tolerating a parameterized amount of tumor contamination in normal, although can also leave that for later

Feb 17 '16 18:02 timodonnell

My PR https://github.com/hammerlab/guacamole/pull/394 has the filters you mentioned in the mutect-like somatic caller. Eventually these filters need to be tested and refactored out in such a way that the joint caller can use them easily.

Mar 13 '16 17:03 jstjohn

Mutect filters / annotations we need (much of this is based on here)

[X] Evidence in normal
[X] Strand bias
[X] triallelic site
[ ] proximal gap
[ ] poor mapping
[ ] clustered position
[ ] panel of normals (blacklist): does the Broad publish their panel of normals? if not we should make our own with whatever sequencing data we have available. Will probably involve implementing something similar to Mutect's --artifact_detection_mode.
[ ] dbSNP (less trusted blacklist)
[ ] COSMIC (white list)

Other annotations

[ ] Median mismatches per read of variant-supporting vs. reference-supporting reads
[ ] For variant reads that have duplicates, do all reads for that fragment support the variant?
[ ] Mean / median base quality of variant vs. reference bases

Somewhat helpful diagram from the talk above:

Apr 24 '16 18:04 timodonnell

guacamole guacamole copied to clipboard

joint caller: implement filters

guacamole
guacamole copied to clipboard