guacamole
guacamole copied to clipboard
joint caller: implement filters
Some filters we should probably have (going by artifacts I've observed in real data)
- Strand bias
- Variants reads mostly start or end at the same place
- Variants toward the end of homopolymers (perhaps take into account strand)
- SNVs near indels
- Reads in normal sample have lots of mismatches or poor alignments in the region
Some of these are implemented here https://github.com/hammerlab/guacamole/tree/master/src/main/scala/org/hammerlab/guacamole/filters and can be adapted for the joint caller.
For things directly involving base qualities, mapping qualities, and read depths I'd like to try integrating these directly into the likelihood computation when possible rather than having them as filters. For example, instead of having a threshold filter on read depth, it may be possible to parameterize our prior likelihood on there being a somatic variant in a more intuitive way involving minimum read depth.
Here are (some of) the filters Mutect supports: http://www.nature.com/nbt/journal/v31/n3/fig_tab/nbt.2514_T1.html
Since we're joint calling, one question is what sample should these filters run on. I think the simplest thing to do is run them on the pooled sequencing data, but another option to experiment with may be to run them on the samples that trigger a call, and then only keep the call if there are any samples that trigger the call and aren't filtered out.
Any filtered calls should be written out to a VCF with the FILTER field set to explain why it was filtered.
@e5c is down to take a stab at this once we have #384 merged and running on the cluster, assigning to her
One of the most important filters to have: filter somatic calls that have evidence in the normal (i.e. cases where the variant is present in the normal at too low a fraction to result in a germline call but too high a fraction to be due to just chance). A possible extension here would be tolerating a parameterized amount of tumor contamination in normal, although can also leave that for later
My PR https://github.com/hammerlab/guacamole/pull/394 has the filters you mentioned in the mutect-like somatic caller. Eventually these filters need to be tested and refactored out in such a way that the joint caller can use them easily.
Mutect filters / annotations we need (much of this is based on here)
- [X] Evidence in normal
- [X] Strand bias
- [X] triallelic site
- [ ] proximal gap
- [ ] poor mapping
- [ ] clustered position
- [ ] panel of normals (blacklist): does the Broad publish their panel of normals? if not we should make our own with whatever sequencing data we have available. Will probably involve implementing something similar to Mutect's
--artifact_detection_mode. - [ ] dbSNP (less trusted blacklist)
- [ ] COSMIC (white list)
Other annotations
- [ ] Median mismatches per read of variant-supporting vs. reference-supporting reads
- [ ] For variant reads that have duplicates, do all reads for that fragment support the variant?
- [ ] Mean / median base quality of variant vs. reference bases
Somewhat helpful diagram from the talk above:
