varCA

don't use gatk selectvariants + allow varca to output SNVs and indels in the same VCF

Open aryarm opened this issue 5 years ago • 1 comment

Many callers output both indels and SNVs in the same VCF. To separate them before writing the final TSVs, we use gatk SelectVariants. It conveniently lets us keep GVCF blocks where there are no variants. However, there is no way to make it label indels as no-call when selecting SNVs, and vice versa; it simply filters them out. This means that we lose depth and other valuable information at those sites.

I haven't been able to find a tool that does what we want, so I think we'll have to write a custom script. We already have the classify.awk script, but it doesn't handle every type of VCF ALT allele, and it only accepts the REF and ALT columns as input (nothing else). We should:

  • [ ] modify classify.awk to work with
    • [ ] BND alleles
    • [ ] MIXED alleles
  • [ ] write a bash script to filter VCFs using classify.awk
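For illustration, here is a rough Python sketch of the behavior described above: classify each record by its REF/ALT alleles (including BND and MIXED) and, instead of dropping records of the other type, set their genotype to no-call so that depth and the other FORMAT fields survive. This is not classify.awk or the planned bash script. The names `classify` and `mask_record`, the category labels, and the choice to express "no-call" by rewriting GT to `./.` are all assumptions made for the example.

```python
def classify(ref, alt):
    """Classify a VCF record from its REF and ALT columns (hypothetical categories)."""
    # Ignore placeholders that do not describe a real alternate allele.
    alts = [a for a in alt.split(",") if a not in (".", "*", "<NON_REF>")]
    if not alts:
        return "REF"  # e.g. a GVCF reference block
    kinds = set()
    for a in alts:
        if "[" in a or "]" in a:
            kinds.add("BND")       # breakend notation
        elif a.startswith("<"):
            kinds.add("SYMBOLIC")  # e.g. <DEL>
        elif len(a) == len(ref):
            kinds.add("SNV" if len(ref) == 1 else "MNV")
        else:
            kinds.add("INDEL")
    return kinds.pop() if len(kinds) == 1 else "MIXED"


def mask_record(line, keep):
    """Keep the record, but set GT to no-call if its type is not `keep`.

    Assumption: "label as no-call" means rewriting the GT sub-field to "./."
    while leaving DP and the other FORMAT values untouched.
    """
    f = line.rstrip("\n").split("\t")
    kind = classify(f[3], f[4])
    if kind in ("REF", keep) or len(f) < 10:
        return "\t".join(f)
    keys = f[8].split(":")
    if "GT" in keys:
        gt = keys.index("GT")
        for i in range(9, len(f)):
            vals = f[i].split(":")
            if gt < len(vals):
                vals[gt] = "./."
            f[i] = ":".join(vals)
    return "\t".join(f)
```

With `keep="SNV"`, an indel record would stay in the output with its GT replaced by `./.`, while SNV records and reference blocks would pass through unchanged.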

aryarm, Jul 15 '19 16:07

New plan

Create a new branch and merge the indel and SNP pipelines into one. We didn't really need them to be separate in the first place, since classify.awk will binarize them anyway. By merging them, we won't have to deal with the problem of separating SNPs and indels at all, and there won't be a need to use gatk SelectVariants to begin with. This will also significantly simplify the prepare pipeline. And it would open up the possibility of using multilabel classification later on, which would allow varca to output both SNVs and indels in the same VCF.
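To make the multilabel idea concrete, here is a small Python sketch of what per-site labels could look like: each record gets two independent binary labels rather than one, so a site with both an SNV and an indel allele is representable. The function name `multilabel` and the `(has_snv, has_indel)` encoding are assumptions for illustration, not something that exists in varca.

```python
def multilabel(ref, alt):
    """Return (has_snv, has_indel) for one VCF record's REF and ALT columns."""
    has_snv = has_indel = 0
    for a in alt.split(","):
        # Skip missing alleles, spanning deletions, symbolic alleles, and breakends.
        if a in (".", "*") or a.startswith("<") or "[" in a or "]" in a:
            continue
        if len(a) == len(ref) == 1:
            has_snv = 1
        elif len(a) != len(ref):
            has_indel = 1
        # Equal-length alleles longer than 1 bp (MNVs) get neither label here.
    return has_snv, has_indel
```

Under this encoding a mixed site such as REF `A`, ALT `T,AT` would be labeled `(1, 1)`, which a single binary label cannot express.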

aryarm, Aug 19 '19 03:08