varCA
don't use gatk selectvariants + allow varca to output SNVs and indels in the same VCF
Many callers output both InDels and SNVs in the same VCF.
To separate them from each other before writing the final TSVs, we use `gatk SelectVariants`. It conveniently lets us keep GVCF blocks where there are no variants. However, there is no way to ask it to label InDels as no-calls when selecting SNVs, or vice versa; it simply filters them out. This means that we lose depth and other valuable information at those sites.
I haven't been able to find a tool that achieves the behavior we want, so I think we might have to write a custom script. We already have the `classify.awk` script, but it doesn't work for every type of VCF ALT allele, and it accepts only the REF and ALT columns as input (and nothing else).
We should
- [ ] modify `classify.awk` to work with
  - [ ] BND alleles
  - [ ] MIXED alleles
- [ ] write a bash script to filter VCFs using `classify.awk`
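To make the target behavior concrete, here is a rough sketch in Python of the two tasks above: classifying a site from its REF and ALT columns (including BND and MIXED), and setting genotypes to no-call instead of dropping the record. This is not the logic in `classify.awk`; the function names, the type labels, and the handling of `*` and same-length substitutions are all assumptions made for illustration, and the no-call string assumes diploid samples.

```python
import re

# Breakend ALTs embed a mate position in brackets, e.g. "G]17:198982]".
BND_RE = re.compile(r"[\[\]]")


def classify_allele(ref: str, alt: str) -> str:
    """Classify one REF/ALT pair (labels are hypothetical, not classify.awk's)."""
    if alt in (".", "<NON_REF>", "<*>") or alt == ref:
        return "NO_VARIANT"  # reference call or GVCF placeholder allele
    if alt == "*":
        return "OTHER"  # spanning deletion; how to label it is an open choice
    if BND_RE.search(alt):
        return "BND"
    if len(alt) > 1 and (alt.startswith(".") or alt.endswith(".")):
        return "BND"  # single breakend, e.g. "G." or ".G"
    if alt.startswith("<") and alt.endswith(">"):
        return "SYMBOLIC"  # e.g. <DEL>, <DUP>
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) != len(alt):
        return "INDEL"
    return "OTHER"  # same-length multi-base substitution


def classify_site(ref: str, alt_field: str) -> str:
    """Classify a whole site; more than one kind of ALT allele makes it MIXED."""
    kinds = {classify_allele(ref, alt) for alt in alt_field.split(",")}
    kinds.discard("NO_VARIANT")
    if not kinds:
        return "NO_VARIANT"
    return kinds.pop() if len(kinds) == 1 else "MIXED"


def mask_record(line: str, keep: str) -> str:
    """Return a VCF data line with GT set to no-call unless the site is `keep`.

    Other FORMAT fields (DP, GQ, ...) are left untouched, so depth survives.
    """
    fields = line.rstrip("\n").split("\t")
    site_type = classify_site(fields[3], fields[4])
    if site_type in (keep, "NO_VARIANT") or len(fields) < 10:
        return "\t".join(fields)
    fmt = fields[8].split(":")
    if "GT" not in fmt:
        return "\t".join(fields)
    gt_idx = fmt.index("GT")
    for i in range(9, len(fields)):
        sample = fields[i].split(":")
        if gt_idx < len(sample):
            sample[gt_idx] = "./."  # assumes diploid samples
        fields[i] = ":".join(sample)
    return "\t".join(fields)
```

For example, `mask_record("1\t100\t.\tA\tAT\t50\tPASS\t.\tGT:DP\t0/1:30", "SNV")` keeps the record and its `DP` of 30 but rewrites the genotype to `./.`. Note that in this sketch a MIXED site is also masked when selecting SNVs; whether that is the right call is a separate decision.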
New plan
Create a new branch and merge the indel and SNP pipelines into one. We didn't really need them to be separate in the first place, since `classify.awk` will binarize them anyway.
By merging them, we won't have to deal with the problem of separating SNPs and indels at all, and there won't be a need to use `gatk SelectVariants` to begin with.
This will also significantly simplify the `prepare` pipeline.
It would also open up the possibility of using multilabel classification later on, which would allow varca to output both SNVs and indels in the same VCF.
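As a small illustration of the difference, the sketch below contrasts a binarized label (variant or not) with a multilabel encoding that keeps one indicator per variant type. The function names and the `"SNV"`/`"INDEL"` strings are hypothetical and do not come from the varca codebase.

```python
# Order of the indicators in the multilabel vector (an assumption).
LABELS = ("SNV", "INDEL")


def binary_label(site_types):
    """Single 0/1 label: is any SNV or indel present at the site?"""
    return int(any(t in LABELS for t in site_types))


def multilabel(site_types):
    """One 0/1 indicator per variant type, so a site can carry both."""
    present = set(site_types)
    return [int(label in present) for label in LABELS]
```

With the binary encoding, a site with only an indel and a site with both an SNV and an indel look the same (`1`). With the multilabel encoding they become `[0, 1]` and `[1, 1]`, which is what a classifier would need in order to report both types in one VCF.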