hap.py
hap.py copied to clipboard
Support for allele only comparisons
Hi,
Is it possible to use hap.py to make allele only comparisons? (i.e. Can the genotype check be disabled?). I know som.py does this, but I'd like to keep hap.py's haplotype resolution - only without the genotype checking.
vcf-eval appears to support this with the --squash-ploidy option.
Thanks.
Hap.py allows to do this via the --set-gt option.
--set-gt {half,hemi,het,hom}
This is used to treat Strelka somatic files Possible
values for this parameter: half / hemi / het / hom /
half to assign one of the following genotypes to the
resulting sample: 1 | 0/1 | 1/1 | ./1. This will
replace all sample columns and replace them with a
single one.
Running with --set-gt hom should achieve the same as --squash-ploidy in vcfeval.
Thanks, this looks perfect for validating against a somatic truth set. However, if I wanted to run against the Genome In a Bottle Truth set, I'm not sure this would work? Do you have a recommendation for that case?
Another way to get an idea of the allele-only performance is to run in standard mode and use the FP.GT value. FP.GT gives the number of query calls which the genotype was not correct. An updated TP / FP / FN count that represents the performance ignoring genotypes can be obtained as follows:
TP.AL = TP + FP.GT FN.AL = FN - FP.GT FP.AL = FP - FP.GT
(each FP.GT creates one FP and one FN).
Details of this are also described in the upcoming GA4GH best-practices paper: https://www.biorxiv.org/content/early/2018/02/23/270157
What does --set-gt do for a 1/2 call? vcfeval --squash-ploidy will allow either allele to match.
For a 1/2 call, --set-gt hom would produce two 1/1 records, one for each allele. Of course these cannot be haplotype matched anymore if they overlap on the reference after normalisation (xcmp would output a warning about being unable to produce haplotype pairs then). An alternative could be --set-gt het, which would at least work in the 1/2 case, with the difference that we'd require both alleles to be present in the truth + query and count a FN/FP if one of them is missing.
I can see a use case for allowing either allele to match in the scenario where we're not sure about a repeat expansion length (we could specify a range of expansions as GT 1/2/3/4...), is this the way it is intended to use? In other cases, this feels like quite a lenient way to compare.
One such use case where leniency is desired is calling variants against RNA and validating against a truth set generated from DNA.
In this case, we don't know ground truth on which alleles are actually expressed, but I'd like to get counts of expressed variants that match DNA alleles (TPs) as well as RNA calls that aren't in the DNA truth set (putative FP's). We'd like to preserve the haplotype resolution when possible as well (to account for varying caller representations).
The 1/2 matching allowing either allele to match is handy when looking for hits in a somatic database such as COSMIC (so not really a benchmarking scenario), where the 1/2 call corresponds to a somatic allele on one haplotype and a germline allele on the other. You're just looking for the presence of the somatic allele and don't want the germline allele to prevent the matching. (Of course if the somatic allele itself is a composite of germline plus somatic variation only decomposition can save you)