hap.py
hap.py copied to clipboard
happy treats variants that aren't fully within the confident regions as confident
In our deep dive into happy with DeepVariant we've noticed that hap.py includes in its confident calls for comparison any multi-base event (e.g., a deletion) that merely overlaps with the confident regions, rather than requiring that event to be full contained within the confident regions. Internally we require the event to be fully contained within the confident regions to be considered confident. We wanted to double check here that (a) hap.py intended this behavior and (b) check this is the recommended approach (overlaps, not contains) according to the GA4GH benchmarking recommendations?
I think we discussed this in the GA4GH benchmarking group and the conclusion was that this is more of a truthset rather than a comparison issue. Hap.py does stratify these out using the TS_contained (which requires the events to be fully contained) and TS_boundary tags in the VCF (and provides the corresponding precision/recall metrics in the .extended.csv file).
The reason we include both these event types in the total by default is the following:
- truth events should be fully contained in the confident regions when creating the truthset
- the truth set confident regions should cover every place that is actually considered hom-ref / called, so even partial overlaps with query calls that aren't matched with truth variants can be considered as FP (unless there are issues with the truth set -- we actually had to do some refinement of the Platinum Genomes confident regions in the latest release to deal with issues like this).
Here's a concrete example:
candidate: no call is present
truth:
20 16755600 rs369843912 AAGAAAAAAG A 50 PASS . GT:DP:ADALL:AD:GQ:IGT:IPS:PS 0/1:228:130,71:145,77:366:0/1:.:.
happy:
20 16755600 . AAGAAAAAAG A 50 . BS=16755600;Regions=CONF,TS_contained GT:BD:BK:BI:BVT:BLT:QQ
BED confident region:
20 16755608 16790996
Happy is tagging this variant as TS_contained but that's clearly not the case here.
Here's our exact command line:
sudo docker run -it \
-v "${ROOT}:${ROOT}" \
-v "${OUTPUT_DIR}:${OUTPUT_DIR}" \
pkrusche/hap.py /opt/hap.py/bin/hap.py \
"${TRUTH_VCF}" \
"${LABELED_VCF}" \
--preprocess-truth \
-f "${TRUTH_BED}" \
-r "${REF%.gz}" \
-o "${OUTPUT_DIR}/happy.output" \
-l ${REGION}
Picture of the region (confident region BED is on the bottom):

This variant is from HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_triophased.vcf.gz and it's associated BED file.
This is probably the result of hap.py 'fixing' the confident regions after preprocessing the truth VCF (the corresponding command line options are --adjust-conf-regions and --no-adjust-conf-regions, the default is on to work around issues with older Platinum Genomes releases) to make sure that any truth variant that just overlaps the truth confident regions will be fully contained.
In the GA4GH benchmarking setting we'd typically not preprocess the truth calls (i.e. the truth calls are treated as the 'gold standard' which we shouldn't really change), but we also use this type of preprocessing for internal assessment / methods improvement -- unfortunately, it adds some corner cases just like the one above.
I think running with --no-adjust-conf-regions should at least label the call above correctly.