hap.py happy treats variants that aren't fully within the confident regions as confident

In our deep dive into happy with DeepVariant we've noticed that hap.py includes in its confident calls for comparison any multi-base event (e.g., a deletion) that merely overlaps with the confident regions, rather than requiring that event to be full contained within the confident regions. Internally we require the event to be fully contained within the confident regions to be considered confident. We wanted to double check here that (a) hap.py intended this behavior and (b) check this is the recommended approach (overlaps, not contains) according to the GA4GH benchmarking recommendations?

Jan 17 '18 19:01 depristo

I think we discussed this in the GA4GH benchmarking group and the conclusion was that this is more of a truthset rather than a comparison issue. Hap.py does stratify these out using the TS_contained (which requires the events to be fully contained) and TS_boundary tags in the VCF (and provides the corresponding precision/recall metrics in the .extended.csv file).

The reason we include both these event types in the total by default is the following:

truth events should be fully contained in the confident regions when creating the truthset
the truth set confident regions should cover every place that is actually considered hom-ref / called, so even partial overlaps with query calls that aren't matched with truth variants can be considered as FP (unless there are issues with the truth set -- we actually had to do some refinement of the Platinum Genomes confident regions in the latest release to deal with issues like this).

Jan 17 '18 19:01 pkrusche

Here's a concrete example:

candidate: no call is present
truth:
20      16755600        rs369843912     AAGAAAAAAG      A       50      PASS    .   GT:DP:ADALL:AD:GQ:IGT:IPS:PS 0/1:228:130,71:145,77:366:0/1:.:.

happy:
20      16755600        .       AAGAAAAAAG      A       50      .       BS=16755600;Regions=CONF,TS_contained   GT:BD:BK:BI:BVT:BLT:QQ

BED confident region:
20      16755608        16790996

Happy is tagging this variant as TS_contained but that's clearly not the case here.

Here's our exact command line:

sudo docker run -it \
-v "${ROOT}:${ROOT}" \
-v "${OUTPUT_DIR}:${OUTPUT_DIR}" \
pkrusche/hap.py /opt/hap.py/bin/hap.py \
  "${TRUTH_VCF}" \
  "${LABELED_VCF}" \
  --preprocess-truth \
  -f "${TRUTH_BED}" \
  -r "${REF%.gz}" \
  -o "${OUTPUT_DIR}/happy.output" \
  -l ${REGION}

Picture of the region (confident region BED is on the bottom): screen shot 2018-01-11 at 4 01 59 pm

This variant is from HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_triophased.vcf.gz and it's associated BED file.

Jan 17 '18 19:01 depristo

This is probably the result of hap.py 'fixing' the confident regions after preprocessing the truth VCF (the corresponding command line options are --adjust-conf-regions and --no-adjust-conf-regions, the default is on to work around issues with older Platinum Genomes releases) to make sure that any truth variant that just overlaps the truth confident regions will be fully contained.

In the GA4GH benchmarking setting we'd typically not preprocess the truth calls (i.e. the truth calls are treated as the 'gold standard' which we shouldn't really change), but we also use this type of preprocessing for internal assessment / methods improvement -- unfortunately, it adds some corner cases just like the one above.

I think running with --no-adjust-conf-regions should at least label the call above correctly.

Jan 17 '18 19:01 pkrusche

hap.py hap.py copied to clipboard

happy treats variants that aren't fully within the confident regions as confident

hap.py
hap.py copied to clipboard