hap.py icon indicating copy to clipboard operation
hap.py copied to clipboard

VCF Comparison with same file does not yield perfect result

Open itsroops opened this issue 3 years ago • 6 comments

I am currently doing benchmarking of the VCF files with the gold standard file HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf. While I am giving both the query as well as the truth as the GIAB VCF file themselves, I am getting a perfect score of precision, recall, and f1. However, for any other random VCF files which are compared with themselves, I am not getting a perfect score rather am getting around 90%.

Is this normal behaviour or a there a different reason? Is the gold standard file hardcoded in the hap.py?

itsroops avatar Jun 03 '21 17:06 itsroops

Same here

serge2016 avatar Jun 04 '21 06:06 serge2016

That usually means the vcf in question is not self-consistent, e.g. not all of the asserted variants can be simultaneously possible in a diploid genome. In a few cases it can be complex situations that vcfeval might resolve that xcmp misses. Are you using xcmp or the vcfeval engine?

You can look at the specific FP variants and they will be nearby other variants that occupy/overlap the same reference coordinates.

Lenbok avatar Jun 04 '21 07:06 Lenbok

Thank you for the answer. I am not using any other tool for benchmarking but only hap.py. One quick question: Do we need to preprocess the VCF files by running any python code before passing to hap.py ? I am just using hap.py truth.vcf query.vcf -f confident.bed -o output_prefix -r reference.fa command for comparison

itsroops avatar Jun 04 '21 07:06 itsroops

Adding the option --engine vcfeval usually improves detection of trickier variants. If there are any partially overlapping calls, this might be able to detect them using a combination of this and --ref-overlap.

TimD1 avatar Jan 23 '22 18:01 TimD1

I can reproduce this error, but it looks like it affects a somewhat small fraction of variants.

For example, I have attached the summary of analyzing the Genome In A Bottle (GIAB) v4.2.1 benchmark with itself (HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz, from https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/latest/GRCh38/):

CALLABLE-GAIB_v4.2.1-vs-GAIB_v4.2.1.summary.csv

I generated that result with the following command:

singularity exec --bind $MOUNTIN:$MOUNTOUT hap.py_latest.sif /opt/hap.py/bin/hap.py $REF $TEST -f $BED -o $OUT -r $FA

This uses HG002_GRCh38_1_22_v4.2.1_callablemultinter_gt0.bed.gz to define the regions to test (even if I think HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed.gz is often used).

cwarden45 avatar Aug 10 '23 23:08 cwarden45

Hi, How do you use --ref-overlap ? It's not an option of hap.py. Is it in a config file ?

wcarre avatar Apr 17 '24 08:04 wcarre