hap.py
hap.py copied to clipboard
VCF Comparison with same file does not yield perfect result
I am currently doing benchmarking of the VCF files with the gold standard file HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf. While I am giving both the query as well as the truth as the GIAB VCF file themselves, I am getting a perfect score of precision, recall, and f1. However, for any other random VCF files which are compared with themselves, I am not getting a perfect score rather am getting around 90%.
Is this normal behaviour or a there a different reason? Is the gold standard file hardcoded in the hap.py?
Same here
That usually means the vcf in question is not self-consistent, e.g. not all of the asserted variants can be simultaneously possible in a diploid genome. In a few cases it can be complex situations that vcfeval might resolve that xcmp misses. Are you using xcmp or the vcfeval engine?
You can look at the specific FP variants and they will be nearby other variants that occupy/overlap the same reference coordinates.
Thank you for the answer. I am not using any other tool for benchmarking but only hap.py. One quick question: Do we need to preprocess the VCF files by running any python code before passing to hap.py ? I am just using hap.py truth.vcf query.vcf -f confident.bed -o output_prefix -r reference.fa command for comparison
Adding the option --engine vcfeval
usually improves detection of trickier variants. If there are any partially overlapping calls, this might be able to detect them using a combination of this and --ref-overlap
.
I can reproduce this error, but it looks like it affects a somewhat small fraction of variants.
For example, I have attached the summary of analyzing the Genome In A Bottle (GIAB) v4.2.1 benchmark with itself (HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz, from https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/latest/GRCh38/):
CALLABLE-GAIB_v4.2.1-vs-GAIB_v4.2.1.summary.csv
I generated that result with the following command:
singularity exec --bind $MOUNTIN:$MOUNTOUT hap.py_latest.sif /opt/hap.py/bin/hap.py $REF $TEST -f $BED -o $OUT -r $FA
This uses HG002_GRCh38_1_22_v4.2.1_callablemultinter_gt0.bed.gz to define the regions to test (even if I think HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed.gz is often used).
Hi, How do you use --ref-overlap ? It's not an option of hap.py. Is it in a config file ?