
Target region specification: weird observation

Open ghost opened this issue 6 years ago • 0 comments

Hi Peter,

I'm running hap.py to compare the F1-score of different variant callers for exome data using the GIAB truth set as reference.

To check how strongly the regions adjacent to the annotated exons influence the F1-score of a given run, I used the following approach:

  1. The given variant caller was run with a BED target file containing the exonic regions extended by plus/minus 10 bases, resulting in a VCF file, say "ext10.vcf".

  2. The output VCF from step 1 was used to run hap.py with different BED files (10-base extension, 8-base extension, ..., 0-base extension) passed via the -f parameter, in order to restrict the analysis to progressively smaller extensions and check the respective F1-score:

     run 1: ext10.vcf + ext10.bed
     run 2: ext10.vcf + ext8.bed
     ...
     run 5: ext10.vcf + ext0.bed
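For reference, the padded BED files could be derived from the unpadded exon BED roughly as follows. This is only a minimal sketch: the function name `pad_intervals` and the example coordinates are hypothetical, overlapping or touching intervals are merged after padding, and the extension is clamped at position 0 but not at chromosome ends.

```python
from typing import List, Tuple

# BED-style interval: (chrom, start, end), 0-based and half-open.
Interval = Tuple[str, int, int]


def pad_intervals(intervals: List[Interval], pad: int) -> List[Interval]:
    """Extend each interval by `pad` bases on both sides, then merge overlaps."""
    # Clamp the start at 0; chromosome lengths are not known here,
    # so the end is NOT clamped (assumption of this sketch).
    padded = sorted(
        (chrom, max(0, start - pad), end + pad) for chrom, start, end in intervals
    )
    merged: List[Interval] = []
    for chrom, start, end in padded:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps or touches the previous interval: extend it.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical exon coordinates, for illustration only.
exons = [("chr1", 100, 200), ("chr1", 215, 300), ("chr2", 5, 50)]
for pad in (10, 8, 0):
    print(f"ext{pad}:", pad_intervals(exons, pad))
```

Note that neighbouring exons can fuse into a single interval once padded, so the number of regions may differ between extension levels.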

For the sake of completeness, the variant calling from step 1 was repeated with each of the BED files used in step 2, and the same BED file was then also used for the corresponding hap.py run to compare the F1-scores:

     run 1: ext10.vcf + ext10.bed
     run 2: ext8.vcf + ext8.bed
     ...
     run 5: ext0.vcf + ext0.bed

Surprisingly, the F1-score dropped by around 0.3-0.5% in the second approach for both SNPs and indels, due to slightly different FP or FN counts.
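To illustrate how small shifts in FP or FN move the F1-score, here is a minimal sketch using the standard definition F1 = 2 * precision * recall / (precision + recall). The counts are hypothetical and are not taken from my runs.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1-score from true-positive, false-positive and false-negative counts."""
    if tp == 0:
        # No true positives: precision and recall are 0 (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
restricted_in_happy = f1_score(tp=9900, fp=50, fn=50)
restricted_in_caller = f1_score(tp=9870, fp=60, fn=80)

print(f"F1 (restricted in hap.py): {restricted_in_happy:.4f}")
print(f"F1 (restricted in caller): {restricted_in_caller:.4f}")
print(f"difference: {restricted_in_happy - restricted_in_caller:.4%}")
```

With roughly ten thousand truth variants, a few dozen additional FPs or FNs are already enough to shift the F1-score by a few tenths of a percent.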

My understanding may be wrong, but I would expect it to make no difference whether the analysis is restricted by a BED file in the hap.py run itself or whether that BED file is already applied during variant calling. I also don't know whether the difference is caused by the respective variant caller or by hap.py.

It would be nice if you could clarify my misunderstanding. Thanks in advance and best regards!

ghost, Aug 29 '19 13:08