hap.py icon indicating copy to clipboard operation
hap.py copied to clipboard

VCF format issues with --write-vcf, - FORMAT field inconsistencies

Open insilicool opened this issue 1 year ago • 1 comments

Hi,

I've been attempting to parse the vcfs generated by the --write-vcf option into R and encountering issues. I believe this has to do with the inconsistency in the number of fields in the FORMAT section.

e.g. with the CMRG truth set

module load mugqic_dev/python/2.7.13 &&
0.3.15/hap.py
--threads 3 --write-vcf --decompose --preprocess-truth
--engine vcfeval
HG002_GRCh37_CMRG_smallvar_v1.00.vcf.gz
HG002_GRCh37_CMRG_smallvar_v1.00.vcf.gz
-f HG002_GRCh37_CMRG_smallvar_v1.00.bed
-r Homo_sapiens.hs37d5.fa
--engine-vcfeval-template Homo_sapiens.hs37d5.SDF
-o comparisons_0.3.15/HG002-cmrg-happy_0.3.15

This results in variants looking like this: bcftools query -f '%CHROM\t%POS\t%FORMAT\n' HG002-cmrg-happy_0.3.15.vcf.gz

Problematic calls: 21 47407748 GT:BD:BI:BVT:BLT:QQ 0/1:N:ti:SNP:het:. 0/1:N:ti:SNP:het:0 21 47407754 GT:BD:BI:BVT:BLT:QQ 0/1:N:i1_5:INDEL:het:. 0/1:N:i1_5:INDEL:het:0

Correctly formatted: 21 47408099 GT:BD:BK:QQ:BI:BVT:BLT 1/1:TP:gm:30:tv:SNP:homalt 1/1:TP:gm:30:tv:SNP:homalt 21 47408101 GT:BD:BK:QQ:BI:BVT:BLT 0/1:TP:gm:30:i16_plus:INDEL:het 0/1:TP:gm:30:i16_plus:INDEL:het

As you can see BK field is missing and QQ field is in a different location. Additionally, I don't see documentation explaining what a BD ="N" represents.

I first observed this in 0.3.12 but it persists in 0.3.15 as well.

Thanks for any help and insight you can provide.

  • Rob Eveleigh

insilicool avatar Apr 18 '23 16:04 insilicool