HQ is underdefined
HQ (Integer): Haplotype qualities, two comma separated phred qualities
The field, although part of the example at the start of the document, is underdefined.
- What is the 'haplotype' this is referring to?
- Do variants within the same PS phase block share the same HQ?
- Is HQ impacted by GQ?
- i.e. is this field a measure of the phasing quality, both the local allele call quality and phasing quality, or something else?
- Can we generalise a definition that works for both PS and PSL phasing?
- Number=2 presumes diploid. This field should be redefined as Number=P.
- How do we handle partial phasing and non-diploid?
  - e.g. how does HQ interact with GT=|0/1/1 or /0|0|1|1?
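To make the Number=2 concern concrete, here is a minimal Python sketch that counts the alleles in a GT string and checks whether a given HQ list would line up with them. It assumes that Number=P means one value per allele in GT, and it reads a leading `/` or `|` as the phasing of the first allele; both readings are assumptions for illustration, and `split_gt` and `hq_count_matches` are hypothetical helper names, not part of any existing library.

```python
import re


def split_gt(gt):
    """Split a GT string into allele tokens and the separator before each one.

    A leading '/' or '|' is recorded as the separator of the first allele;
    if there is none, None is recorded instead (assumed notation).
    """
    tokens = re.findall(r"([/|]?)([0-9]+|\.)", gt)
    alleles = [allele for _, allele in tokens]
    seps = [sep or None for sep, _ in tokens]
    return alleles, seps


def hq_count_matches(gt, hq):
    """Return True if HQ has one value per allele in GT (the Number=P reading)."""
    alleles, _ = split_gt(gt)
    return len(hq) == len(alleles)


if __name__ == "__main__":
    for gt in ("0|1", "|0/1/1", "/0|0|1|1"):
        alleles, seps = split_gt(gt)
        # A two-value HQ only lines up with the diploid genotype.
        print(gt, len(alleles), seps, hq_count_matches(gt, [10, 20]))
```

With a fixed Number=2, only the diploid case passes this check, and nothing in the sketch says which HQ value belongs to which allele once some separators are `/`. That mapping is exactly what the questions above ask the spec to define.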
I had a hunt through the commit history and it was there on day 1 (well, day 1 of the commit history) in 2013 for VCF 4.1. I was optimistically hoping there'd be a smoking gun recording which tool generates such data. Does anyone have access to the earlier versions of the spec and the commit history thereof? I think Heng used Mac Pages before TeX.
I've hunted through my various VCF files and can't find it ever being used.
There are Pages word processing versions of the SAM spec from before it was converted to LaTeX: I have two very similar copies dating from July 2009 that I think can be found in the samtools-devel archives. They were never in publicly-accessible source control.
The VCF specification originated as a 1000 Genomes wiki page before it was converted to LaTeX. Last I looked the wiki was long gone, but I have copies of the VCFv4.1 and VCFv4.2 wiki pages from October 2013. (Probably I snagged them then, or possibly later via the Wayback Machine.)
The wiki pages just contain the same description of HQ as the current spec does. Possibly there are some clues in the vcftools-spec archives.