htslib icon indicating copy to clipboard operation
htslib copied to clipboard

Phasing of first genotype

Open lczech opened this issue 3 years ago • 5 comments

Hi there,

according to the VCF Spec 4.2,

For one individual, each integer in the vector is organized as (allele + 1) << 1 | phased [...]

(emphasis mine)

However, looking at the implementation as well as some quick tests, it seems that the first genotype (e.g., the 0 in 0|1) never has the phased bit set, so that bcf_gt_is_phased() always returns false. It seems to me that at least for diploids, that does not make all too much sense, as knowing that the second genotype is phased also determines this for the first one, doesn't it?!

The VCF Spec however also has the example

0/1|2 is tetraploid with a single phased allele

In that case, it is not clear to me how to interpret phasing. Only the 2 is phased here, while the order of the 0 and 1 is not determined?

So, my questions here are:

  • Is it an oversight in the VCF standard that the first genotype cannot be marked as phased, or is that an oversight/bug in the implementation?
  • How would the poly-ploid case from above have to be handled/interpreted correctly?
  • Is there any "best practice" to test for phasing? Should I iterate all values for a given sample and output "phased" if all or if any of the bcf_gt_is_phased() calls return true?

Thanks and all the best Lucas

lczech avatar Jul 28 '20 03:07 lczech

Thank you for raising this issue. I am not a VCF maintainer and admit that I found this confusing too when I looked into it. However I see this very topic is in discussion already, with a proposal to fix it (possibly wrong? or exposing a difference in VCF and BCF?). I'm out of my depth really, but I'd recommend adding a users perspective to the https://github.com/samtools/hts-specs/pull/421 PR.

jkbonfield avatar Jul 28 '20 15:07 jkbonfield

Also, I note that "0 / 1 | 2" comment was there in the initial checkin of the initial VCF spec, so we have no more history to know where it came from and who authored it and whether this is what was intended. It was committed by @amarcket but maybe @pd3 or @lh3 know the origin of this too. Can the above named people cast their eyes over the PR mentioned above too?

jkbonfield avatar Jul 28 '20 16:07 jkbonfield

This was never explicitly defined. I am open to modifying the existing specification and htslib as mixed ploidy is not a widely used feature.

pd3 avatar Jul 29 '20 05:07 pd3

It does appear to be implicitly defined in BCF though as the spec indicates the | vs / affects the next value, not the previous one.

jkbonfield avatar Jul 29 '20 08:07 jkbonfield

Yes, and the examples 0|1 and 0/1|2 given in the specification are the only mention of this. The "phase follows the allele" notation seems more intuitive and VCF is the major format, an argument in favor of changing the specification. I am happy either way though, as long as the ambiguity is resolved.

pd3 avatar Jul 29 '20 08:07 pd3