PyVCF
0/. has numeric representation (gt_type) 1
Hello James, First of all, thank you for creating PyVCF. Secondly, I'm working on a manually filtered VCF file, so the error may be due to an inconsistency between that file and the VCF standard. That said, I don't think it is correct to numerically represent a genotype in which one allele is the reference and the other is unknown (0/.) as 1. I would represent it as either None or 0, with None being my preference.
I encountered this when using PyVCF version 0.6.0.
A response with your views on this would be greatly appreciated!
Thanks, Jasper
Yes, I think you are right.
Do you have a patch?
Slightly off-topic (I agree and vote for None), but does anyone know which variant callers actually generate GT values with the correct use of . (no call) as opposed to just using 0 (reference) for everything? Not just for heterozygous variants, but also for what should be ./. (actually, I'm not even sure 0/. is allowed at all).
No, I don't have a patch. I now have a workaround in which I check for the occurrence of a '.' in data.GT. If one is present, I interpret the genotype as a missing call. I would say that something similar could be implemented in the gt_type method.
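For illustration, a workaround along these lines could look roughly like the sketch below. It works on the raw GT string only and is not PyVCF's own code. The function name gt_type_with_missing is hypothetical, and the 0/1/2 encoding (hom ref, het, hom alt) is assumed to mirror what gt_type reports for fully called genotypes.

```python
import re


def gt_type_with_missing(gt):
    """Classify a raw GT string, treating any '.' allele as a missing call.

    Returns 0 (hom ref), 1 (het), 2 (hom alt), or None when the genotype
    is absent or contains at least one missing allele (e.g. '0/.', './.').
    Hypothetical helper, not part of PyVCF.
    """
    if gt is None:
        return None
    # Alleles are separated by '/' (unphased) or '|' (phased).
    alleles = re.split(r"[/|]", gt)
    if "." in alleles:
        return None  # partially or fully missing: no numeric type
    if all(a == "0" for a in alleles):
        return 0  # every allele is the reference
    if len(set(alleles)) == 1:
        return 2  # every allele is the same non-reference allele
    return 1  # alleles differ


for gt in ("0/0", "0/1", "1/1", "0/.", "./."):
    print(gt, gt_type_with_missing(gt))
```

With this, 0/. and ./. both come back as None instead of 1.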
I think this needs some more thinking. If we make a choice for 0/., what about 1/.? What does it even mean?
Reading the spec, I think . can only be used if no call could be made for the sample, so for diploid samples the only allowed form would be ./. (both alleles missing).
So I'm inclined to set _Call.called to False for all of these cases (0/., 1/., ./.). This would directly fix Jasper's issue, as well as some other properties like is_variant and is_het.
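To make the proposal concrete, here is a rough sketch of the intended semantics, assuming that uncalled genotypes should report None for the derived properties. SketchCall is a hypothetical stand-in class written for this comment; it is not the actual _Call implementation and only borrows the property names discussed here.

```python
import re


class SketchCall:
    """Hypothetical stand-in for a sample call; not PyVCF's _Call."""

    def __init__(self, gt):
        # Split on '/' (unphased) or '|' (phased); empty list if GT is absent.
        self.gt_alleles = re.split(r"[/|]", gt) if gt else []
        # Proposed rule: any missing allele makes the whole call uncalled.
        self.called = bool(self.gt_alleles) and "." not in self.gt_alleles

    @property
    def is_variant(self):
        if not self.called:
            return None  # unknown rather than True/False
        return any(a != "0" for a in self.gt_alleles)

    @property
    def is_het(self):
        if not self.called:
            return None  # unknown rather than True/False
        return len(set(self.gt_alleles)) > 1


for gt in ("0/.", "1/.", "./.", "0/1"):
    call = SketchCall(gt)
    print(gt, call.called, call.is_variant, call.is_het)
```

Under this rule 0/., 1/. and ./. are all treated the same way, so nothing downstream has to decide what a half-missing genotype means.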