cyvcf2
cyvcf2 copied to clipboard
Accessing GT field via format method
Hello,
Thanks for your effort on cyvcf2, this is a great project.
I have a question regarding accessing GT field via format
function. I can access genotypes using the genotypes
attribute:
In [5]: variant.genotypes
Out[5]: [[0, 1, False], [0, 1, False]]
However, when I try to access it via 'format' method, here's what I get:
In [6]: variant.format('GT')
Out[6]:
array(['\x02\x04', '\x02\x04'],
dtype='|S2')
How do I interpret this output? I can access other fields via format
method just fine (e.g. I get a human-friendly output). I wonder if there's something obvious that I'm missing.
Any help is much appreciated. Thanks in advance!
(as you saw) The GT
field is encoded. So accessing it directly is not currently useful. I suppose I could special-case it.
variant.format('GT')
could return e..g ['0/1', '0/1'], but that's almost never useful.
I think the most pythonic thing would be to return an object or array that allowed intuitive access to genotype, phase, ploidy, but I never settled on a good representation of that.
I'm open to suggestions. What do you need out of that field and can you get it via variant.genotypes
?
Hi @brentp, thanks for your answer. I'm writing an app for querying VCF files and calculating statistics on variants selected by queries. I'd prefer to have a uniform access to all fields, just because every special case makes the code more complex (i.e. I have to write code like "access a genotype field using format
function, unless it's a GT
field, in which case use genotypes
attribute).
A namedtuple with genotype
, phase
and ploidy
would be the most convenient, but I can just take the data from variant.genotypes
, that's not a big deal.
I am not a bioinformatician, so I was wondering whether my example VCF files are malformed or something else is wrong on my side so that the GT
field is not decoded.
Has there been any update to accessing the GT
values? I am still getting:
print(variant.format('GT'))
['' '' '' '\x04\x04' '\x02\x02' '\x02\x02' '\x02\x02']
# this one is just too difficult to read
print(variant.genotypes)
[[-1, -1, False], [-1, -1, False], [-1, -1, False], [1, 1, False], [0, 0, False], [0, 0, False], [0, 0, False]]
So variant.format('GT')
returns a string because that's how these are encoded.
You can get a more ergonomic view using:
gts = variant.genotypes
class Genotype(object):
__slots__ = ('alleles', 'phased')
def __init__(self, li):
self.alleles = li[:-1]
self.phased = li[-1]
def __str__(self):
sep = "/|"[int(self.phased)]
return sep.join("0123."[a] for a in self.alleles)
__repr__ = __str__
genotypes = [Genotype(li) for li in gts]
print genotypes
which shows:
[./., ./., ./., 1/1, 0/0, 0/0, 0/0]
This is the advantage of the variant.gentoypes
representation. It allows you to write your own methods around it. The Genotype
class above is 8 or 9 lines of code.
Eventually, I may put a nicer representation into cyvcf2, but there I have to balance speed and design. The variant.genotypes
is pretty fast...
That class
really is a nice representation. I think the method to mine the genotype_bases
can also be added to it.
I am writing a python tool to simplify the vcf file. For now I am just going to add the class to it. But, will have to remove it after it builds into cyvcf2
.