cyvcf2 Accessing GT field via format method

Hello,

Thanks for your effort on cyvcf2, this is a great project. I have a question regarding accessing GT field via format function. I can access genotypes using the genotypes attribute:

In [5]: variant.genotypes
Out[5]: [[0, 1, False], [0, 1, False]]

However, when I try to access it via 'format' method, here's what I get:

In [6]: variant.format('GT')
Out[6]: 
array(['\x02\x04', '\x02\x04'],
      dtype='|S2')

How do I interpret this output? I can access other fields via format method just fine (e.g. I get a human-friendly output). I wonder if there's something obvious that I'm missing.

Any help is much appreciated. Thanks in advance!

Jul 31 '17 11:07 tetianakh

(as you saw) The GT field is encoded. So accessing it directly is not currently useful. I suppose I could special-case it.

variant.format('GT') could return e..g ['0/1', '0/1'], but that's almost never useful.

I think the most pythonic thing would be to return an object or array that allowed intuitive access to genotype, phase, ploidy, but I never settled on a good representation of that.

I'm open to suggestions. What do you need out of that field and can you get it via variant.genotypes ?

Jul 31 '17 13:07 brentp

Hi @brentp, thanks for your answer. I'm writing an app for querying VCF files and calculating statistics on variants selected by queries. I'd prefer to have a uniform access to all fields, just because every special case makes the code more complex (i.e. I have to write code like "access a genotype field using format function, unless it's a GT field, in which case use genotypes attribute).

A namedtuple with genotype, phase and ploidy would be the most convenient, but I can just take the data from variant.genotypes, that's not a big deal.

I am not a bioinformatician, so I was wondering whether my example VCF files are malformed or something else is wrong on my side so that the GT field is not decoded.

Aug 07 '17 08:08 tetianakh

Has there been any update to accessing the GT values? I am still getting:

print(variant.format('GT'))
['' '' '' '\x04\x04' '\x02\x02' '\x02\x02' '\x02\x02']

# this one is just too difficult to read
print(variant.genotypes)
[[-1, -1, False], [-1, -1, False], [-1, -1, False], [1, 1, False], [0, 0, False], [0, 0, False], [0, 0, False]]

Feb 20 '18 02:02 everestial

So variant.format('GT') returns a string because that's how these are encoded.

You can get a more ergonomic view using:

gts = variant.genotypes

class Genotype(object):
    __slots__ = ('alleles', 'phased')

    def __init__(self, li):
        self.alleles = li[:-1]
        self.phased = li[-1]

    def __str__(self):
        sep = "/|"[int(self.phased)]
        return sep.join("0123."[a] for a in self.alleles)
    __repr__ = __str__

genotypes = [Genotype(li) for li in gts]
print genotypes

which shows:

[./., ./., ./., 1/1, 0/0, 0/0, 0/0]

This is the advantage of the variant.gentoypes representation. It allows you to write your own methods around it. The Genotype class above is 8 or 9 lines of code.

Eventually, I may put a nicer representation into cyvcf2, but there I have to balance speed and design. The variant.genotypes is pretty fast...

Feb 20 '18 16:02 brentp

That class really is a nice representation. I think the method to mine the genotype_bases can also be added to it. I am writing a python tool to simplify the vcf file. For now I am just going to add the class to it. But, will have to remove it after it builds into cyvcf2.

Feb 25 '18 15:02 everestial

cyvcf2 cyvcf2 copied to clipboard

Accessing GT field via format method

cyvcf2
cyvcf2 copied to clipboard