cyvcf2
cyvcf2 copied to clipboard
Issue querying db's made with latest cyvcf2
Hi @brentp,
I've come across a weird issue. I've updated one of our bcbio installations, resulting in an environment with the latest vcf2db + cyvcf2=0.30.14. When I try to query one of the db's generated with this setup using the gemini python API, I get a funky result when fetching genotypes.
when I try to print the gts field from a table row (gemini.GeminiQuery.GeminiRow) I get the following numpy array
["T" "" "" "T" "" "" "C" "" "" "/" "" "" "T" "" "" "T" "" "" "C" "" "" "T" "" "" "/" "" "" "T" "" "" "T" "" "" "C" "" "" "" "" "" "" "" "" "" "" "T" "" "" "/" "" "" "T" "" "" "T" "" "" "C" "" "" "" "" "" "" "" "" "" "" ""] which should show
["TTC/TTC","T/TTC","T/TTC"] e.g. the genotypes for three individuals. This is the case for our older installs running cyvcf2=0.20.9.
I suppose this error may have something to do with https://github.com/brentp/cyvcf2/issues/227. When downgrading cyvcf2 to the older version and regenerating the db, everything seems to work again.
Any thoughts? M
PS, it seems others have also run into similar issues: https://github.com/chapmanb/cloudbiolinux/blob/master/contrib/flavor/ngs_pipeline_minimal/packages-conda.yaml#L354
xpost from https://github.com/quinlan-lab/vcf2db/issues/69
Hi Matthias, can you share a VCF with these 2 rows?
with this script:
import cyvcf2
print(f"version: {cyvcf2.__version__}")
for v in cyvcf2.VCF("tmp.vcf.gz"):
print(v.gt_bases)
run on that file, I see:
version: 0.30.15
['AC|AC' '*|*' 'AC|.']
['AC|AC' './.' 'AC|A']
['./.' 'A/C' './.']
['./.' 'T/G' 'T/T']
['G/G' 'G/A' 'G/A']
['T/T' './.' 'T/A']
['C/T' './.' 'C/T']
['T/G' 'T/G' 'T/G']
['C/T' 'C/T' 'C/T']
['A/G' './.' 'A/A']
which seems to match what I expect. Can you verify that you see the same? If so, then it must be something elsewhere in vcf2db, or in how vcf2db is interacting with cyvcf2.
Perhaps it's since we updated how cyvcf2 is built?
Hi Brent
I'm getting the following results.
py2.7
cyvcf2 version: 0.30.15
numpy version: 1.16.5
[u'AC|AC' u'*|*' u'AC|.']
[u'AC|AC' u'./.' u'AC|A']
[u'./.' u'A/C' u'./.']
[u'./.' u'T/G' u'T/T']
[u'G/G' u'G/A' u'G/A']
[u'T/T' u'./.' u'T/A']
[u'C/T' u'./.' u'C/T']
[u'T/G' u'T/G' u'T/G']
[u'C/T' u'C/T' u'C/T']
[u'A/G' u'./.' u'A/A']
and
cyvcf2 version: 0.20.9
numpy version: 1.16.5
['AC|AC' '*|*' 'AC|.']
['AC|AC' './.' 'AC|A']
['./.' 'A/C' './.']
['./.' 'T/G' 'T/T']
['G/G' 'G/A' 'G/A']
['T/T' './.' 'T/A']
['C/T' './.' 'C/T']
['T/G' 'T/G' 'T/G']
['C/T' 'C/T' 'C/T']
['A/G' './.' 'A/A']
py3
cyvcf2 version: 0.30.15
numpy version: 1.22.3
['AC|AC' '*|*' 'AC|.']
['AC|AC' './.' 'AC|A']
['./.' 'A/C' './.']
['./.' 'T/G' 'T/T']
['G/G' 'G/A' 'G/A']
['T/T' './.' 'T/A']
['C/T' './.' 'C/T']
['T/G' 'T/G' 'T/G']
['C/T' 'C/T' 'C/T']
['A/G' './.' 'A/A']
It seems to me the latest version encodes the strings in unicode on python2, whereas the older version did not. I can imagine thats the cause of the issues downstream.
I know it's kind of stupid to still be using python2, but that's the way it currently configured in the bcbio environment.
Ping @naumenko-sa. Sergey do you remember the reason as to why cyvcf2/vcf2db hadn't been moved to the main environment? I know we talked about it somewhere, but I don't seem to find it right now.
Matthias
I suppose it's a numpy issue https://github.com/brentp/cyvcf2/blob/5a3d046cc9e47ea5b6ae3dd980a8d9ad63a77623/cyvcf2/cyvcf2.pyx#L1183 vs https://github.com/brentp/cyvcf2/blob/4bafcf9ff3bfa2305672100d90e72df8d9388cf7/cyvcf2/cyvcf2.pyx#L1177
xref: https://github.com/brentp/cyvcf2/issues/191