cyvcf2 icon indicating copy to clipboard operation
cyvcf2 copied to clipboard

Issue querying db's made with latest cyvcf2

Open matthdsm opened this issue 2 years ago • 5 comments

Hi @brentp,

I've come across a weird issue. I've updated one of our bcbio installations, resulting in an environment with the latest vcf2db + cyvcf2=0.30.14. When I try to query one of the db's generated with this setup using the gemini python API, I get a funky result when fetching genotypes.

when I try to print the gts field from a table row (gemini.GeminiQuery.GeminiRow) I get the following numpy array

["T" "" "" "T" "" "" "C" "" "" "/" "" "" "T" "" "" "T" "" "" "C" "" "" "T" "" "" "/" "" "" "T" "" "" "T" "" "" "C" "" "" "" "" "" "" "" "" "" "" "T" "" "" "/" "" "" "T" "" "" "T" "" "" "C" "" "" "" "" "" "" "" "" "" "" ""] which should show

["TTC/TTC","T/TTC","T/TTC"] e.g. the genotypes for three individuals. This is the case for our older installs running cyvcf2=0.20.9.

I suppose this error may have something to do with https://github.com/brentp/cyvcf2/issues/227. When downgrading cyvcf2 to the older version and regenerating the db, everything seems to work again.

Any thoughts? M

PS, it seems others have also run into similar issues: https://github.com/chapmanb/cloudbiolinux/blob/master/contrib/flavor/ngs_pipeline_minimal/packages-conda.yaml#L354

xpost from https://github.com/quinlan-lab/vcf2db/issues/69

matthdsm avatar Mar 29 '22 10:03 matthdsm

Hi Matthias, can you share a VCF with these 2 rows?

brentp avatar Mar 29 '22 10:03 brentp

Sure,

here's an excerpt with the 10 first variants from that file. tmp.vcf.txt

matthdsm avatar Mar 29 '22 10:03 matthdsm

with this script:

import cyvcf2

print(f"version: {cyvcf2.__version__}")
for v in cyvcf2.VCF("tmp.vcf.gz"):
    print(v.gt_bases)

run on that file, I see:

version: 0.30.15
['AC|AC' '*|*' 'AC|.']
['AC|AC' './.' 'AC|A']
['./.' 'A/C' './.']
['./.' 'T/G' 'T/T']
['G/G' 'G/A' 'G/A']
['T/T' './.' 'T/A']
['C/T' './.' 'C/T']
['T/G' 'T/G' 'T/G']
['C/T' 'C/T' 'C/T']
['A/G' './.' 'A/A']

which seems to match what I expect. Can you verify that you see the same? If so, then it must be something elsewhere in vcf2db, or in how vcf2db is interacting with cyvcf2.

Perhaps it's since we updated how cyvcf2 is built?

brentp avatar Mar 29 '22 10:03 brentp

Hi Brent

I'm getting the following results.

py2.7

cyvcf2 version: 0.30.15
numpy version: 1.16.5
[u'AC|AC' u'*|*' u'AC|.']
[u'AC|AC' u'./.' u'AC|A']
[u'./.' u'A/C' u'./.']
[u'./.' u'T/G' u'T/T']
[u'G/G' u'G/A' u'G/A']
[u'T/T' u'./.' u'T/A']
[u'C/T' u'./.' u'C/T']
[u'T/G' u'T/G' u'T/G']
[u'C/T' u'C/T' u'C/T']
[u'A/G' u'./.' u'A/A']

and

cyvcf2 version: 0.20.9
numpy version: 1.16.5
['AC|AC' '*|*' 'AC|.']
['AC|AC' './.' 'AC|A']
['./.' 'A/C' './.']
['./.' 'T/G' 'T/T']
['G/G' 'G/A' 'G/A']
['T/T' './.' 'T/A']
['C/T' './.' 'C/T']
['T/G' 'T/G' 'T/G']
['C/T' 'C/T' 'C/T']
['A/G' './.' 'A/A']

py3

cyvcf2 version: 0.30.15
numpy version: 1.22.3
['AC|AC' '*|*' 'AC|.']
['AC|AC' './.' 'AC|A']
['./.' 'A/C' './.']
['./.' 'T/G' 'T/T']
['G/G' 'G/A' 'G/A']
['T/T' './.' 'T/A']
['C/T' './.' 'C/T']
['T/G' 'T/G' 'T/G']
['C/T' 'C/T' 'C/T']
['A/G' './.' 'A/A']

It seems to me the latest version encodes the strings in unicode on python2, whereas the older version did not. I can imagine thats the cause of the issues downstream.

I know it's kind of stupid to still be using python2, but that's the way it currently configured in the bcbio environment.

Ping @naumenko-sa. Sergey do you remember the reason as to why cyvcf2/vcf2db hadn't been moved to the main environment? I know we talked about it somewhere, but I don't seem to find it right now.

Matthias

matthdsm avatar Mar 29 '22 11:03 matthdsm

I suppose it's a numpy issue https://github.com/brentp/cyvcf2/blob/5a3d046cc9e47ea5b6ae3dd980a8d9ad63a77623/cyvcf2/cyvcf2.pyx#L1183 vs https://github.com/brentp/cyvcf2/blob/4bafcf9ff3bfa2305672100d90e72df8d9388cf7/cyvcf2/cyvcf2.pyx#L1177

xref: https://github.com/brentp/cyvcf2/issues/191

matthdsm avatar Mar 29 '22 12:03 matthdsm