myvariant.info
data inconsistency in dbsnp collection
Symptom
E.g. http://myvariant.info/v1/variant/chrY:g.100005T%3EG?fields=dbsnp returns
{
"_id": "chrY:g.100005T>G",
"_version": 1,
"dbsnp": {
...
"chrom": "X",
...
}
}
where the chrom field is inconsistent with the chromosome in the _id.
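A quick way to spot documents with this symptom is to compare the chromosome encoded in the _id with dbsnp.chrom. The sketch below is only an illustration: the helper names chrom_from_id and is_consistent are hypothetical, and it assumes every _id follows the chr<chrom>:g.<...> pattern seen above.

```python
import re

def chrom_from_id(hgvs_id):
    """Extract the chromosome from an _id such as 'chrY:g.100005T>G' (assumed format)."""
    match = re.match(r"chr([^:]+):", hgvs_id)
    return match.group(1) if match else None

def is_consistent(doc):
    """True when the chromosome in _id equals dbsnp.chrom."""
    return chrom_from_id(doc["_id"]) == doc.get("dbsnp", {}).get("chrom")

# The document from the symptom above, reduced to the relevant fields.
bad_doc = {"_id": "chrY:g.100005T>G", "dbsnp": {"chrom": "X"}}
good_doc = {"_id": "chrX:g.150005T>G", "dbsnp": {"chrom": "X"}}

print(is_consistent(bad_doc), is_consistent(good_doc))
```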
Analysis
In fact, the whole dbsnp field above is incorrect, and the cause is a bug in the parse_one_rec function. Its for-loop was designed to yield multiple doc objects, but in effect ONLY ONE doc was ever created. The attributes of a previously yielded doc are modified by the next execution of the for-loop body. In the end, a bunch of shallow copies of the single doc object, instead of a bunch of independent doc objects, were written to MongoDB and thus Elasticsearch (ES).
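The aliasing described above can be reproduced in a few lines. This is a simplified stand-in, not the actual parse_one_rec code: the function name parse_one_rec_buggy, its placements argument, and the loop body are assumptions made for illustration. It only shows how a shallow copy keeps sharing the nested dbsnp object.

```python
def parse_one_rec_buggy(placements):
    """Hypothetical stand-in for the buggy loop: a single doc is reused."""
    doc = {"dbsnp": {"rsid": "rs1173046527"}}
    for chrom, pos in placements:
        # Mutates the one shared nested object on every iteration.
        doc["dbsnp"]["chrom"] = chrom
        doc["dbsnp"]["hg19"] = {"start": pos, "end": pos}
        # dict(doc) is a shallow copy: a new top-level dict,
        # but "dbsnp" still points to the same nested dict.
        shallow = dict(doc)
        shallow["_id"] = f"chr{chrom}:g.{pos}T>G"
        yield shallow

# Collect everything first, as a batched writer would.
docs = list(parse_one_rec_buggy([("Y", 100005), ("X", 150005)]))

for d in docs:
    print(d["_id"], d["dbsnp"]["chrom"])
```

Both documents end up reporting the chromosome of the last placement, because docs[0]["dbsnp"] and docs[1]["dbsnp"] are the same object.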
Reproduction of the Bug
E.g. the record {'refsnp_id': '1173046527', ...} from source file refsnp-chrY.json.bz2 is supposed to yield two documents:
[
{
'_id': 'chrY:g.100005T>G',
'dbsnp': {
'alleles': [
{'freq': {'gnomad': 1.0, 'sgdp_prj': 0.0, 'dbgap_popfreq': 1.0}, 'allele': 'T'},
{'freq': {'gnomad': 0.0, 'sgdp_prj': 1.0, 'dbgap_popfreq': 0.0}, 'allele': 'G'}
],
'hg19': {'start': 100005, 'end': 100005},
'vartype': 'snv',
'rsid': 'rs1173046527',
'dbsnp_build': 155,
'chrom': 'Y',
'ref': 'T',
'alt': 'G'
}
},
{
'_id': 'chrX:g.150005T>G',
'dbsnp': {
'alleles': [
{'freq': {'gnomad': 1.0, 'sgdp_prj': 0.0, 'dbgap_popfreq': 1.0}, 'allele': 'T'},
{'freq': {'gnomad': 0.0, 'sgdp_prj': 1.0, 'dbgap_popfreq': 0.0}, 'allele': 'G'}
],
'hg19': {'start': 150005, 'end': 150005},
'vartype': 'snv',
'rsid': 'rs1173046527',
'dbsnp_build': 155,
'chrom': 'X',
'ref': 'T',
'alt': 'G'
}
}
]
However, in practice the dbsnp object of the latter document "overwrites" the dbsnp object of the former document.
Fix
Create a deep copy of the doc object every time it is about to be yielded in the parse_one_rec function.
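A minimal sketch of this fix, using copy.deepcopy from the Python standard library. As before, parse_one_rec_fixed and its placements argument are hypothetical simplifications and not the real function signature. The point is only that each yielded document owns its nested dbsnp object.

```python
import copy

def parse_one_rec_fixed(placements):
    """Hypothetical stand-in for the fixed loop: yield a deep copy each time."""
    doc = {"dbsnp": {"rsid": "rs1173046527"}}
    for chrom, pos in placements:
        doc["_id"] = f"chr{chrom}:g.{pos}T>G"
        doc["dbsnp"]["chrom"] = chrom
        doc["dbsnp"]["hg19"] = {"start": pos, "end": pos}
        # deepcopy duplicates nested dicts too, so later iterations
        # cannot modify documents that were already yielded.
        yield copy.deepcopy(doc)

fixed_docs = list(parse_one_rec_fixed([("Y", 100005), ("X", 150005)]))

for d in fixed_docs:
    print(d["_id"], d["dbsnp"]["chrom"])
```

An alternative with the same effect would be to build a fresh doc inside the loop body. deepcopy is shown here because it is the change this issue proposes.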