myvariant.info

data inconsistency in dbsnp collection

Open erikyao opened this issue 2 years ago • 0 comments

Symptom

E.g. http://myvariant.info/v1/variant/chrY:g.100005T%3EG?fields=dbsnp returns

{
  "_id": "chrY:g.100005T>G",
  "_version": 1,
  "dbsnp": {
    ...
    "chrom": "X",
    ...
  }
}

where the chrom field is inconsistent with the chromosome in the _id.
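A quick way to spot affected documents is to compare the chromosome encoded in the _id with dbsnp.chrom. The following is only a sketch: the helper name chrom_mismatch and the regular expression are hypothetical, not part of the myvariant.info codebase, and it assumes the _id always starts with "chr<name>:".

```python
import re

# Hypothetical helper, not part of the myvariant.info codebase.
# Assumes HGVS-style ids of the form "chr<name>:g....".
_ID_CHROM = re.compile(r"^chr([^:]+):")


def chrom_mismatch(doc):
    """Return True if the chromosome in _id differs from dbsnp.chrom."""
    match = _ID_CHROM.match(doc.get("_id", ""))
    chrom = doc.get("dbsnp", {}).get("chrom")
    if match is None or chrom is None:
        return False  # nothing to compare
    return match.group(1) != str(chrom)


bad = {"_id": "chrY:g.100005T>G", "dbsnp": {"chrom": "X"}}
good = {"_id": "chrY:g.100005T>G", "dbsnp": {"chrom": "Y"}}
print(chrom_mismatch(bad), chrom_mismatch(good))
```

Run against the document returned by the URL above, such a check would flag it, because the _id says Y while dbsnp.chrom says X.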

Analysis

In fact, the whole dbsnp field above is incorrect, and the cause is a bug in the parse_one_rec function. Its for-loop was designed to yield multiple doc objects, but in effect ONLY ONE doc was ever created. The attributes of a previously yielded doc are modified by the next execution of the for-loop body. In the end, a bunch of shallow copies of that single doc object, rather than a bunch of independent doc objects, were written to MongoDB and thus ES.
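The pattern can be illustrated with a minimal sketch. This is not the actual parse_one_rec code: the function parse_buggy, its placements argument, and the field handling are hypothetical stand-ins that only mimic the shared-object behaviour described above.

```python
def parse_buggy(placements):
    """Hypothetical stand-in for the buggy loop, not the real parse_one_rec."""
    dbsnp = {"rsid": "rs1173046527"}  # created once, outside the loop
    for chrom, pos in placements:
        dbsnp["chrom"] = chrom  # mutates the one shared object
        dbsnp["hg19"] = {"start": pos, "end": pos}
        # A new top-level dict on each iteration, but every one of them
        # references the same nested dbsnp dict (a shallow copy in effect).
        yield {"_id": f"chr{chrom}:g.{pos}T>G", "dbsnp": dbsnp}


# The docs are collected first, as a batched writer would do.
docs = list(parse_buggy([("Y", 100005), ("X", 150005)]))
print(docs[0]["_id"], docs[0]["dbsnp"]["chrom"])
```

By the time the collected docs are written, the first one carries the chrY _id but the chrom of the last iteration, which matches the symptom above.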

Reproduction of the Bug

E.g. the record {'refsnp_id': '1173046527', ...} from the source file refsnp-chrY.json.bz2 is supposed to yield two documents:

[
  {
      '_id': 'chrY:g.100005T>G',
      'dbsnp': {
          'alleles': [
              {'freq': {'gnomad': 1.0, 'sgdp_prj': 0.0, 'dbgap_popfreq': 1.0}, 'allele': 'T'},
              {'freq': {'gnomad': 0.0, 'sgdp_prj': 1.0, 'dbgap_popfreq': 0.0}, 'allele': 'G'}
          ],
          'hg19': {'start': 100005, 'end': 100005},
          'vartype': 'snv',
          'rsid': 'rs1173046527',
          'dbsnp_build': 155,
          'chrom': 'Y',
          'ref': 'T',
          'alt': 'G'
      }
  },
  {
      '_id': 'chrX:g.150005T>G',
      'dbsnp': {
          'alleles': [
              {'freq': {'gnomad': 1.0, 'sgdp_prj': 0.0, 'dbgap_popfreq': 1.0}, 'allele': 'T'},
              {'freq': {'gnomad': 0.0, 'sgdp_prj': 1.0, 'dbgap_popfreq': 0.0}, 'allele': 'G'}
          ],
          'hg19': {'start': 150005, 'end': 150005},
          'vartype': 'snv',
          'rsid': 'rs1173046527',
          'dbsnp_build': 155,
          'chrom': 'X',
          'ref': 'T',
          'alt': 'G'
      }
  }
]

In practice, however, the dbsnp object of the latter document "overwrites" the one in the former document.

Fix

Create a deep copy of the doc object every time it is about to be yielded in the parse_one_rec function.
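A minimal sketch of the idea, using copy.deepcopy from the standard library. The function parse_fixed and its placements argument are hypothetical and only illustrate the pattern; the real parse_one_rec has a different signature and more fields.

```python
import copy


def parse_fixed(placements):
    """Hypothetical sketch of the fix, not the real parse_one_rec."""
    doc = {"dbsnp": {"rsid": "rs1173046527"}}  # still built once and reused
    for chrom, pos in placements:
        doc["_id"] = f"chr{chrom}:g.{pos}T>G"
        doc["dbsnp"]["chrom"] = chrom
        doc["dbsnp"]["hg19"] = {"start": pos, "end": pos}
        # Yield an independent snapshot, so that later iterations cannot
        # modify a doc that has already been handed to the caller.
        yield copy.deepcopy(doc)


fixed_docs = list(parse_fixed([("Y", 100005), ("X", 150005)]))
print([d["dbsnp"]["chrom"] for d in fixed_docs])
```

A deep copy costs more per yield than sharing the object; building a fresh dict inside the loop would be an alternative, assuming the loop body can be restructured that way.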

erikyao, Nov 30 '21 03:11