Include more metadata from genbank files in virus reports, e.g. `/note` and `/strain`

Open corneliusroemer opened this issue 3 months ago • 1 comments

Quite frequently, valuable metadata is contained in the GenBank `/note` qualifier.

Unfortunately, this field seems to get lost on the way to `datasets download virus genome`.

Compare the metadata available in the GenBank file under the `source` feature:

FEATURES             Location/Qualifiers
     source          1..2408
                     /organism="Zaire ebolavirus"
                     /mol_type="genomic RNA"
                     /strain="Mayinga 1976"
                     /db_xref="taxon:186538"
                     /note="subtype: Zaire"

with what ends up in the `datasets download virus genome taxon` output:

{
  "accession": "U28077.1",
  "completeness": "PARTIAL",
  "isAnnotated": true,
  "length": 2408,
  "nucleotide": {
    "sequenceHash": "62B2211F"
  },
  "proteinCount": 2,
  "releaseDate": "1995-10-26T00:00:00Z",
  "sourceDatabase": "GenBank",
  "submitter": {
    "affiliation": "Anthony Sanchez, Special Pathogens Branch, Division of Viral and Rickettsial Diseases, Centers for Disease Control and Prevention, 1600 Clifton Road, Blgd. 15, Room SB611, Atlanta, GA 30333",
    "country": "USA",
    "names": [
      "Sanchez,A.",
      "Trappier,S.G.",
      "Mahy,B.W.",
      "Peters,C.J.",
      "Nichol,S.T."
    ]
  },
  "updateDate": "2002-08-28T00:00:00Z",
  "virus": {
    "organismName": "Zaire ebolavirus",
    "taxId": 186538
  }
}

Valuable information is lost:

  • /note="subtype: Zaire"
  • /strain="Mayinga 1976"
  • /mol_type="genomic RNA"
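
The comparison above can be automated. The following is a minimal sketch, assuming the `source` feature text and a trimmed copy of the report are available in memory. `parse_source_qualifiers`, `report_values` and `missing_qualifiers` are hypothetical helpers written for this illustration, not part of the `datasets` CLI, and the regex only handles single-line quoted qualifiers.

```python
import re

# The source feature from the GenBank record quoted above.
GENBANK_SOURCE = '''\
     source          1..2408
                     /organism="Zaire ebolavirus"
                     /mol_type="genomic RNA"
                     /strain="Mayinga 1976"
                     /db_xref="taxon:186538"
                     /note="subtype: Zaire"
'''

# Trimmed copy of the report quoted above (submitter and dates omitted).
REPORT = {
    "accession": "U28077.1",
    "completeness": "PARTIAL",
    "length": 2408,
    "sourceDatabase": "GenBank",
    "virus": {"organismName": "Zaire ebolavirus", "taxId": 186538},
}


def parse_source_qualifiers(text):
    """Return {qualifier: value} for single-line /key="value" qualifiers."""
    return dict(re.findall(r'^\s*/(\w+)="([^"]*)"\s*$', text, flags=re.MULTILINE))


def report_values(node):
    """Yield every scalar value in a nested dict/list as a string."""
    if isinstance(node, dict):
        for value in node.values():
            yield from report_values(value)
    elif isinstance(node, list):
        for item in node:
            yield from report_values(item)
    else:
        yield str(node)


def missing_qualifiers(genbank_text, report):
    """List qualifiers whose value does not appear anywhere in the report."""
    seen = set(report_values(report))
    missing = {}
    for key, value in parse_source_qualifiers(genbank_text).items():
        # db_xref values look like "taxon:186538"; compare the part after the colon.
        candidate = value.split(":", 1)[1] if key == "db_xref" else value
        if candidate not in seen:
            missing[key] = value
    return missing


print(missing_qualifiers(GENBANK_SOURCE, REPORT))
```

For this record the sketch reports `mol_type`, `strain` and `note` as missing, matching the list above. It compares values by exact string match only, so a qualifier that is passed through in a reworded form would still be flagged.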

This is probably not even a particularly good example; I can think of more important notes, but couldn't find an example just now.

It would be nice if all this metadata were passed through.

In fact, it might be a bug that molType is missing, as that is a field that should already be output per the schema here: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/data-reports/virus/
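
To check how widespread the missing field is, one could scan a batch of reports for the key. This is a rough sketch, assuming the reports are available as one JSON object per line and that the key is spelled `molType` as in this issue. `has_key` and `accessions_missing_key` are hypothetical helpers, and the second sample record is invented purely to show a positive case.

```python
import json


def has_key(node, key):
    """Return True if `key` occurs at any depth of a nested dict/list."""
    if isinstance(node, dict):
        return key in node or any(has_key(v, key) for v in node.values())
    if isinstance(node, list):
        return any(has_key(item, key) for item in node)
    return False


def accessions_missing_key(json_lines, key):
    """Return accessions of records (one JSON object per line) lacking `key`."""
    missing = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if not has_key(record, key):
            missing.append(record.get("accession"))
    return missing


sample = [
    # Abbreviated form of the report quoted in this issue.
    '{"accession": "U28077.1", "virus": {"organismName": "Zaire ebolavirus", "taxId": 186538}}',
    # Invented record, only to illustrate a report that does carry the key.
    '{"accession": "HYPOTHETICAL.1", "molType": "genomic RNA"}',
]
print(accessions_missing_key(sample, "molType"))
```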


corneliusroemer, Mar 21 '24 19:03