Include more metadata from genbank files in virus reports, e.g. `/note` and `/strain`

Open corneliusroemer opened this issue 1 year ago • 1 comments

Quite frequently, valuable metadata is contained in the GenBank file qualifier `/note`.

Unfortunately, this field seems to get lost on the way to `datasets download virus genome`.

Compare the metadata available in the GenBank file under the `source` feature:

FEATURES             Location/Qualifiers
     source          1..2408
                     /organism="Zaire ebolavirus"
                     /mol_type="genomic RNA"
                     /strain="Mayinga 1976"
                     /db_xref="taxon:186538"
                     /note="subtype: Zaire"

with what ends up in the output of `datasets download virus genome taxon`:

{
  "accession": "U28077.1",
  "completeness": "PARTIAL",
  "isAnnotated": true,
  "length": 2408,
  "nucleotide": {
    "sequenceHash": "62B2211F"
  },
  "proteinCount": 2,
  "releaseDate": "1995-10-26T00:00:00Z",
  "sourceDatabase": "GenBank",
  "submitter": {
    "affiliation": "Anthony Sanchez, Special Pathogens Branch, Division of Viral and Rickettsial Diseases, Centers for Disease Control and Prevention, 1600 Clifton Road, Blgd. 15, Room SB611, Atlanta, GA 30333",
    "country": "USA",
    "names": [
      "Sanchez,A.",
      "Trappier,S.G.",
      "Mahy,B.W.",
      "Peters,C.J.",
      "Nichol,S.T."
    ]
  },
  "updateDate": "2002-08-28T00:00:00Z",
  "virus": {
    "organismName": "Zaire ebolavirus",
    "taxId": 186538
  }
}

Valuable information is lost:

  • /note="subtype: Zaire"
  • /strain="Mayinga 1976"
  • /mol_type="genomic RNA"
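
As a stopgap, the three qualifiers listed above can be read straight from a GenBank flat file. The following is a minimal sketch, not part of NCBI Datasets: the function name `source_qualifiers` and the input filename are hypothetical, and it assumes each qualifier value fits on a single line (wrapped values are not handled).

```shell
# Print selected qualifiers of the first `source` feature in a GenBank flat file
# as tab-separated key/value pairs.
# Assumption: each qualifier value sits on one line (no wrapped continuation lines).
source_qualifiers() {
  awk '
    /^     source / { in_source = 1; next }   # feature key at column 6 starts the block
    in_source && /^     [^ ]/ { exit }        # the next feature key ends it
    in_source && /^[^ ]/ { exit }             # so does the next top-level section
    in_source && /^ +\/(note|strain|mol_type)=/ {
      line = $0
      sub(/^ +\//, "", line)                  # drop indentation and the leading slash
      eq = index(line, "=")
      key = substr(line, 1, eq - 1)
      value = substr(line, eq + 1)
      gsub(/^"|"$/, "", value)                # strip the surrounding quotes
      printf "%s\t%s\n", key, value
    }
  ' "$1"
}
```

Called as `source_qualifiers U28077.1.gb` on a file containing the record above, this should print `mol_type`, `strain` and `note` with their values, one pair per line, in the order they appear in the record.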

This is probably not even a very good example. I can think of more important notes but couldn't find one just now.

It would be nice if all this metadata were passed through.

In fact, it might be a bug that molType is missing, as that is a field that should already be output per the schema here: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/data-reports/virus/

corneliusroemer avatar Mar 21 '24 19:03 corneliusroemer

Hi corneliusroemer,

Thank you for your suggestions. We are currently reviewing your metadata requests in collaboration with the NCBI Virus team. We will resolve any issues on our end. However, some metadata requests might require coordination with the NCBI Virus team. I will update you once we start working on this.

All the best,

Nuala

Nuala A. O'Leary, PhD
Product Owner, NCBI Datasets
National Center for Biotechnology Information, NLM, NIH, DHHS

olearyna avatar Mar 22 '24 00:03 olearyna

Hi corneliusroemer,

I discussed your request with the NCBI Virus group. There are no current plans to pull data from the /note section of the GenBank record, but they will look into it. Any updates they make will be picked up by NCBI Datasets. You can contact the NCBI Virus group through the general NCBI feedback form: https://support.nlm.nih.gov/support/create-case/

Thanks, Nuala

olearyna avatar Apr 08 '24 13:04 olearyna

Any news on exposing `/mol_type` as "molType"? Or are there other ways to infer it from taxonomy data? I'd hate to be forced to download the GenBank format as well in the future...

dandaman avatar Jun 26 '24 14:06 dandaman

Hi dandaman,

We don't have moltype in the virus report yet, but you can get it from the taxonomy data report for any tax ID.

Here is the command using dataformat to get the taxid from the virus report:

datasets summary virus genome accession U28077.1 --as-json-lines | dataformat tsv virus-genome --fields virus-tax-id --elide-header
186538

Here is the command to get the moltype from the taxonomy report using jq (the filter is quoted so the shell does not try to expand `[]`):

datasets summary taxonomy taxon 186538 | jq -r '.reports[].taxonomy.genomic_moltype'
ssRNA(-)
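
The two commands above can be chained into one helper. This is a minimal sketch: the function name `virus_moltype` is hypothetical, it uses only the subcommands and flags shown in this thread, and it assumes `datasets`, `dataformat` and `jq` are on the PATH.

```shell
# Look up the genomic molecule type for a virus accession by chaining the
# virus report (accession -> tax ID) and the taxonomy report (tax ID -> moltype).
# Assumption: `datasets`, `dataformat` and `jq` are installed and on PATH.
virus_moltype() {
  accession=$1
  # Step 1: read the taxonomy ID from the virus report
  taxid=$(datasets summary virus genome accession "$accession" --as-json-lines |
    dataformat tsv virus-genome --fields virus-tax-id --elide-header) || return 1
  if [ -z "$taxid" ]; then
    echo "no tax id found for $accession" >&2
    return 1
  fi
  # Step 2: read the molecule type from the taxonomy report
  datasets summary taxonomy taxon "$taxid" |
    jq -r '.reports[].taxonomy.genomic_moltype'
}
```

For example, `virus_moltype U28077.1` runs both lookups for the accession discussed in this issue.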

Let me know if you have any questions.

Nuala

olearyna avatar Jun 26 '24 15:06 olearyna

Dear @olearyna,

that is perfect, thank you :-)

Best, Daniel

dandaman avatar Jun 27 '24 05:06 dandaman