datasets
Include more metadata from genbank files in virus reports, e.g. `/note` and `/strain`
Quite frequently, valuable metadata is contained in the GenBank file field `/note`.
Unfortunately, this field seems to get lost on the way to `datasets download virus genome`.
Compare the metadata available in the GenBank file under the `source` feature:
FEATURES Location/Qualifiers
source 1..2408
/organism="Zaire ebolavirus"
/mol_type="genomic RNA"
/strain="Mayinga 1976"
/db_xref="taxon:186538"
/note="subtype: Zaire"
with what ends up in the output of `datasets download virus genome taxon`:
{
"accession": "U28077.1",
"completeness": "PARTIAL",
"isAnnotated": true,
"length": 2408,
"nucleotide": {
"sequenceHash": "62B2211F"
},
"proteinCount": 2,
"releaseDate": "1995-10-26T00:00:00Z",
"sourceDatabase": "GenBank",
"submitter": {
"affiliation": "Anthony Sanchez, Special Pathogens Branch, Division of Viral and Rickettsial Diseases, Centers for Disease Control and Prevention, 1600 Clifton Road, Blgd. 15, Room SB611, Atlanta, GA 30333",
"country": "USA",
"names": [
"Sanchez,A.",
"Trappier,S.G.",
"Mahy,B.W.",
"Peters,C.J.",
"Nichol,S.T."
]
},
"updateDate": "2002-08-28T00:00:00Z",
"virus": {
"organismName": "Zaire ebolavirus",
"taxId": 186538
}
}
Valuable information is lost:
- `/note="subtype: Zaire"`
- `/strain="Mayinga 1976"`
- `/mol_type="genomic RNA"`
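To make the comparison reproducible, here is a minimal Python sketch that checks which `source` qualifiers have no matching value anywhere in the report. It only uses the standard library; the function names (`parse_qualifiers`, `leaf_values`, `missing_qualifiers`) are made up for illustration, the report is a trimmed copy of the JSON above, and the regex assumes each qualifier fits on a single line, which is not true of GenBank files in general.

```python
import json
import re

# Source-feature qualifiers copied from the GenBank record above.
GENBANK_SOURCE = '''
/organism="Zaire ebolavirus"
/mol_type="genomic RNA"
/strain="Mayinga 1976"
/db_xref="taxon:186538"
/note="subtype: Zaire"
'''

# Trimmed copy of the datasets report shown above.
REPORT = json.loads('''
{
  "accession": "U28077.1",
  "completeness": "PARTIAL",
  "length": 2408,
  "sourceDatabase": "GenBank",
  "virus": {"organismName": "Zaire ebolavirus", "taxId": 186538}
}
''')


def parse_qualifiers(text):
    """Return {qualifier: value} for lines of the form /key="value"."""
    return dict(re.findall(r'^\s*/(\w+)="([^"]*)"', text, flags=re.MULTILINE))


def leaf_values(node):
    """Yield every scalar value in a nested JSON structure as a string."""
    if isinstance(node, dict):
        for value in node.values():
            yield from leaf_values(value)
    elif isinstance(node, list):
        for value in node:
            yield from leaf_values(value)
    else:
        yield str(node)


def missing_qualifiers(genbank_text, report):
    """List qualifiers whose value does not appear anywhere in the report."""
    present = set(leaf_values(report))
    missing = []
    for key, value in parse_qualifiers(genbank_text).items():
        # db_xref values look like "taxon:186538"; compare only the ID part.
        needle = value.split(":", 1)[1] if key == "db_xref" else value
        if needle not in present:
            missing.append(key)
    return missing


print(missing_qualifiers(GENBANK_SOURCE, REPORT))
# prints ['mol_type', 'strain', 'note']
```

The match is on exact values only, so a field that is passed through under a different spelling or vocabulary would still be reported as missing.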
This is probably not even a very good example. I can think of more important notes, but I couldn't find an example just now.
It would be nice if all this metadata were passed through.
In fact, it might be a bug that `molType` is missing, as that is a field that should already be output per the schema here: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/data-reports/virus/
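A quick way to see how widespread the missing field is would be to scan a report in JSON Lines form and list the records that lack it. This is only a sketch: the helper name `accessions_missing_field` is hypothetical, it assumes `molType` is a top-level key of each record, and the second sample record (`EXAMPLE0001.1`) is invented purely to show a record that passes the check.

```python
import io
import json


def accessions_missing_field(lines, field="molType"):
    """Return accessions of JSON Lines records that lack a top-level `field`."""
    missing = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if field not in record:
            missing.append(record.get("accession", "<unknown>"))
    return missing


# First record mirrors the report above; the second is a made-up placeholder.
sample = io.StringIO(
    '{"accession": "U28077.1", "length": 2408}\n'
    '{"accession": "EXAMPLE0001.1", "molType": "RNA"}\n'
)
print(accessions_missing_field(sample))
# prints ['U28077.1']
```

Any iterable of lines works as input, so an open file handle for a downloaded report could be passed in place of the `io.StringIO` sample.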