
`Submitter Names` lack spaces in `dataformat tsv virus-genome` compared to genbank file serialization

Open corneliusroemer opened this issue 3 months ago • 1 comment

Describe the bug

The `Submitter Names` field is serialized as `LAST NAME,FIRST NAME INITIALS,LAST NAME,FIRST NAME INITIALS...`, a format that is not very robust because nothing separates one individual from the next. Is this on purpose? If so, why?

When I look up the original GenBank file for a sequence, there is a space after the initials, before the next last name.

Compare output from

    datasets download virus genome taxon 186538 --no-progressbar --filename results/ncbi_dataset.zip
    dataformat tsv virus-genome --package results/ncbi_dataset.zip --fields submitter-names

for, e.g., OR084927 with what is shown in the corresponding .gb file.

CLI output: Kinganda-Lusamaki,E.,Whitmer,S.,Lokilo-Lofiko,E.,Amuri-Aziza,A.,Muyembe-Mawete,F.,Makangara-Cigolo,J.C.,...

GenBank file: Kinganda-Lusamaki,E., Whitmer,S., Lokilo-Lofiko,E., Amuri-Aziza,A., Muyembe-Mawete,F., Makangara-Cigolo,J.C.,

Note that the GenBank file separates names with whitespace, which is prudent: without it, a parser has to assume that the comma-separated tokens strictly alternate between last name and initials, and hope that this parity holds across long strings.
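To make the parity concern concrete, here is a minimal Python sketch. It is not part of `datasets` or `dataformat`; the function names `split_compact` and `split_spaced` are hypothetical, and the `Smith`/`Consortium`/`Doe` input is invented purely for illustration. It assumes the compact format always alternates last name and initials, which is exactly the assumption that can break.

```python
def split_compact(field: str) -> list[tuple[str, str]]:
    """Pair comma-separated tokens as (last name, initials).

    Assumes strict alternation; one entry without initials breaks the pairing.
    """
    tokens = [t for t in field.split(",") if t]
    if len(tokens) % 2 != 0:
        raise ValueError("odd number of tokens; cannot pair names with initials")
    return list(zip(tokens[0::2], tokens[1::2]))


def split_spaced(field: str) -> list[str]:
    """Split on the ', ' that separates individuals in the GenBank-style string."""
    return [name.rstrip(",").strip() for name in field.split(", ") if name.strip()]


if __name__ == "__main__":
    # Compact form, as in the CLI output quoted above (shortened).
    print(split_compact("Kinganda-Lusamaki,E.,Whitmer,S."))
    # Spaced form, as in the GenBank record quoted above (shortened).
    print(split_spaced("Kinganda-Lusamaki,E., Whitmer,S., Makangara-Cigolo,J.C.,"))
    # Hypothetical input: a group author without initials shifts the pairing.
    try:
        split_compact("Smith,J.,Consortium,Doe,A.")
    except ValueError as err:
        print("compact format is ambiguous:", err)
```

With the spaced form, each individual can be recovered without guessing which token is a last name and which is a set of initials.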

corneliusroemer, Mar 21 '24 18:03