datasets
`Submitter Names` lack spaces in `dataformat tsv virus-genome` compared to genbank file serialization
Describe the bug
The `Submitter Names` field has a not-very-robust serialization format of `LAST NAME,FIRST NAME INITIALS,LAST NAME,FIRST NAME INITIALS...` that does not separate individuals. Is this on purpose? If so, why?
When I look up the original GenBank file for a sequence, there is a space after the initials, before the next last name.
Compare output from
datasets download virus genome taxon 186538 --no-progressbar --filename results/ncbi_dataset.zip
dataformat tsv virus-genome --package results/ncbi_dataset.zip --fields submitter-names
for e.g. OR084927 with what's shown for the corresponding `.gb` file.
CLI output: Kinganda-Lusamaki,E.,Whitmer,S.,Lokilo-Lofiko,E.,Amuri-Aziza,A.,Muyembe-Mawete,F.,Makangara-Cigolo,J.C.,...
Genbank file: Kinganda-Lusamaki,E., Whitmer,S., Lokilo-Lofiko,E., Amuri-Aziza,A., Muyembe-Mawete,F., Makangara-Cigolo,J.C.,
Note that the GenBank file separates names with a space, which is prudent: without it, a consumer has to pair up the comma-separated tokens two at a time and hope that the parity holds across long strings.
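To illustrate the parity concern, here is a minimal Python sketch of how a consumer might parse each format. The function names `split_by_parity` and `split_by_separator` are hypothetical, and the `SomeConsortium` input is made up for illustration (it is not taken from any real record). The sketch assumes each person is written as `LAST,INITIALS` and does not handle other conventions a GenBank file may use.

```python
from __future__ import annotations


def split_by_parity(field: str) -> list[tuple[str, str]]:
    """Parse the comma-only form by pairing tokens two at a time.

    Assumes every person contributes exactly two tokens (last name, initials).
    """
    tokens = [t for t in field.split(",") if t]
    if len(tokens) % 2:
        raise ValueError("odd number of tokens; cannot pair names reliably")
    return list(zip(tokens[0::2], tokens[1::2]))


def split_by_separator(field: str) -> list[tuple[str, str]]:
    """Parse the space-separated form: people are delimited by ', '."""
    people = []
    for person in field.split(", "):
        person = person.strip().rstrip(",")  # drop a trailing comma, if any
        if not person:
            continue
        last, _, initials = person.partition(",")
        people.append((last, initials))
    return people


# Shortened versions of the two strings quoted above.
cli = "Kinganda-Lusamaki,E.,Whitmer,S.,Lokilo-Lofiko,E."
gb = "Kinganda-Lusamaki,E., Whitmer,S., Lokilo-Lofiko,E.,"
print(split_by_parity(cli))
print(split_by_separator(gb))

# Hypothetical entry without initials: the space-separated form still parses,
# while the comma-only form loses its pairing.
print(split_by_separator("Doe,J., SomeConsortium, Roe,R."))
try:
    split_by_parity("Doe,J.,SomeConsortium,Roe,R.")
except ValueError as err:
    print("parity parse failed:", err)
```

With well-formed input both parsers agree, but only the separator-based one keeps working once a single entry deviates from the two-token pattern.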