Flag(s) to Remove Outdated Assemblies
Hello,
Here are a few improvements that, in my opinion, could greatly simplify and improve downstream processing:
- Is it possible to add "--latest-version" to the datasets <summary/download> genome <taxon/accession> command? Right now, if I want to download all assemblies associated with, for example, viruses, I have to sort through all assembly_accession values and select only the most recent versions.
Here is my example command: datasets summary genome taxon "viruses" --assembly-source refseq --assembly-level complete_genome,chromosome > <some_file.json>
The resulting JSON file contains both "GCF_003029025.2" and "GCF_003029025.1". I am only interested in the latest assembly version, so I parse every assembly_accession, look for the greatest number after the dot ("1" vs. "2" in this example), and keep only that assembly. This method, however, becomes problematic in some unusual cases where a new version of an assembly changes its assembly level (for instance, GCF_001736955.1 is a complete genome, while GCF_001736955.2 is a scaffold). The command above will only grab the "suppressed" assembly version and, therefore, my parsing approach will not work as expected (it will keep the "suppressed" assembly).
- Is it possible to add "--ignore-legacy" to the same command as in the first request? I ran the same command without the --assembly-level flag and noticed that one of the assemblies is a "removed record" (GCF_000869025.1). There is currently no way for a user to know the status of an assembly (latest, removed, suppressed) unless they check every single result by means other than datasets.
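For reference, the version-parsing workaround described in the first request can be sketched roughly as below. This is only an illustration: the function name latest_versions and the regular expression are assumptions made for this example, not part of datasets, and the input is a plain list of accession strings (extracting them from the summary JSON is left out, since the JSON layout depends on the CLI version).

```python
import re
from typing import Dict, Iterable, List

# Assumed accession shape for this sketch: "GCF_" or "GCA_", digits, a dot,
# then the version number, e.g. "GCF_003029025.2".
_ACCESSION_RE = re.compile(r"^(GC[AF]_\d+)\.(\d+)$")


def latest_versions(accessions: Iterable[str]) -> List[str]:
    """Keep one accession per assembly: the one with the highest version."""
    best: Dict[str, int] = {}
    for acc in accessions:
        match = _ACCESSION_RE.match(acc.strip())
        if match is None:
            continue  # skip strings that do not look like assembly accessions
        base, version = match.group(1), int(match.group(2))
        if version > best.get(base, 0):
            best[base] = version
    # Sorted so the result is deterministic regardless of input order.
    return sorted(f"{base}.{version}" for base, version in best.items())


if __name__ == "__main__":
    found = ["GCF_003029025.2", "GCF_003029025.1", "GCF_000869025.1"]
    print(latest_versions(found))
```

Note that this only compares the versions present in the input. If a filter such as --assembly-level excludes the newest version, the sketch keeps the older one, which is exactly the GCF_001736955 problem described above, and it cannot tell whether an accession has been removed or suppressed.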
I am currently using ncbi-datasets-cli version 13.10.0, installed from conda.
Thank you, Vasily
Hi Vasily,
Thanks for opening this issue.
We are currently in the process of adding additional genomes, and are still considering which genomes to show by default and which genomes will be available when using different filters.
Regarding your feature request for a --latest-version flag: most likely, we will only be showing the latest genome versions by default, while an optional flag can be specified to get previous versions of genomes.
Similarly, we will most likely omit removed and suppressed genomes by default, and allow users to get these genomes using an optional flag.
We hope to add the additional genomes and implement the new default behavior soon, tentatively in the next 2-4 weeks, and would be happy to get your feedback on changes to the command-line tool when it’s ready. I’ll comment on this issue when we make these updates.
Best, Eric
Eric Cox, PhD [Contractor] (he/him/his)
NCBI Datasets
Sequence Enhancements, Tools and Delivery (SeqPlus)
NIH/NLM/NCBI
Hi Eric,
That all sounds wonderful! Keep me posted!
Best, Vasily