
Flag(s) to Remove Outdated Assemblies

Open tokarevvasily opened this issue 2 years ago • 2 comments

Hello,

Here are a few improvements that, in my opinion, could greatly simplify and improve downstream processing:

  1. Is it possible to add "--latest-version" for the datasets <summary/download> genome <taxon/accession> command? Right now, if I want to download all assemblies that are associated with, for example, viruses, I have to sort through all assembly_accessions and select only the most recent versions.
    Here is my example command: datasets summary genome taxon "viruses" --assembly-source refseq --assembly-level complete_genome,chromosome > <some_file.json>. Inside the resulting JSON file I get both "GCF_003029025.2" and "GCF_003029025.1". I am only interested in the latest assembly version, so I parse every assembly_accession, look for the greatest number after the dot ("1" vs. "2" in this example), and keep only that assembly. This method, however, becomes problematic in some unusual cases where a new version of an assembly changes its assembly level (for instance, GCF_001736955.1 is a complete genome, while GCF_001736955.2 is a scaffold). In that case the command above only grabs the older, "suppressed" assembly version, because the newer one is filtered out by --assembly-level, and my parsing approach does not work as expected (it keeps the "suppressed" assembly).

  2. Is it possible to add "--ignore-legacy" for the same command as in the first question? I ran the same command without the --assembly-level flag and noticed that one of the assemblies is a "removed record" (GCF_000869025.1). There is currently no way for a user to know the status of an assembly (latest, removed, suppressed) unless they check every single result through means other than datasets.
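
For reference, the version-parsing workaround described in item 1 can be sketched roughly as below. This is only an illustration of the approach, not anything provided by datasets: the function name latest_versions is made up, and it assumes the accession strings have already been extracted from the JSON output into a plain list.

```python
from typing import Dict, Iterable, List


def latest_versions(accessions: Iterable[str]) -> List[str]:
    """Keep only the highest-numbered version of each assembly accession.

    Hypothetical helper: expects strings such as "GCF_003029025.2",
    i.e. a base accession, a dot, and an integer version.
    """
    best: Dict[str, int] = {}
    for acc in accessions:
        # Split on the last dot: "GCF_003029025.2" -> ("GCF_003029025", "2")
        base, _, version = acc.rpartition(".")
        if not base or not version.isdigit():
            # Skip anything that does not look like "<accession>.<version>"
            continue
        # Compare versions numerically so that ".10" beats ".9"
        best[base] = max(best.get(base, -1), int(version))
    return sorted(f"{base}.{v}" for base, v in best.items())


if __name__ == "__main__":
    print(latest_versions(["GCF_003029025.2", "GCF_003029025.1"]))
```

Note that this only picks the highest version among the accessions it is given. If the newest version was excluded by a filter such as --assembly-level, the older (suppressed) version is the only one present and is still kept, which is exactly the problem described above.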

I am currently using ncbi-datasets-cli version == 13.10.0 from conda.

Thank you, Vasily

tokarevvasily avatar Apr 08 '22 21:04 tokarevvasily

Hi Vasily,

Thanks for opening this issue.

We are currently in the process of adding additional genomes, and are still considering which genomes to show by default and which genomes will be available when using different filters.

Regarding your feature request for a --latest-version flag: most likely, we will only be showing the latest genome versions by default, while an optional flag can be specified to get previous versions of genomes.

Similarly, we will most likely omit removed and suppressed genomes by default, and allow users to get these genomes using an optional flag.

We hope to add the additional genomes and implement the new default behavior soon, tentatively in the next 2-4 weeks, and would be happy to get your feedback on changes to the command-line tool when it’s ready. I’ll comment on this issue when we make these updates.

Best, Eric

Eric Cox, PhD [Contractor] (he/him/his) NCBI Datasets Sequence Enhancements, Tools and Delivery (SeqPlus) NIH/NLM/NCBI

ericcox1 avatar Apr 13 '22 13:04 ericcox1

Hi Eric,

That all sounds wonderful! Keep me posted!

Best, Vasily

tokarevvasily avatar Apr 14 '22 19:04 tokarevvasily