genome_updater

Is there a way to separate genomes from .genomic.fna.gz files?

Open jmwhitha opened this issue 5 years ago • 2 comments

Hi Vitor,

Thanks for making this application.

I was wondering if there is a way to use it to separate the genomes once I've downloaded the genomic.fna.gz files. I have tried awk, but the header formatting varies quite a bit between genomes. As you probably know, sometimes the descriptions have "sp." or "strain", sometimes they have "Scaffolds" or "contigs", etc., which makes it hard, though not impossible, to separate individual genomes.

If your application cannot separate the genomes either, are you familiar with any applications or scripts that can?

Thank you, Jason

jmwhitha, Aug 31 '20 19:08

Hi Jason,

There's currently no way to do that with genome_updater.

I believe you could parse the assembly_summary.txt file of the current version and get the information you need to separate the files. Check fields 9 and 12; more info here: ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt
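A minimal sketch of that parsing step in Python, assuming the file is tab-separated with any header lines starting with "#". The function name `read_summary_fields` is hypothetical, and the column meanings noted in the comments (field 9 as `infraspecific_name`, field 12 as `assembly_level`) are my reading of the README linked above, so check them against your copy of the file:

```python
from typing import Dict, Iterable, Tuple


def read_summary_fields(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Map assembly accession (field 1) to (field 9, field 12), 1-based.

    Assumption: field 9 is infraspecific_name (e.g. "strain=...") and
    field 12 is assembly_level (e.g. "Scaffold", "Contig").
    """
    records: Dict[str, Tuple[str, str]] = {}
    for line in lines:
        # Skip comment/header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        # Ignore malformed rows that are too short to hold field 12.
        if len(cols) < 12:
            continue
        records[cols[0]] = (cols[8], cols[11])
    return records
```

It could be called with an open file handle, for example `read_summary_fields(open("assembly_summary.txt"))`, since it only needs an iterable of lines.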

In the assembly_summary.txt, the first column is the assembly accession which points you to the file downloaded with genome_updater if you use: {output_dir}/{version}/files/{assembly_accession}*genomic.fna.gz
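The path pattern above could be applied with a glob, roughly as follows. This is only a sketch: `find_genome_files` is a hypothetical helper, and `output_dir` and `version` stand for whatever values were used when running genome_updater.

```python
import glob
import os
from typing import List


def find_genome_files(output_dir: str, version: str, accession: str) -> List[str]:
    """Return files matching {output_dir}/{version}/files/{accession}*genomic.fna.gz."""
    pattern = os.path.join(
        output_dir,
        version,
        "files",
        # Escape the accession so it is matched literally, then add the wildcard.
        glob.escape(accession) + "*genomic.fna.gz",
    )
    return sorted(glob.glob(pattern))
```

Note that the wildcard would also match names such as `*_cds_from_genomic.fna.gz` or `*_rna_from_genomic.fna.gz` if those file types were downloaded too, so the result may need an extra filter in that case.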

I hope that helps. I will leave this issue open and mark it as an enhancement so I may include some of those features in the next release.

Best, Vitor

pirovc, Sep 01 '20 07:09

Thank you so much for pointing me to the assembly_summary.txt. This seems like a good starting point for a solution.

Looking forward to the enhancement!

jmwhitha, Sep 02 '20 11:09