ncbi-genome-download

Enhancement to --human-readable output

Open tseemann opened this issue 8 years ago • 17 comments

Firstly, let me say thank you for adding the human readable output option. It is working great so far with my testing!

I was hoping you could add another option to put the human-readable name in the file name as well. That would make it easy to hand a bunch of these files to other software and get output we can understand. I'm not quite sure how versions would work, though.

Current: human_readable/genbank/bacteria/Dichelobacter/nodosus/VCS1703A/GCA_000015345.1_ASM1534v1_genomic.gbff.gz

Alternate suggested: human_readable/genbank/bacteria/Dichelobacter/nodosus/VCS1703A/Dichelobacter_nodosus_VCS1703A.gbff.gz

tseemann avatar Sep 10 '16 01:09 tseemann

I deliberately left this out in the initial version because I needed some more time to think about the naming scheme there. And as "naming things" is one of the two biggest problems in computer science (along with "halting problem" and "off by one errors"), this might take a while.

As I was planning to push a release out to cope with NCBI's decision to discontinue plain HTTP downloads at the end of September, I didn't want to wait until I had a good idea on dealing with the names.

The --human-readable feature still has a couple of things that could be improved, so I'll keep this on the radar.

kblin avatar Sep 10 '16 05:09 kblin

It would be great to get an Aspera or rsync version in. That way you could construct a list of files to download, pass it to ascp or rsync, and let it take care of the MD5 checksum handling. Aspera also has rsync-like behaviour where it won't re-download files you already have.

tseemann avatar Sep 11 '16 05:09 tseemann

IIRC, rsync still doesn't do parallel transfers. Don't know the other tool.

kblin avatar Sep 11 '16 20:09 kblin

I spent a lot of time hacking together my own tool https://github.com/andrewsanchez/NCBITK

@tseemann It works by making a subprocess call to rsync.

Needless to say, I only recently discovered ncbi-genome-download. Great work! Had I thought to search GitHub repos for "ncbi," I would have saved myself a lot of time 😢

@kblin - I'd like to talk about maybe abandoning my project and joining forces.

My documentation is very limited at the moment, but you can find my renaming function here:

If anybody is interested in trying it out, I can provide some simple instructions, and change it to handle other file extensions if necessary.

The renaming scheme is:

name = '{}_{}_{}_{}_{}_{}.fasta'.format(genome_id, organism_name, scientific_name, infraspecific_name, isolate, assembly_level)

The scientific name is found by looking for the species tax id in the taxonomy dump. The rest of the values are taken from the assembly summary file. Currently only works with bacteria.

Here are a couple of examples of the naming scheme:

GCA_000460675.1_Escherichia_coli_UMEA_3190_1_Scaffold.fasta
GCA_001660585.1_Escherichia_coli_NGF1_Complete_Genome.fasta
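
For illustration, here is a minimal Python sketch of how a name like the second example could be assembled. It is not the NCBITK code: the function name `readable_name` is hypothetical, and the thread does not say how empty or repeated fields are handled, so the sketch simply assumes empty fields are skipped, a `strain=` style prefix is dropped, and anything other than letters, digits and dots becomes an underscore.

```python
import re


def readable_name(genome_id, *fields, extension="fasta"):
    """Join the accession and all non-empty fields with underscores."""
    cleaned = []
    for field in fields:
        # Assumption: values such as "strain=NGF1" carry a "key=" prefix
        # that should not end up in the file name.
        value = field.split("=", 1)[-1].strip()
        # Collapse every run of characters that is awkward in a file name.
        value = re.sub(r"[^A-Za-z0-9.]+", "_", value).strip("_")
        if value:
            cleaned.append(value)
    return "_".join([genome_id] + cleaned) + "." + extension


print(readable_name("GCA_001660585.1", "Escherichia coli", "strain=NGF1",
                    "", "Complete Genome"))
```

Keeping the accession at the front means two assemblies of the same strain still get distinct names, which may also be one way to sidestep the versioning question raised at the top of the thread.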

andrewsanchez avatar Aug 16 '17 15:08 andrewsanchez

Thanks, Andrew. Will have a look once I have a chance!

dutchscientist avatar Aug 16 '17 16:08 dutchscientist

Hi Andrew, I'll get back to you next week, as I'm currently out of office. But sure, I'm happy to join forces. Let's see what features are still missing in ncbi-genome-download that are covered in NCBITK, and how to best integrate those.

kblin avatar Aug 16 '17 22:08 kblin

I also still have my own script, which pre-dates @kblin's but was very hacky and never released. I managed to clean it up in 20 minutes, and it is much better now. It generates a single Aspera ascp call from a SOURCE DEST table and does rsync-like incremental updates. It is VERY fast, even to Australia. But I am focussing only on microbes. I also need to add some hacks to support plasmids that aren't encompassed within bacteria.

As I said it's very similar to this tool here. It also generates a makefile so you can incrementally keep a tree of uncompressed GBK, GFF3 and FNA derived from that .gbk.gz I mirror.

I think mine overlaps with @andrewsanchez and @kblin

my $BASEURL = 'ftp://ftp.ncbi.nlm.nih.gov/genomes';
my $SUMMARY = 'assembly_summary.txt';
my @DIVISION = qw(refseq genbank);
my @KINGDOM = qw(viral archaea bacteria fungi protozoa);
my $ASPERA = "ascp -v -T -k 1 -l 300m -i /home/linuxbrew/.linuxbrew/etc/asperaweb_id_dsa.openssh";

tseemann avatar Aug 16 '17 23:08 tseemann

Yeah, I think generating a list of things to rsync and using the --files-from parameter would be the way to go if rsync support is what is desired. I personally have no interest in supporting non-OSS tools, but I certainly would be happy to accept patches adding support for Aspera as well.

The main reason that I didn't add rsync support so far is that rsync transfers are single-threaded, unless you spend some extra work to split the source files and then use a process pool to fork and execute multiple rsync processes. I don't see a big benefit over the existing HTTP transfer method.
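
As a rough sketch of the "extra work" described above (not code from ncbi-genome-download): split the file list into chunks, write each chunk to a temporary list file, and start one rsync process per chunk with `--files-from`. All function names here are hypothetical. The sketch uses a thread pool instead of a process pool, which should be enough because each worker only waits on an external rsync process.

```python
import os
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(paths, n_chunks):
    """Deal the paths round-robin into at most n_chunks non-empty lists."""
    chunks = [paths[i::n_chunks] for i in range(n_chunks)]
    return [chunk for chunk in chunks if chunk]


def rsync_command(source, dest, list_file):
    """Build an rsync call that only transfers the paths named in list_file."""
    return ["rsync", "--times", "--files-from=" + list_file, source, dest]


def run_chunk(source, dest, chunk):
    """Write one chunk to a temporary list file and hand it to rsync."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
        handle.write("\n".join(chunk) + "\n")
    try:
        return subprocess.call(rsync_command(source, dest, handle.name))
    finally:
        os.unlink(handle.name)


def parallel_rsync(source, dest, paths, workers=4):
    """Run one rsync process per chunk and return their exit codes."""
    chunks = split_into_chunks(list(paths), workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda chunk: run_chunk(source, dest, chunk), chunks))


# Hypothetical usage; the source URL is an assumption, and the paths must be
# relative to it:
# parallel_rsync("rsync://ftp.ncbi.nlm.nih.gov/genomes/", "genomes/", paths)
```

Whether this beats the existing HTTP downloads would have to be measured. The sketch only shows that the splitting itself is a small amount of code.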

kblin avatar Aug 17 '17 07:08 kblin

That said, can we move the discussion of alternative download strategies into a different issue? I'd like to keep this one about human-readable file names.

kblin avatar Aug 17 '17 07:08 kblin

Any progress on adding human readable file names?

CarolineOhrman avatar Jan 08 '18 14:01 CarolineOhrman

Sorry, this didn't bubble up far enough on my todo list yet. Would the naming scheme suggested by @andrewsanchez work for you?

kblin avatar Jan 11 '18 16:01 kblin

I would like to begin work on implementing this. Any suggestions on how you could see this fitting in, @kblin? Do you see it as simply being part of the --human-readable option?

andrewsanchez avatar Jan 26 '18 21:01 andrewsanchez

I'm also interested in implementing an alternative human readable directory structure in the following format:

bacteria/
└── Abiotrophia_defectiva
    └── GCA_000160075.2_Abiotrophia_defectiva_ATCC_49176_Scaffold.fasta

I wonder if anybody else would prefer the above, less deeply nested directory structure over the current implementation of the human readable option:

bacteria/
└── Abiotrophia
    └── defectiva
        └── abc123
            └── GCA_000160075.2_Abiotrophia_defectiva_ATCC_49176_Scaffold.fasta
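
To make the two layouts concrete, here is a small Python sketch that builds both paths from the same inputs. The function names are hypothetical, and the sketch assumes the organism name starts with a genus and a species word. Names with fewer than two words would raise a `ValueError` and need separate handling.

```python
from pathlib import PurePosixPath


def nested_path(group, organism_name, strain, filename):
    """Current layout: group/genus/species/strain/filename."""
    genus, species = organism_name.split()[:2]
    return PurePosixPath(group, genus, species, strain, filename)


def flat_path(group, organism_name, filename):
    """Proposed layout: group/Genus_species/filename."""
    genus, species = organism_name.split()[:2]
    return PurePosixPath(group, genus + "_" + species, filename)


name = "GCA_000160075.2_Abiotrophia_defectiva_ATCC_49176_Scaffold.fasta"
print(nested_path("bacteria", "Abiotrophia defectiva", "abc123", name))
print(flat_path("bacteria", "Abiotrophia defectiva", name))
```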

andrewsanchez avatar Jan 26 '18 21:01 andrewsanchez

The reason I went for the deep nesting we have right now is that there are hundreds of E. coli genomes, and with the flat scheme they would all end up in the same directory. In my opinion, that makes it harder to find things, not easier.

That said, I do realise that people who navigate folder structures in a file browser probably prefer the flat structure that involves less clicking to get anywhere.

My first reaction was to suggest making this configurable, but from a UX design perspective, that's the cowardly choice. "I don't know, I'll make it a config option." So I'd actually be interested in input from other people.

kblin avatar Jan 29 '18 10:01 kblin

directory tree == taxonomic tree. That is the current implementation and that seems preferable over a flattened format IMHO.

rhpvorderman avatar Feb 19 '18 15:02 rhpvorderman

I think at least one level of nesting can be removed. What is the value of placing the fna file under the GCA_003741385.1 directory in the example below?

genbank/bacteria/GCA_003741385.1/GCA_003741385.1_PDT000396594.1_genomic.fna.gz

As for giving files human-readable names: if you are willing to make pandas a dependency, @kblin, it would be easy to give linked genomes in the human-readable directory a more verbose name based on info in the assembly summary.
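
As a sketch of what "info in the assembly summary" could look like in code, the following reads a tab-separated summary with the standard-library `csv` module. That is a swap from the pandas suggestion above, made only to keep the example dependency-free. The `SAMPLE` text is an illustrative, cut-down stand-in for a real assembly_summary.txt: the column subset, the shortened comment line, and the `strain=` value are assumptions, and `read_summary` is a hypothetical helper.

```python
import csv
import io

# Illustrative stand-in for assembly_summary.txt with only a few columns.
SAMPLE = (
    "#   See the README for a description of the columns in this file.\n"
    "# assembly_accession\torganism_name\tinfraspecific_name\tisolate\tassembly_level\n"
    "GCA_001660585.1\tEscherichia coli\tstrain=NGF1\t\tComplete Genome\n"
)


def read_summary(handle):
    """Yield one dict per assembly, keyed by column name."""
    header = None
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row:
            continue
        if row[0].startswith("#"):
            # Assumption: the last comment line before the data names the columns.
            header = [row[0].lstrip("# ")] + row[1:]
            continue
        if header is None:
            raise ValueError("no header line found before the first data row")
        yield dict(zip(header, row))


records = {rec["assembly_accession"]: rec
           for rec in read_summary(io.StringIO(SAMPLE))}
print(records["GCA_001660585.1"]["organism_name"])
```

With a lookup like `records`, the accession that already appears in each downloaded file name would be enough to find the organism, strain and assembly level for a more verbose link name.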

andrewsanchez avatar May 11 '19 20:05 andrewsanchez

Any news on this at all? Any recommendations for those of us with little know-how?

Joshua-Macleod avatar Jun 21 '23 18:06 Joshua-Macleod