ncbi-genome-download
Enhancement to --human-readable output
Firstly, let me say thank you for adding the human readable output option. It is working great so far with my testing!
I was hoping you could add another option to put the human-readable name in the file name as well. That would make it easy to pass a batch of these files to other software and get output we can understand. I'm not quite sure how versions would work, though.
Current:
human_readable/genbank/bacteria/Dichelobacter/nodosus/VCS1703A/GCA_000015345.1_ASM1534v1_genomic.gbff.gz
Alternate suggested:
human_readable/genbank/bacteria/Dichelobacter/nodosus/VCS1703A/Dichelobacter_nodosus_VCS1703A.gbff.gz
I deliberately left this out in the initial version because I needed some more time to think about the naming scheme there. And as "naming things" is one of the two biggest problems in computer science (along with "halting problem" and "off by one errors"), this might take a while.
As I was planning to push a release out to cope with NCBI's decision to discontinue plain HTTP downloads at the end of September, I didn't want to wait until I had a good idea on dealing with the names.
The --human-readable feature still has a couple of things that could be improved, so I'll keep this on the radar.
It would be great to get an Aspera or rsync version in. That way you could construct a list of files to download, pass it to ascp or rsync, and let it take care of the MD5SUM stuff! Aspera also has rsync-like behaviour where it won't re-download stuff you already have.
IIRC, rsync still doesn't do parallel transfers. Don't know the other tool.
I spent a lot of time hacking together my own tool https://github.com/andrewsanchez/NCBITK
@tseemann It works by making a subprocess call to rsync.
Needless to say, I only recently discovered ncbi-genome-download. Great work! Had I thought to search GitHub repos for "ncbi", I would have saved myself a lot of time.
@kblin - I'd like to talk about maybe abandoning my project and joining forces.
My documentation is very limited at the moment, but you can find my renaming function here:
If anybody is interested in trying it out, I can provide some simple instructions, and change it to handle other file extensions if necessary.
The renaming scheme is:
name = '{}_{}_{}_{}_{}_{}.fasta'.format(genome_id, organism_name, scientific_name, infraspecific_name, isolate, assembly_level)
The scientific name is found by looking for the species tax id in the taxonomy dump. The rest of the values are taken from the assembly summary file. Currently only works with bacteria.
Here's a couple examples of the naming scheme:
GCA_000460675.1_Escherichia_coli_UMEA_3190_1_Scaffold.fasta
GCA_001660585.1_Escherichia_coli_NGF1_Complete_Genome.fasta
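For anyone who wants to experiment with a scheme like this, here is a minimal sketch. It is not the NCBITK code: the function names `clean` and `build_name`, the reduced set of four fields, and the cleaning rules (dropping the `strain=` prefix, turning runs of other characters into single underscores, skipping empty fields) are assumptions chosen so that the output matches the second example above.

```python
import re


def clean(field):
    """Replace runs of characters other than letters, digits, dots and
    underscores with a single underscore, and trim underscores at the ends."""
    return re.sub(r"[^A-Za-z0-9._]+", "_", field).strip("_")


def build_name(accession, species, infraspecific_name, assembly_level, ext="fasta"):
    """Join the cleaned fields into one file name (hypothetical helper)."""
    # Assumption: infraspecific_name looks like "strain=NGF1" in the summary file.
    strain = infraspecific_name.split("=", 1)[-1]
    parts = [clean(p) for p in (accession, species, strain, assembly_level)]
    # Skip empty fields so missing values do not leave double underscores.
    return "_".join(p for p in parts if p) + "." + ext


print(build_name("GCA_001660585.1", "Escherichia coli", "strain=NGF1", "Complete Genome"))
```

Keeping the accession, including its version suffix, at the front of the name means two assemblies of the same strain still get distinct file names.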
Thanks, Andrew. Will have a look once I have a chance!
Hi Andrew, I'll get back to you next week, as I'm currently out of office. But sure, I'm happy to join forces. Let's see what features are still missing in ncbi-genome-download that are covered in NCBITK, and how to best integrate those.
I also still have my script, which pre-dates @kblin's but was very hacky and never released; I managed to clean it up in 20 minutes to be much better. It generates a single Aspera ascp call from a SOURCE DEST table, and does rsync-like incremental updates. It is VERY fast, even to Australia. But I am focussing only on microbes. I also need to add some hacks to support plasmids that aren't encompassed within bacteria.
As I said it's very similar to this tool here. It also generates a makefile so you can incrementally keep a tree of uncompressed GBK, GFF3 and FNA derived from that .gbk.gz I mirror.
I think mine overlaps with @andrewsanchez and @kblin
my $BASEURL = 'ftp://ftp.ncbi.nlm.nih.gov/genomes';
my $SUMMARY = 'assembly_summary.txt';
my @DIVISION = qw(refseq genbank);
my @KINGDOM = qw(viral archaea bacteria fungi protozoa);
my $ASPERA = "ascp -v -T -k 1 -l 300m -i /home/linuxbrew/.linuxbrew/etc/asperaweb_id_dsa.openssh";
Yeah, I think generating a list of things to rsync and using the --files-from parameter would be the way to go if rsync support is what is desired. I personally have no interest in supporting non-OSS tools, but I certainly would be happy to accept patches adding support for aspera as well.
The main reason that I didn't add rsync support so far is that rsync transfers are single-threaded, unless you spend some extra work to split the source files and then use a process pool to fork and execute multiple rsync processes. I don't see a big benefit over the existing HTTP transfer method.
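To illustrate the extra work involved, here is a rough sketch of splitting a file list and running several rsync processes at once. It is not part of ncbi-genome-download; the function names are hypothetical, and it uses a thread pool that launches rsync child processes rather than a forking process pool, since each worker only waits on its child. The only rsync options used are `--archive` and `--files-from`.

```python
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor


def split_round_robin(paths, n_chunks):
    """Deal paths into at most n_chunks lists of roughly equal size."""
    chunks = [paths[i::n_chunks] for i in range(n_chunks)]
    return [chunk for chunk in chunks if chunk]


def rsync_command(list_file, source, dest):
    """Build one rsync call that fetches only the paths named in list_file."""
    return ["rsync", "--archive", "--files-from=" + list_file, source, dest]


def run_chunk(chunk, source, dest):
    """Write one chunk to a list file and run rsync on it; returns the exit code."""
    # The list file is left on disk afterwards; a real tool would clean it up.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
        handle.write("\n".join(chunk) + "\n")
    return subprocess.run(rsync_command(handle.name, source, dest)).returncode


def parallel_rsync(paths, source, dest, workers=4):
    """Run up to `workers` rsync processes side by side, one per chunk."""
    chunks = split_round_robin(paths, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda chunk: run_chunk(chunk, source, dest), chunks))
```

Here `paths` would hold paths relative to `source`, which is how `--files-from` interprets its entries. Nothing runs until `parallel_rsync` is called, and the source URL to pass in is left to the caller.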
That said, can we move the discussion of alternative download strategies into a different issue? I'd like to keep this one about human-readable file names.
Any progress on adding human readable file names?
Sorry, this didn't bubble up far enough on my todo list yet. Would the naming scheme suggested by @andrewsanchez work for you?
I would like to begin work on implementing this. Any suggestions on how you see this fitting in, @kblin? Do you see it as simply being part of the --human-readable option?
I'm also interested in implementing an alternative human readable directory structure in the following format:
bacteria/
└── Abiotrophia_defectiva
    └── GCA_000160075.2_Abiotrophia_defectiva_ATCC_49176_Scaffold.fasta
I wonder if anybody else would prefer the above, less deeply nested directory structure over the current implementation of the human readable option:
bacteria/
└── Abiotrophia
    └── defectiva
        └── abc123
            └── GCA_000160075.2_Abiotrophia_defectiva_ATCC_49176_Scaffold.fasta
The reason I went for the deep nesting we have right now is that there's hundreds of E. coli genomes, and with the flat scheme they would all end up in the same directory. In my opinion, that makes it harder to find things, not easier.
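To make the trade-off concrete, here is a small sketch of the two layouts side by side. The helper names `nested_path` and `flat_path` are made up for illustration and are not functions in ncbi-genome-download.

```python
from pathlib import PurePosixPath


def nested_path(group, genus, species, strain, filename):
    """Deep layout: one directory level each for genus, species and strain."""
    return PurePosixPath(group, genus, species, strain, filename)


def flat_path(group, genus, species, filename):
    """Flat layout: a single Genus_species directory holds every assembly."""
    return PurePosixPath(group, f"{genus}_{species}", filename)


name = "GCA_000160075.2_Abiotrophia_defectiva_ATCC_49176_Scaffold.fasta"
print(nested_path("bacteria", "Abiotrophia", "defectiva", "ATCC_49176", name))
print(flat_path("bacteria", "Abiotrophia", "defectiva", name))
```

With the flat layout, every strain of a species shares one parent directory, which is where the hundreds of E. coli genomes would pile up.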
That said, I do realise that people who navigate folder structures in a file browser probably prefer the flat structure that involves less clicking to get anywhere.
My first reaction was to suggest making this configurable, but from a UX design perspective, that's the cowardly choice. "I don't know, I'll make it a config option." So I'd actually be interested in input from other people.
directory tree == taxonomic tree. That is the current implementation and that seems preferable over a flattened format IMHO.
I think at least one level of nesting can be removed. What is the value of placing the fna file under the GCA_003741385.1 directory in the example below?
genbank/bacteria/GCA_003741385.1/GCA_003741385.1_PDT000396594.1_genomic.fna.gz
As for giving files human-readable names: if you are willing to make pandas a dependency, @kblin, it would be easy to give linked genomes in the human-readable directory a more verbose name based on info in the assembly summary.
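For a sense of scale, the lookup itself can be sketched with plain string handling, without pandas. Everything below is an assumption for illustration: the function name `load_summary` is hypothetical, the sample text is made up and carries only four of the summary's columns, and the parser assumes the column header is the last line starting with `#` before the data rows.

```python
def load_summary(text):
    """Map assembly accession -> dict of column values (hypothetical helper)."""
    header = []
    rows = {}
    for line in text.splitlines():
        if line.startswith("#"):
            # Later comment lines overwrite earlier ones, so the last wins.
            header = line.lstrip("# ").split("\t")
        elif line.strip():
            row = dict(zip(header, line.split("\t")))
            rows[row["assembly_accession"]] = row
    return rows


# Made-up sample in the tab-separated style of assembly_summary.txt.
SAMPLE = (
    "#   See the README for a description of the columns\n"
    "# assembly_accession\torganism_name\tinfraspecific_name\tassembly_level\n"
    "GCA_001660585.1\tEscherichia coli\tstrain=NGF1\tComplete Genome\n"
)

summary = load_summary(SAMPLE)
print(summary["GCA_001660585.1"]["assembly_level"])
```

The values in each row are then available to whatever naming scheme is chosen for the linked files.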
Any news on this at all? Any recommendations for those of us with little know-how?