biomartr
Performance issue
Is it just me, or is downloading entire groups with meta.retrieval, such as
meta.retrieval(kingdom = "virus", db = "refseq", type = "genome")
very slow? I have been trying to download all viruses. The download aborted after a while, so I wrapped it in an infinite repeat loop (not nice, but it works), and it has now been running for about a week. Do you have any idea how to improve the speed?
I don't think my network speed is the limitation, because the tool ncbi-genome-download is rather quick (2-3 hours). It produces names that are difficult to parse, though, so I would rather use biomartr.
Thank you very much.
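A bounded retry may be a gentler workaround than an open-ended repeat loop, since it stops after a fixed number of failures and pauses between attempts. The following is only a minimal sketch: the helper name retry and its arguments max_attempts and wait_seconds are hypothetical and not part of biomartr.

```r
# Hypothetical helper (not part of biomartr): call `fun` until it succeeds,
# giving up after `max_attempts` failures and pausing between attempts.
retry <- function(fun, max_attempts = 5, wait_seconds = 60) {
  for (attempt in seq_len(max_attempts)) {
    result <- tryCatch(
      list(ok = TRUE, value = fun()),
      error = function(e) {
        message("Attempt ", attempt, " failed: ", conditionMessage(e))
        list(ok = FALSE, value = NULL)
      }
    )
    if (result$ok) {
      return(invisible(result$value))
    }
    # Wait before the next attempt so the server is not hit in a tight loop.
    if (attempt < max_attempts) {
      Sys.sleep(wait_seconds)
    }
  }
  stop("Giving up after ", max_attempts, " attempts.")
}

# Usage (not run here; needs biomartr and network access):
# retry(function() {
#   biomartr::meta.retrieval(kingdom = "viral", db = "refseq", type = "genome")
# })
```

Because already downloaded files are skipped on a rerun, each retry should only need to fetch what is still missing.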
Hi Johannes,
Thank you for making me aware of this and I will look into this issue.
A lot of small updates for biomartr are long overdue anyway, and I hope to be able to sit down and get some coding done soon, so in this context I will also check what may be happening there.
Best wishes, Hajk
If I can help in any way with the package, let me know. I like it a lot. :-)
Hi Johannes,
Many thanks for your feedback!
Could the default timeout limit of 60s explain this issue?
You can reset the timeout limit via:
options(timeout = 30000)
For example, the NCBI species summary file for bacteria is >300 MB, and with a slow connection the 60 s timeout limit is easily reached.
If you install the developer version of biomartr, the timeout limit is now handled internally and set to 30000 by default.
Does this help?
Cheers, Hajk
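If changing the timeout for the whole R session is not desired, the option can be raised for a single call and restored afterwards. This is a sketch only: with_timeout is a hypothetical wrapper, not a biomartr function. It relies on base R's options() returning the previous values.

```r
# Hypothetical wrapper (not part of biomartr): raise the download timeout
# for one expression and restore the previous value afterwards.
with_timeout <- function(seconds, expr) {
  old <- options(timeout = seconds)   # options() returns the old settings
  on.exit(options(old), add = TRUE)   # restore even if expr fails
  force(expr)                         # expr is evaluated lazily, i.e. here
}

# Usage (not run here; needs biomartr and network access):
# with_timeout(30000, {
#   biomartr::meta.retrieval(kingdom = "viral", db = "refseq", type = "genome")
# })
```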
Thank you, the current developer version solved that issue, at least partly.
But it still does not download all genomes. After the third attempt, it aborts after Circoviridae 17 LDMD-2013:
Starting genome retrieval of 'Circoviridae 17 LDMD-2013' from refseq ...
The FTP link: 'na/na_genomic.fna.gz' seems not to be available at the moment. This might be due to an instable internet connection, a firewall issue, or wrong organism name. Could you please try to re-run the function to see whether it works now?
The FTP link: 'na/md5checksums.txt' seems not to be available at the moment. This might be due to an instable internet connection, a firewall issue, or wrong organism name. Could you please try to re-run the function to see whether it works now?
Genome download of Circoviridae_17_LDMD-2013 is completed!
The download session seems to have timed out at the FTP site 'na/na_genomic.fna.gz'. This could be due to an overload of queries to the databases. Please restart this function to continue the data retrieval process or wait for a while before restarting this function in case your IP address was logged due to an query overload on the server side.
Error: Please provide a valid file path to your genome assembly file.
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Another thing I noticed is that some downloads are skipped at the beginning, while others are skipped only while the command runs, e.g.
(...)
File viral/Paenibacillus_phage_phiIBB_P123_genomic_refseq.fna.gz exists already. Thus, download has been skipped.
The genome of 'Paenibacillus_phage_phiIBB_P123' has been downloaded to 'viral' and has been named 'Paenibacillus_phage_phiIBB_P123_genomic_refseq.fna.gz'.
(...)
Finally, I noticed what seems to be a bug in the check for genomes that are skipped at the beginning (this should probably be moved to a new issue):
Skipping already downloaded species: Abaca bunchy top virus, It seems like there are some files in download folder that are neither pre-downloaded species files nor doc_, documentation, or md5checksum files.,
Abalone herpesvirus Victoria AUS 2009, It seems like there are some files in download folder that are neither pre-downloaded species files nor doc_, documentation, or md5checksum files.,
Abalone shriveling syndrome-associated virus, It seems like there are some files in download folder that are neither pre-downloaded species files nor doc_, documentation, or md5checksum files.,
Abelson murine leukemia virus, It seems like there are some files in download folder that are neither pre-downloaded species files nor doc_, documentation, or md5checksum files.,
Abutilon Brazil virus, It seems like there are some files in download folder that are neither pre-downloaded species files nor doc_, documentation, or md5checksum files.,
So after each already downloaded species, the string
It seems like there are some files in download folder that are neither pre-downloaded species files nor doc_, documentation, or md5checksum files.
appears.
For the record, I only tested with viral genomes.
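One way to avoid a notice being repeated after every species name is to classify the folder contents first and report unrecognised files once. The sketch below only illustrates that idea and is not biomartr's actual code: the helpers classify_downloads and report_skipped are hypothetical, and the file-name patterns are assumptions inferred from the log output above (the _genomic_refseq.fna.gz suffix, the doc_ prefix, and the documentation and md5checksum names).

```r
# Hypothetical helpers (not biomartr's implementation). The suffix below is
# an assumption taken from the file name shown in the log output.
species_suffix <- "_genomic_refseq.fna.gz"

# Split file names into species files, bookkeeping files, and anything else.
classify_downloads <- function(files) {
  is_species <- endsWith(files, species_suffix)
  is_meta <- startsWith(files, "doc_") |
    grepl("documentation|md5checksum", files)
  list(
    species = files[is_species],
    meta    = files[is_meta & !is_species],
    other   = files[!is_species & !is_meta]
  )
}

# Report unrecognised files once, then list the species that will be skipped.
report_skipped <- function(files) {
  groups <- classify_downloads(files)
  if (length(groups$other) > 0) {
    message("Unrecognised files in download folder: ",
            paste(groups$other, collapse = ", "))
  }
  # Strip the suffix to recover the species part of each file name.
  species <- substr(groups$species, 1,
                    nchar(groups$species) - nchar(species_suffix))
  message("Skipping already downloaded species: ",
          paste(species, collapse = ", "))
  invisible(species)
}

# Usage with a real folder: report_skipped(list.files("viral"))
```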
Brilliant! Thank you so much for this thorough analysis! This should not happen and I will further look into this!
Dear @johanneswerner
This issue should be solved now with the new version:
meta.retrieval(kingdom = "viral", db = "refseq", type = "genome")
Many thanks, Hajk