resume option
I'm experiencing disrupted downloads. Is there an option to resume an interrupted download? My command line: datasets download genome taxon 2 --filename taxon_2_bacteria.zip
My goal is to download "all" bacteria, including genomes that are not annotated or are only available as contigs. I don't experience issues when downloading reference genomes, but we basically want the opposite: all non-reference bacterial genomes.
I've tried rsync based on the assembly summary files in https://ftp.ncbi.nlm.nih.gov/genomes/genbank/, but that does not seem to give me the whole set.
Any options or suggestions here?
Hi mibwurmoleco,
Thanks for your feedback. There is currently no option to resume a download but adding that feature is in our long-term plan for the datasets command-line tool.
Would you be interested in having a quick zoom meeting later this week so we can talk about the data that you are looking for? Please feel free to send me an email at [email protected] and I'd be happy to set something up!
Best, Eric
Eric Cox, PhD [Contractor] (he/him/his) NCBI Datasets Sequence Enhancements, Tools and Delivery (SeqPlus) NIH/NLM/NCBI [email protected]
I have the same issue.
Hi Ömer,
Thanks for your feedback. For your information, all bacterial genomes comes to ~1.3 M genomes, which could be 15-20 TB of data and may take 7-10 days to download. With that said, I would recommend using our protocol for downloading large genome data packages. Note that this protocol is resumable.
For all bacteria, this will look something like this:
# Download a dehydrated genome data package for all bacteria
datasets download genome taxon 2 --dehydrated --filename bacteria.zip
# Unzip the package
unzip bacteria.zip -d bacteria
# Rehydrate the package (Get the data from NCBI)
datasets rehydrate --directory bacteria/
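Before committing to a multi-terabyte rehydration, it can help to estimate the total download size from the package's fetch file. A minimal sketch, assuming the fetch file (e.g. bacteria/ncbi_dataset/fetch.txt) is tab-separated with the expected size in bytes in the second column; check your own package's layout before relying on this:

```shell
# Sum the expected download size (in bytes) listed in a dehydrated
# package's fetch file. Assumes tab-separated lines with the byte
# count in column 2.
fetch_total_bytes() {
  awk -F '\t' '{ total += $2 } END { printf "%d\n", total }' "$1"
}

# Example usage:
#   fetch_total_bytes bacteria/ncbi_dataset/fetch.txt
```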
I hope that helps. Let me know if you have any other questions.
Best, Eric
Hi Eric,
Thanks for the reply.
I am already using that protocol; however, the download does not appear to be resumable. It fails randomly with the error below, and rerunning the last command (rehydrate) starts downloading all the links from the fetch file over again:
taxon="bacteria"
datasets download genome taxon "$taxon" --filename "$taxon.zip" --dehydrated
unzip "$taxon.zip" -d "ncbi/$taxon"
rm "$taxon.zip"
datasets rehydrate --directory "ncbi/$taxon" --max-workers 20
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x8f08a1]

goroutine 114 [running]:
datasets_cli/v2/datasets.downloadFileWorker.func2(0xc000537f48, 0xc000a26540, 0xc000537f30, 0x0?, 0x0?)
        apps/public/Datasets/v2/datasets/datasets/Rehydrate.go:190 +0x2c1
datasets_cli/v2/datasets.downloadFileWorker(0x0?, 0x0?, 0xc00001e3c0)
        apps/public/Datasets/v2/datasets/datasets/Rehydrate.go:217 +0xf7
created by datasets_cli/v2/datasets.downloadMultipleFiles
        apps/public/Datasets/v2/datasets/datasets/Rehydrate.go:245 +0x14b
How can I make it resumable to skip already downloaded genomes?
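In the meantime, one manual workaround is to drive the downloads directly from the package's fetch file, skipping files that are already complete on disk. This is a hedged sketch, not the official tool's behavior: it assumes the fetch file (e.g. ncbi/bacteria/ncbi_dataset/fetch.txt) is tab-separated with columns URL, expected size in bytes, and relative destination path, and that curl is available; verify the layout of your own fetch file first.

```shell
# Hypothetical resumable re-download loop over a dehydrated package's
# fetch file. Completed files are skipped; partial files are resumed
# with curl's -C - (continue at the current offset).
rehydrate_resume() {
  base="$1"    # unzipped package directory, e.g. ncbi/bacteria
  fetch="$2"   # fetch file, e.g. ncbi/bacteria/ncbi_dataset/fetch.txt
  while IFS="$(printf '\t')" read -r url size path; do
    dest="$base/$path"
    mkdir -p "$(dirname "$dest")"
    # Skip files that already have the expected size on disk
    have=0
    [ -f "$dest" ] && have=$(wc -c < "$dest")
    [ "$((have))" -eq "$size" ] && continue
    # -C - resumes a partial file where it left off
    curl -fsS -C - -o "$dest" "$url"
  done < "$fetch"
}

# Example usage:
#   rehydrate_resume ncbi/bacteria ncbi/bacteria/ncbi_dataset/fetch.txt
```

Rerunning the function after an interruption only touches missing or partial files, which is the skip-already-downloaded behavior asked about above.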
Hi Ömer,
Our developers went back to look at the code and have identified the bug that prevents downloads from resuming. We are working on a fix and hope to release it soon; I will comment on this issue when it is ready.
Thanks, Eric