resume option
I'm experiencing disrupted downloads. Is there an option to resume an interrupted download? My command line: datasets download genome taxon 2 --filename taxon_2_bacteria.zip
My goal is to download "all" bacteria, including genomes that are not annotated or are only available as contigs. I don't experience issues when downloading reference genomes, but we basically want the opposite: all non-reference bacterial genomes.
I've tried rsync based on the assembly summary files in https://ftp.ncbi.nlm.nih.gov/genomes/genbank/, but that does not seem to give me the whole set.
Any options or suggestions here?
Hi mibwurmoleco,
Thanks for your feedback. There is currently no option to resume a download but adding that feature is in our long-term plan for the datasets command-line tool.
Would you be interested in having a quick zoom meeting later this week so we can talk about the data that you are looking for? Please feel free to send me an email at [email protected] and I'd be happy to set something up!
Best, Eric
Eric Cox, PhD [Contractor] (he/him/his) NCBI Datasets Sequence Enhancements, Tools and Delivery (SeqPlus) NIH/NLM/NCBI [email protected]
I have the same issue.
Hi Ömer,
Thanks for your feedback. For your information, all bacterial genomes comes to ~1.3 M genomes, which could be 15-20 TB of data and may take 7-10 days to download. With that said, I would recommend using our protocol for downloading large genome data packages. Note that this protocol is resumable.
For all bacteria, this will look something like this:
# Download a dehydrated genome data package for all bacteria
datasets download genome taxon 2 --dehydrated --filename bacteria.zip
# Unzip the package
unzip bacteria.zip -d bacteria
# Rehydrate the package (Get the data from NCBI)
datasets rehydrate --directory bacteria/
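Before committing to a multi-terabyte rehydration, it can help to estimate the total download size from the package's fetch file. A minimal sketch, assuming the fetch file (e.g. bacteria/ncbi_dataset/fetch.txt) is tab-separated with the expected size in bytes in the second column; check your own package's layout before relying on this:

```shell
# Sum the expected download size (in bytes) listed in a dehydrated
# package's fetch file. Assumes tab-separated lines with the byte
# count in column 2.
fetch_total_bytes() {
  awk -F '\t' '{ total += $2 } END { printf "%d\n", total }' "$1"
}

# Example usage:
#   fetch_total_bytes bacteria/ncbi_dataset/fetch.txt
```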
I hope that helps. Let me know if you have any other questions.
Best, Eric
Hi Eric,
Thanks for the reply.
I am already using that protocol; however, the download does not appear to be resumable. It fails randomly with the error below, and rerunning the last command (rehydrate) starts downloading all the links from the fetch file over again:
taxon="bacteria"
datasets download genome taxon "$taxon" --filename "$taxon.zip" --dehydrated
unzip "$taxon.zip" -d "ncbi/$taxon"
rm "$taxon.zip"
datasets rehydrate --directory "ncbi/$taxon" --max-workers 20
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x8f08a1]

goroutine 114 [running]:
datasets_cli/v2/datasets.downloadFileWorker.func2(0xc000537f48, 0xc000a26540, 0xc000537f30, 0x0?, 0x0?)
        apps/public/Datasets/v2/datasets/datasets/Rehydrate.go:190 +0x2c1
datasets_cli/v2/datasets.downloadFileWorker(0x0?, 0x0?, 0xc00001e3c0)
        apps/public/Datasets/v2/datasets/datasets/Rehydrate.go:217 +0xf7
created by datasets_cli/v2/datasets.downloadMultipleFiles
        apps/public/Datasets/v2/datasets/datasets/Rehydrate.go:245 +0x14b
How can I make it resumable to skip already downloaded genomes?
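In the meantime, one manual workaround is to drive the downloads directly from the package's fetch file, skipping files that are already complete on disk. This is a hedged sketch, not the official tool's behavior: it assumes the fetch file (e.g. ncbi/bacteria/ncbi_dataset/fetch.txt) is tab-separated with columns URL, expected size in bytes, and relative destination path, and that curl is available; verify the layout of your own fetch file first.

```shell
# Hypothetical resumable re-download loop over a dehydrated package's
# fetch file. Completed files are skipped; partial files are resumed
# with curl's -C - (continue at the current offset).
rehydrate_resume() {
  base="$1"    # unzipped package directory, e.g. ncbi/bacteria
  fetch="$2"   # fetch file, e.g. ncbi/bacteria/ncbi_dataset/fetch.txt
  while IFS="$(printf '\t')" read -r url size path; do
    dest="$base/$path"
    mkdir -p "$(dirname "$dest")"
    # Skip files that already have the expected size on disk
    have=0
    [ -f "$dest" ] && have=$(wc -c < "$dest")
    [ "$((have))" -eq "$size" ] && continue
    # -C - resumes a partial file where it left off
    curl -fsS -C - -o "$dest" "$url"
  done < "$fetch"
}

# Example usage:
#   rehydrate_resume ncbi/bacteria ncbi/bacteria/ncbi_dataset/fetch.txt
```

Rerunning the function after an interruption only touches missing or partial files, which is the skip-already-downloaded behavior asked about above.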
Hi Ömer,
Our developers went back to look at the code and have identified the bug that prevents downloads from resuming. We are working on a fix and hope to release it soon; I will comment on this issue when it is ready.
Thanks, Eric