datasets download genome: Gateway Error (408 Request Timeout)

Open janpb opened this issue 2 years ago • 3 comments

Downloading dehydrated genome datasets fails with a request timeout.

command: datasets download genome taxon Viridiplantae --annotated --assembly-level chromosome,complete_genome,scaffold --dehydrated --exclude-rna --exclude-seq --reference --filename viridiplantae.zip

expected: viridiplantae.zip with ncbi_dataset directories for rehydration

Result: Error: Error: [gateway] Gateway Error (408 Request Timeout)

I tried on different machines (home, campus). According to the datasets website, this should include approx. 220 datasets/species and be less than 20 MB. I'm not sure if the query takes too long on the NCBI side or if this is a local issue (on my side). Because I see the same error on the university campus, I doubt it's a local network issue.

I also tested the command with and without --dehydrated and restricted the download to annotation files only, with the same result.

datasets version: 12.4.0
uname -a: Linux 5.12.14-arch1-1 x86_64 GNU/Linux
go version: go1.16.5 linux/amd64

P.S. datasets and dataformat are very useful tools. Thanks heaps.

janpb, Jul 22 '21 19:07

Hi Dr. Buchmann,

Thanks for your feedback. Sorry to hear you're having trouble downloading a genome data package for plants.

I was unable to reproduce the error from my home computer. It may be possible that you encountered a transient issue. Could you please try again and let me know what happens?

Best, Eric

Eric Cox, PhD [Contractor] (he/him/his)
NCBI Datasets
Sequence Enhancements, Tools and Delivery (SeqPlus)
NIH/NLM/NCBI

ericcox1, Jul 26 '21 17:07

Dear Dr. Cox,

Thanks for your reply. I reran the command several times without success, i.e., always getting Error: [gateway] Gateway Error (408 Request Timeout).

I could get the dehydrated zip file via the datasets genome website and download the data but got a GOAWAY error during the first attempt (similar to #50). It did work after redownloading the data. Getting the data without using the website would be more beneficial, though.

Because the dehydrated zip file is very small, its size is very likely not the problem. I did some simple testing with taxa of different sizes, and it seems that the farther I am from a taxonomic leaf node, the more likely I am to get a timeout error. The number of expected genomes was taken from the datasets website.

Taxon            Genomes  Time [s]  Status
Viridiplantae    225      66.15     Timeout
Mesangiospermae  197      66.43     Timeout
eudicotyledons   158      66.21     Timeout
rosids           114      65.97     Timeout
Brassiceae       7        6.35      Success
Arabidopsis      4        6.34      Success

Timeout: Error: [gateway] Gateway Error (408 Request Timeout)
Success: no error, successfully downloaded dehydrated zip archive

I used the following command, replacing $taxon with the corresponding taxon in the table.

/usr/bin/time -p  datasets download genome taxon $taxon --annotated --assembly-level chromosome,complete_genome,scaffold --dehydrated --exclude-rna --exclude-seq --reference --filename taxon.zip
Error: [gateway] Gateway Error (408 Request Timeout)
real 66.15
user 0.06
sys 0.01
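
The per-taxon comparison above could be scripted roughly as follows. This is only a sketch: `time_taxon` and the `DATASETS_BIN` variable are hypothetical names introduced here, the flags are copied from the command above, and it assumes that datasets exits with a non-zero status when it reports an error. It also writes one archive per taxon rather than reusing taxon.zip, and it measures whole seconds only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run the same dehydrated download for one taxon and
# print "taxon<TAB>elapsed seconds<TAB>status" on a single line.

# Path to the datasets binary; overridable, e.g. for testing with a stub.
DATASETS_BIN="${DATASETS_BIN:-datasets}"

time_taxon() {
  local taxon="$1" start end status
  start=$(date +%s)
  # Flags are taken unchanged from the command in this issue.
  # Assumption: a failed request makes the command exit non-zero.
  if "$DATASETS_BIN" download genome taxon "$taxon" --annotated \
       --assembly-level chromosome,complete_genome,scaffold \
       --dehydrated --exclude-rna --exclude-seq --reference \
       --filename "${taxon}.zip" >/dev/null 2>&1; then
    status="Success"
  else
    status="Failed"
  fi
  end=$(date +%s)
  printf '%s\t%s\t%s\n' "$taxon" "$((end - start))" "$status"
}

# Example use, once the function is defined in the current shell:
#   for taxon in Viridiplantae rosids Arabidopsis; do time_taxon "$taxon"; done
```

"Failed" here only means a non-zero exit status; the actual error message still has to be checked separately to confirm that it is the 408 timeout.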

janpb, Jul 27 '21 06:07

Thanks for the detailed testing data. This is really helpful. We are continuing to look into the issue.

-Eric

ericcox1, Jul 27 '21 14:07