Error: Internal error (invalid zip archive). Please try again

Open Rittika1 opened this issue 2 years ago • 10 comments

Hi,

I'm trying to download protein sequences of all Vertebrates using the command below. It's running on Red Hat 8.3.

./datasets download genome taxon Vertebrates --exclude-rna --exclude-seq --exclude-gff3 --exclude-genomic-cds --filename Vertebrates.zip

It starts off downloading okay, but every time the file size reaches 5.2 GB, the download fails with this error.

[Screenshot: datasets download error]

Why does this keep happening every time? I know there is enough space in my folder, so it's not running out of space. Any help would be appreciated.

I have one more question: I want to download all the protein sequences of the Vertebrates in one file. The command I'm currently using splits the download into several folders, each with its own protein.faa file, and I then concatenate all of them to make one protein sequence file. Is there a way to download them all into one file or folder?

Thank you

Rittika1 avatar Jan 10 '22 09:01 Rittika1

Hi Rittika,

Thanks for your feedback.

For such a large amount of data, I would recommend downloading a dehydrated package then rehydrating it. For more details, see: How to download large genome data packages.

You can get proteins for all annotated vertebrate genomes in 3 steps:

  1. Download a dehydrated package (note that I have added the --annotated flag because protein sequences are only available for annotated genomes):
     datasets download genome taxon Vertebrates --exclude-rna --exclude-seq --exclude-gff3 --exclude-genomic-cds --annotated --dehydrated --filename vertebrate-proteins.zip
  2. Unzip the downloaded zip archive:
     unzip vertebrate-proteins.zip -d vertebrate-proteins/
  3. Rehydrate to get all the data:
     datasets rehydrate --directory vertebrate-proteins/
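
The three steps above can be chained in a small shell function. This is only a sketch: the function name `fetch_proteins` and its two arguments are hypothetical, the commands and flags are copied from the steps above, and it assumes the `datasets` and `unzip` executables are on your PATH.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fetch_proteins TAXON OUTDIR
# Runs the dehydrated download, unzips it, then rehydrates.
# Stops at the first step that fails.
fetch_proteins() {
    local taxon="$1" outdir="$2"
    local zip="${outdir%/}.zip"   # e.g. vertebrate-proteins/ -> vertebrate-proteins.zip

    # Step 1: download a dehydrated package (flags as in the steps above)
    datasets download genome taxon "$taxon" \
        --exclude-rna --exclude-seq --exclude-gff3 --exclude-genomic-cds \
        --annotated --dehydrated --filename "$zip" || return 1

    # Step 2: unzip the downloaded archive
    unzip -q "$zip" -d "$outdir" || return 1

    # Step 3: rehydrate to fetch the actual data files
    datasets rehydrate --directory "$outdir"
}

# Example (not run here):
#   fetch_proteins Vertebrates vertebrate-proteins/
```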

Note that there is currently no way to download all the protein sequences as a single file; you'll have to download them and then concatenate the files to get the desired result. We are looking into adding this feature in the future.
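
For the concatenation step, something along these lines may help. It is a minimal sketch, not an official recipe: `concat_proteins` and the output name `all_proteins.faa` are hypothetical, and it simply collects every file named protein.faa found anywhere under the rehydrated directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: concat_proteins DIR OUT
# Concatenates every protein.faa under DIR into the single file OUT.
# Files are processed in sorted path order so the result is reproducible.
# Assumes paths do not contain newlines and OUT is not itself named protein.faa under DIR.
concat_proteins() {
    local dir="$1" out="$2"
    : > "$out"    # create or truncate the output file
    find "$dir" -type f -name 'protein.faa' | sort | while IFS= read -r f; do
        cat "$f" >> "$out"
    done
}

# Example (not run here):
#   concat_proteins vertebrate-proteins/ all_proteins.faa
```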

Best, Eric

Eric Cox, PhD [Contractor] (he/him/his) NCBI Datasets Sequence Enhancements, Tools and Delivery (SeqPlus) NIH/NLM/NCBI [email protected]

ericcox1 avatar Jan 11 '22 18:01 ericcox1

Thank you Eric, this worked, and I could download the datasets. It would be very helpful to roll out the feature to download all files as one.

Rittika1 avatar Jan 13 '22 18:01 Rittika1

Hi there,

I am having the same issue, with the error "Error: Internal error (invalid zip archive). Please try again". However, I am attempting to download much smaller datasets, just ortholog downloads. This happens even for the example commands:

datasets download ortholog gene-id 59272

The ortholog summary function appears to work fine. Any help would be appreciated.

Thanks, Erica

ebur053 avatar Feb 22 '22 22:02 ebur053

Hi Erica,

Thanks for your feedback and sorry to hear that you're having trouble with the download. I tried running the command datasets download ortholog gene-id 59272 a few times from my home computer and I was unable to reproduce the problem. You may have encountered a transient network issue.
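
If the failure really is transient, a small retry wrapper is one way to work around it. This is a generic sketch rather than anything built into the tool: the `retry` function is hypothetical, and it assumes that `datasets` exits with a non-zero status when it prints the "invalid zip archive" error.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Assumption: the wrapped command signals failure via a non-zero exit status.
retry() {
    local max="$1" delay="$2" attempt=1
    shift 2
    while :; do
        "$@" && return 0
        if [ "$attempt" -ge "$max" ]; then
            echo "retry: giving up after $attempt attempts" >&2
            return 1
        fi
        echo "retry: attempt $attempt failed, waiting ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Example (not run here): up to 3 attempts, 30 seconds apart
#   retry 3 30 datasets download ortholog gene-id 59272
```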

Best, Eric

ericcox1 avatar Feb 23 '22 18:02 ericcox1

Hi,

The same issue happened when I downloaded all SARS-CoV-2 GenBank genomes.

[Screenshot, 2022-06-14 17:54:13]

I noticed that the datasets download virus genome command does not have the dehydrated/rehydrate option. Do you know why this is happening and how to fix it?

Thanks, Yexiao

yexiao-cheng avatar Jun 14 '22 10:06 yexiao-cheng

Thanks for making this great tool, super helpful.

I also have this issue. Is there a way to know when there is a network error?

I am running the example dataset (using the current version, datasets 13.40.0):

datasets download genome accession GCF_000001405.40 --dehydrated --exclude-rna --exclude-genomic-cds 
Downloading: ncbi_dataset.zip    2.15kB 14.4MB/s
Error: Internal error (invalid zip archive). Please try again

chriswyatt1 avatar Sep 20 '22 09:09 chriswyatt1

Hi, thanks for contacting us and sorry to hear you got an error. We suspect it's related to some technical glitches earlier today. It should be working now. Let us know if you are still getting errors.

Nuala

Nuala A. O'Leary, PhD Product Owner, NCBI Datasets National Center for Biotechnology Information, NLM, NIH, DHHS [Contractor] Building 45 Room 6As.41 Bethesda, MD 20892 tel 301.402.1808

olearyna avatar Sep 20 '22 15:09 olearyna

Thanks. Yes, it's working perfectly again :)

chriswyatt1 avatar Sep 20 '22 17:09 chriswyatt1

Hi, thank you for this tool... perhaps those glitches have returned?

./datasets download genome taxon "Bacilli" --reference --exclude-gff3 --exclude-genomic-cds --exclude-protein --exclude-rna --dehydrated
Collecting 1,936 genome accessions [================================================] 100% 1936/1936
Downloading: ncbi_dataset.zip    1.05MB 49.1kB/s
Error: Internal error (invalid zip archive). Please try again

I've also had issues when not passing --dehydrated: the tool attempts to download what seem to be much larger packages than the UI would indicate, to the point where it times out, leaving a corrupt zip.

tshauck avatar Oct 05 '22 14:10 tshauck

Hi Trent,

Thanks for your feedback. I was unable to reproduce the problem. It's possible that you may have encountered a temporary network glitch. We are going to look at our services to see if there are any issues.

About your second issue:

I've also had issues when not passing --dehydrated: the tool attempts to download what seem to be much larger packages than the UI would indicate, to the point where it times out, leaving a corrupt zip.

Would you mind sharing an example?

Best, Eric

ericcox1 avatar Oct 05 '22 19:10 ericcox1