datasets
Error: Internal error (invalid zip archive). Please try again
Hi,
I'm trying to download protein sequences of all Vertebrates using this command. It's running on Red Hat 8.3.
./datasets download genome taxon Vertebrates --exclude-rna --exclude-seq --exclude-gff3 --exclude-genomic-cds --filename Vertebrates.zip
It starts off downloading okay, but every time the file size reaches 5.2 GB, it fails with this error.
Why does this keep happening every time? I know there is enough space in my folder, so it's not running out of space. Any help would be appreciated.
I have one more question: I want to download all the protein sequences of the Vertebrates in one file. The command I'm currently using splits the download into several folders, each with its own protein.faa file, and I concatenate them all to make one protein sequence file. Is there a way to download them all into one file or folder?
Thank you
Hi Rittika,
Thanks for your feedback.
For such a large amount of data, I would recommend downloading a dehydrated package then rehydrating it. For more details, see: How to download large genome data packages.
You can get proteins for all annotated vertebrate genomes in 3 steps:
- Download a dehydrated package (note that I have added the --annotated flag because protein sequences are only available for annotated genomes)
datasets download genome taxon Vertebrates --exclude-rna --exclude-seq --exclude-gff3 --exclude-genomic-cds --annotated --dehydrated --filename vertebrate-proteins.zip
- Unzip the downloaded zip archive
unzip vertebrate-proteins.zip -d vertebrate-proteins/
- Rehydrate to get all the data
datasets rehydrate --directory vertebrate-proteins/
Note that there is no way to download all the protein sequences as a single file--you'll have to download and then concatenate to get the desired result. We are looking into adding this feature in the future.
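Until that feature exists, the concatenation step can be scripted. The following is a minimal sketch, not an official NCBI tool: the function name concat_proteins is hypothetical, and it assumes the rehydrated package stores one protein.faa per genome somewhere under the package directory (for example vertebrate-proteins/ncbi_dataset/data/<accession>/protein.faa). It also assumes each file ends with a newline.

```shell
# Concatenate every protein.faa found under a rehydrated package directory.
# Usage: concat_proteins <package_dir> <output_fasta>
# Note: write the output file outside <package_dir>, or at least give it a
# name other than protein.faa, so it is not picked up by the search itself.
concat_proteins() {
    pkg_dir=$1
    out_file=$2
    # Start from an empty output file so reruns do not append duplicates.
    : > "$out_file"
    # Sort the paths so the order of records is reproducible between runs.
    find "$pkg_dir" -type f -name 'protein.faa' | sort | while IFS= read -r faa; do
        cat "$faa" >> "$out_file"
    done
}
```

For the package above, this could be called as: concat_proteins vertebrate-proteins/ all-vertebrate-proteins.faa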
Best, Eric
Eric Cox, PhD [Contractor] (he/him/his) NCBI Datasets Sequence Enhancements, Tools and Delivery (SeqPlus) NIH/NLM/NCBI [email protected]
Thank you Eric, this worked, and I could download the datasets. It would be very helpful to roll out the feature to download all files as one.
Hi there,
I am having the same issue, with the error "Error: Internal error (invalid zip archive). Please try again"
However, I am attempting to download much smaller datasets - just ortholog downloads.
This happens even for the example commands:
datasets download ortholog gene-id 59272
The ortholog summary function appears to work fine. Any help would be appreciated.
Thanks,
Erica
Hi Erica,
Thanks for your feedback and sorry to hear that you're having trouble with the download.
I tried running the command datasets download ortholog gene-id 59272 a few times from my home computer and I was unable to reproduce the problem. You may have encountered a transient network issue.
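If the failure really is transient, simply rerunning the command may be enough. Below is a minimal sketch of a generic retry wrapper. The function name retry is hypothetical, and it assumes the datasets command exits with a non-zero status when it prints the "invalid zip archive" error, which has not been confirmed in this thread.

```shell
# Run a command up to max_tries times, pausing between failed attempts.
# Usage: retry <max_tries> <delay_seconds> <command> [args...]
# Example (assumes a non-zero exit status on failure):
#   retry 3 30 datasets download ortholog gene-id 59272
retry() {
    max_tries=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        # Stop as soon as the wrapped command succeeds.
        "$@" && return 0
        if [ "$attempt" -ge "$max_tries" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed; retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}
```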
Best, Eric
Hi,
The same issue happened when I downloaded all SARS-CoV-2 GenBank genomes.
I noticed that the datasets download virus genome command does not have the dehydrated/rehydrate option. Do you know why this is happening and how to fix it?
Thanks, Yexiao
Thanks for making this great tool, super helpful.
I also have this issue. Is there a way to know when there is a network error?
I am running the example dataset (using the current version, datasets version 13.40.0):
datasets download genome accession GCF_000001405.40 --dehydrated --exclude-rna --exclude-genomic-cds
Downloading: ncbi_dataset.zip 2.15kB 14.4MB/s
Error: Internal error (invalid zip archive). Please try again
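One way to check a downloaded file independently of the datasets client is to test it with unzip. This is a minimal sketch: the function name zip_ok is hypothetical, and it only reports whether the file on disk is a readable zip archive. It cannot tell whether a failure came from the network or from the server.

```shell
# Return success only if the file is a zip archive that passes unzip's
# integrity test (-t tests the archive, -q keeps the output quiet).
# Usage: zip_ok <file.zip>
zip_ok() {
    unzip -tq "$1" > /dev/null 2>&1
}

# Example:
#   zip_ok ncbi_dataset.zip || echo "ncbi_dataset.zip is missing or corrupt" >&2
```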
Hi, Thanks for contacting us and sorry to hear you got an error. We suspect it's related to some technical glitches earlier today. It should be working now. Let us know if you are still getting errors.
Nuala
Nuala A. O'Leary, PhD Product Owner, NCBI Datasets National Center for Biotechnology Information, NLM, NIH, DHHS [Contractor] Building 45 Room 6As.41 Bethesda, MD 20892 tel 301.402.1808
Thanks. Yes, it's working perfectly again :)
Hi, thank you for this tool... perhaps those glitches have returned?
./datasets download genome taxon "Bacilli" --reference --exclude-gff3 --exclude-genomic-cds --exclude-protein --exclude-rna --dehydrated
Collecting 1,936 genome accessions [================================================] 100% 1936/1936
Downloading: ncbi_dataset.zip 1.05MB 49.1kB/s
Error: Internal error (invalid zip archive). Please try again
I've also had issues not passing --dehydrated and the tool attempting to download what seems much larger packages than that the UI would indicate, to the point where it times out leaving a corrupt zip.
Hi Trent,
Thanks for your feedback. I was unable to reproduce the problem. It's possible that you may have encountered a temporary network glitch. We are going to look at our services to see if there are any issues.
About your second issue:
I've also had issues not passing --dehydrated and the tool attempting to download what seems much larger packages than that the UI would indicate, to the point where it times out leaving a corrupt zip.
Would you mind sharing an example?
Best, Eric