segment-anything
How to effectively download the SA-1B dataset
Thank you very much for such an outstanding contribution. I tried to download the dataset from the official website, but the links in the provided download text could not be fetched with wget, and downloading the files one by one is very time-consuming. How can I download the dataset efficiently?
Hi, my 2 cents about downloading the dataset:
wget works for me, but only if I launch the command directly in the terminal. If I launch it from a script file, I get 403 Forbidden errors. I guess there is some protection against batch downloads.
The solution that worked for me was to chain the wget calls with && to make a queue:
wget link1 -O tar_name1 && wget link2 -O tar_name2 && ...
I could not concatenate all 1000 commands at once, but at least it reduces the number of times I need to manually launch a command.
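If it helps, here is a rough sketch of how one could generate such a chained command automatically, assuming you have saved the provided file names and links into a two-column text file (the name links.txt and the exact column order are assumptions; adjust to your file):
# Turn a links file (file_name link per line) into one long "wget ... && wget ..." command.
awk '{printf "wget -O \"%s\" \"%s\" && ", $1, $2} END {print "true"}' links.txt > chained_wget.txt
# Copy the contents of chained_wget.txt and paste it into an interactive terminal.
You can first split links.txt into smaller chunks if the full 1000-command line is too long to paste in one go.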
Some insights about the dataset (not fully downloaded yet):
- 1000 tar files to download, each around 11 GB, so roughly 11 TB of data in total, probably 12 TB uncompressed
- Fortunately, each tar file is standalone, with approximately 10k images per tar file and one JSON file per image.
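If you want to peek at the layout before extracting everything, you can list the contents of a single archive (the tar name below is just a placeholder for whichever file you downloaded):
# List the first entries of one archive without extracting it.
tar tf sa_000000.tar | head -n 20
Each archive should contain roughly 10k image files, each with a matching .json annotation file.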
@ClementPinard My single wget command also gets 403. Do you have any idea?
Same problem. I get 403 even with a single wget command in the terminal.
Hi @youkaichao @Phoveran, you need to enclose the link in double quotes.
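The CDN links contain query parameters joined by & characters, which the shell otherwise treats as command separators, so the request is likely sent without its required parameters and rejected. For example (the URL below is a placeholder, not a real SA-1B link):
# Quote the whole URL so the shell does not split it at the & characters.
wget -O sa_000000.tar "https://cdn.example.com/sa_000000.tar?token=abc&expires=123"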
For me, I just copied and pasted all the provided file names and links into a text file, for example 'links.txt'. Then, in the terminal, run:
while read file_name cdn_link; do wget -O "$file_name" "$cdn_link"; done < links.txt
This works on my side. Hope it helps you guys.
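For reference, the links file is expected to have one file name and one link per line, separated by whitespace, roughly like this (names and links here are placeholders, shortened for readability):
sa_000000.tar  https://cdn.example.com/sa_000000.tar?token=...
sa_000001.tar  https://cdn.example.com/sa_000001.tar?token=...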
Thank you very much, I will try this method
Or you could download all of them with Chrono, a download-manager extension for Google Chrome.
How do I use the downloaded files?
Building on top of @peiwang062's script, aria2 supports multi-threaded downloading and resuming.
Remove the header line of the links file provided by Meta, then run:
while read file_name cdn_link; do aria2c -x4 -c -o "$file_name" "$cdn_link"; done < file_list.txt
This will download each file using 4 threads/streams while supporting resume.
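If you prefer a single aria2 invocation instead of a shell loop, aria2c also accepts an input file via -i, where each URI line is followed by an indented out= option naming the output file. A minimal sketch, assuming the same two-column file_list.txt (file names are placeholders):
# Convert the two-column list into aria2's input-file format.
awk '{print $2; print "  out=" $1}' file_list.txt > aria2_input.txt
# -j2 limits concurrent downloads; -x4 uses 4 connections per file; -c resumes partial files.
aria2c -i aria2_input.txt -x4 -c -j2
A side benefit is that the URLs never pass through the shell, so the & characters in the query strings need no quoting.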
@tseven that works great! Do you have a recommendation for extracting them as well? Preferably deleting each tar after extraction, because I do not have more than 15 TB of disk space.
@tommiekerssies you could do something like this:
for file in *.tar; do tar xvf "${file}" && rm "${file}"; done
This will extract all the tar files while deleting the file after successful extraction.
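If disk space is the main constraint, you could also interleave download and extraction so that at most one tar sits on disk at a time. A sketch along the same lines, assuming the same file_list.txt as above:
# Download, extract, and delete each archive before moving on to the next one.
while read -r file_name cdn_link; do aria2c -x4 -c -o "$file_name" "$cdn_link" && tar xf "$file_name" && rm "$file_name"; done < file_list.txt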
@tseven thank you!