How to effectively download the SA-1B dataset

Open CauchyFanUpdate opened this issue 1 year ago • 8 comments

Thank you very much for such an outstanding contribution. I tried to download the dataset from the official website, but the links in the provided download text could not be fetched with wget, and downloading the files one by one is very time-consuming. How can I download the dataset efficiently?

CauchyFanUpdate avatar Apr 06 '23 03:04 CauchyFanUpdate

Hi, my 2 cents about downloading the dataset:

It works for me with wget, but only if I launch the wget command directly in the terminal. If I launch it from a script file, I get 403 Forbidden errors. I guess there is some protection against batch downloads.

The solution that worked for me was to chain the wget calls with && to make a queue:

wget link1 -O tar_name1 && wget link2 -O tar_name2 && ...

I could not concatenate all 1000 commands at once, but at least this reduces the number of times I have to manually launch a command.
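
For example, a batch might look like this (the URLs below are placeholders for the signed CDN links from the download page):

wget "https://example.com/sa_000000.tar" -O sa_000000.tar && \
wget "https://example.com/sa_000001.tar" -O sa_000001.tar && \
wget "https://example.com/sa_000002.tar" -O sa_000002.tar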

Some insights about the dataset (not fully downloaded yet):

  • 1000 tar files to download, each around 11 GB. That's over 10 TB of data in total, probably 12 TB uncompressed
  • Fortunately, each tar file is standalone, with approximately 10k images per tar file and one json file per image (see the quick check below).
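
To peek at the layout of one archive without extracting it (the file name below is a placeholder for any of the downloaded tars):

tar tf sa_000000.tar | head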

ClementPinard avatar Apr 06 '23 15:04 ClementPinard

@ClementPinard My single wget command also gets 403. Do you have any idea?

youkaichao avatar Apr 06 '23 16:04 youkaichao

Same problem. I got a 403 with a single command-line wget.

Phoveran avatar Apr 06 '23 17:04 Phoveran

Hi @youkaichao @Phoveran, you need to enclose the link in double quotes.
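
For example (the link below is a placeholder; the real signed CDN links contain & characters in their query strings):

wget -O sa_000000.tar "https://example.com/sa_000000.tar?param1=value1&param2=value2"

Without the quotes, the shell treats the unquoted & as a command separator, so part of the URL is lost and the server returns 403 Forbidden.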

fredzzhang avatar Apr 08 '23 07:04 fredzzhang

For me, I just copied and pasted all the provided file names and links into a txt file, for example links.txt. Then in the terminal, run:

while read file_name cdn_link; do wget -O "$file_name" "$cdn_link"; done < links.txt

This works on my side. Hope it can help you guys.

peiwang062 avatar Apr 11 '23 18:04 peiwang062

Thank you very much, I will try this method

CauchyFanUpdate avatar Apr 12 '23 01:04 CauchyFanUpdate

Or you could download all of them with Chrono, a download-manager extension for Google Chrome.

beefsoup18 avatar Apr 16 '23 08:04 beefsoup18

How do I use the downloaded files?

liuxy1103 avatar May 02 '23 13:05 liuxy1103

Building on top of @peiwang062's script, using aria2 adds multi-threaded downloading and resuming. Remove the header line of the links file provided by Meta, then run:

while read file_name cdn_link; do aria2c -x4 -c -o "$file_name" "$cdn_link"; done < file_list.txt

This will download each file using 4 threads/streams while supporting resume.
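
One way to strip that header line (assuming the links file from Meta is saved as links.txt):

tail -n +2 links.txt > file_list.txt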

tseven avatar Jun 02 '23 21:06 tseven

@tseven that works great! Do you have a recommendation for extracting them as well? Preferably throwing away each tar after extraction, because I do not have more than 15 TB of disk space.

tommiekerssies avatar Aug 30 '23 08:08 tommiekerssies

@tommiekerssies you could do something like this:

for file in *.tar; do tar xvf "${file}" && rm "${file}"; done

This will extract all the tar files, deleting each archive after it has been extracted successfully.
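
If you would rather keep each archive's contents in its own folder, a variant (untested sketch) along the same lines:

for file in *.tar; do dir="${file%.tar}"; mkdir -p "$dir" && tar xf "$file" -C "$dir" && rm "$file"; done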

tseven avatar Sep 02 '23 19:09 tseven

@tseven thank you!

tommiekerssies avatar Sep 03 '23 15:09 tommiekerssies