How to effectively download the SA-1B dataset

Open CauchyFanUpdate opened this issue 1 year ago • 8 comments

Thank you very much for such an outstanding contribution. I tried to download the dataset from the official website, but the links in the provided download text could not be fetched with wget, and downloading the files one by one is very time-consuming. How can I download the dataset efficiently?

CauchyFanUpdate avatar Apr 06 '23 03:04 CauchyFanUpdate

Hi, my 2 cents about downloading the dataset:

It works for me with wget, but only if I launch the wget command directly in the terminal. If I launch it from a script file, I get 403 Forbidden errors. I guess there is some protection against batch downloads.

The solution that worked for me was to chain the wget calls with && to make a queue:

wget link1 -O tar_name1 && wget link2 -O tar_name2 && ...

I could not concatenate all 1000 commands at once, but at least this reduces the number of times I have to manually launch a command.
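
For example, a batch might look like this (the URLs below are placeholders for the signed CDN links from the download page):

wget "https://example.com/sa_000000.tar" -O sa_000000.tar && \
wget "https://example.com/sa_000001.tar" -O sa_000001.tar && \
wget "https://example.com/sa_000002.tar" -O sa_000002.tar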

Some insights about the dataset (not fully downloaded yet):

  • 1000 tar files to download, each around 11 GB. That's over 10 TB of data in total, probably 12 TB uncompressed
  • Fortunately, each tar file is standalone, with approximately 10k images per tar file and one json file per image (see the quick check below).
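
To peek at the layout of one archive without extracting it (the file name below is a placeholder for any of the downloaded tars):

tar tf sa_000000.tar | head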

ClementPinard avatar Apr 06 '23 15:04 ClementPinard

@ClementPinard My single wget command also gets 403. Do you have any idea?

youkaichao avatar Apr 06 '23 16:04 youkaichao

Same problem. I got a 403 with a single command-line wget.

Phoveran avatar Apr 06 '23 17:04 Phoveran

Hi @youkaichao @Phoveran, you need to enclose the link in double quotes.
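
For example (the link below is a placeholder; the real signed CDN links contain & characters in their query strings):

wget -O sa_000000.tar "https://example.com/sa_000000.tar?param1=value1&param2=value2"

Without the quotes, the shell treats the unquoted & as a command separator, so part of the URL is lost and the server returns 403 Forbidden.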

fredzzhang avatar Apr 08 '23 07:04 fredzzhang

For me, I just copied and pasted all the provided file names and links into a txt file, for example links.txt. Then in the terminal, run:

while read file_name cdn_link; do wget -O "$file_name" "$cdn_link"; done < links.txt

This works on my side. Hope it can help you guys.

peiwang062 avatar Apr 11 '23 18:04 peiwang062

Thank you very much, I will try this method

CauchyFanUpdate avatar Apr 12 '23 01:04 CauchyFanUpdate

Or you could download all of them with Chrono, a download-manager extension for Google Chrome.

beefsoup18 avatar Apr 16 '23 08:04 beefsoup18

How do I use the downloaded files?

liuxy1103 avatar May 02 '23 13:05 liuxy1103

Building on top of @peiwang062's script, using aria2 adds multi-threaded downloading and resuming. Remove the header line of the links file provided by Meta, then run:

while read file_name cdn_link; do aria2c -x4 -c -o "$file_name" "$cdn_link"; done < file_list.txt

This will download each file using 4 threads/streams while supporting resume.
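
One way to strip that header line (assuming the links file from Meta is saved as links.txt):

tail -n +2 links.txt > file_list.txt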

tseven avatar Jun 02 '23 21:06 tseven

@tseven that works great! Do you have a recommendation for extracting them as well? Preferably throwing away each tar after extraction, because I do not have more than 15 TB of disk space.

tommiekerssies avatar Aug 30 '23 08:08 tommiekerssies

@tommiekerssies you could do something like this:

for file in *.tar; do tar xvf "${file}" && rm "${file}"; done

This will extract all the tar files, deleting each archive after it has been extracted successfully.
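
If you would rather keep each archive's contents in its own folder, a variant (untested sketch) along the same lines:

for file in *.tar; do dir="${file%.tar}"; mkdir -p "$dir" && tar xf "$file" -C "$dir" && rm "$file"; done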

tseven avatar Sep 02 '23 19:09 tseven

@tseven thank you!

tommiekerssies avatar Sep 03 '23 15:09 tommiekerssies