Downloading the full dataset
Has anyone tried downloading the full dataset? https://ai.facebook.com/datasets/segment-anything-downloads/ Any idea how big the data is?
I started yesterday and have only completed about 10 files so far... File sizes are not the same, so there's no way of estimating the total size.
import os
import requests
from tqdm import tqdm
import pandas as pd
import boto3
def download_file(url, headers, output_filename):
    response = requests.get(url, headers=headers, stream=True)
    response.raise_for_status()
    total_size = int(response.headers.get('content-length', 0))
    progress_bar = tqdm(total=total_size, unit='iB', unit_scale=True)
    try:
        with open(output_filename, 'wb') as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)
                progress_bar.update(len(chunk))
        return True
    except Exception as e:
        print(e)
        return None
    finally:
        progress_bar.close()


def upload_file_to_s3(file_path, bucket, object_key):
    s3 = boto3.client('s3')
    with open(file_path, 'rb') as file:
        s3.upload_fileobj(
            file,
            bucket,
            object_key,
            ExtraArgs={
                'StorageClass': 'STANDARD_IA'  # S3 Standard-Infrequent Access is called "STANDARD_IA" in boto3
            }
        )


if __name__ == "__main__":
    headers = {
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "en-US,en;q=0.7",
        "cache-control": "no-cache",
        "pragma": "no-cache",
        "sec-fetch-dest": "document",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "none",
        "sec-fetch-user": "?1",
        "sec-gpc": "1",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
    }

    # "urls" is the tab-separated link list from the dataset download page
    # (columns: file_name, cdn_link).
    df = pd.read_csv("urls", sep="\t")
    for fname, url in zip(df["file_name"], df["cdn_link"]):
        res = download_file(url=url, headers=headers, output_filename=fname)
        if res:
            # Replace YOUR_BUCKET with the name of your S3 bucket.
            upload_file_to_s3(file_path=fname, bucket=YOUR_BUCKET, object_key=f"SA-1B/{fname}")
            os.remove(fname)  # free local disk space once the archive is in S3
Courtesy of GPT-4.
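Before kicking off a multi-day run, it may be worth a quick sanity check that the link list actually has the two columns the script reads. A minimal sketch, assuming the same "urls" file name as the script above:

import pandas as pd

df = pd.read_csv("urls", sep="\t")
# The download loop above reads exactly these two columns.
missing = {"file_name", "cdn_link"} - set(df.columns)
assert not missing, f"missing columns: {missing}"
print(f"{len(df)} archives listed")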
Thanks! Do you mind sharing your prompt? I want to customize it.
Any idea how big the data is?
The total size of all tar archives is 11,298,949,953,923 bytes (about 11.3 TB).
You can query the size using HTTP HEAD requests without downloading the file:
import requests
# File with URLs from https://ai.facebook.com/datasets/segment-anything-downloads/
urls = "An8MNcSV8eixKBYJ2kyw6sfPh-J9U4tH2BV7uPzibNa0pu4uHi6fyXdlbADVO4nfvsWpTwR8B0usCARHTz33cBQNrC0kWZsD1MbBWjw.txt"
with open(urls) as f:
    lines = f.read().splitlines()[1:]  # skip the header line

total_size = 0
for line in lines:
    filename, url = line.split()
    # Make a HEAD request to get the file size without downloading it
    with requests.head(url) as r:
        size = int(r.headers['Content-Length'])
    total_size += size
    print(f"{filename} size: {size} total: {total_size}")
I have less than 40 GB downloaded so far. This will probably take a month or two.
It would be nice to download in parallel. I will try splitting the URL list and running more instances.
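A minimal thread-pool sketch along those lines, assuming the download_file helper and headers dict from the script above are in scope and that "urls" is the same tab-separated link list (the worker count is only illustrative):

from concurrent.futures import ThreadPoolExecutor, as_completed

import pandas as pd

df = pd.read_csv("urls", sep="\t")

def fetch(fname, url):
    # download_file and headers are the helper and dict defined in the script above
    return fname, download_file(url=url, headers=headers, output_filename=fname)

# Four concurrent downloads; tune max_workers to your bandwidth
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(fetch, f, u) for f, u in zip(df["file_name"], df["cdn_link"])]
    for fut in as_completed(futures):
        fname, ok = fut.result()
        print(fname, "done" if ok else "failed")

Note that the per-file tqdm bars will interleave when several downloads run at once.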
I have another way, with a one-line shell command:
cat ~/fb-sam.txt | parallel -j 5 --colsep $'\t' wget -nc -c {2} -O {1}
fb-sam.txt is a copy of the dataset's .txt link list with the header line removed (for example, via tail -n +2).
Has anyone downloaded the full dataset? Can you share the checksum for validation?
md5sum *.tar > checklist.chk
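Once someone shares their checklist, md5sum -c checklist.chk will verify local files against it. A pure-Python alternative, as a sketch that assumes the standard md5sum output format (hex digest, whitespace, filename):

import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    # Stream in 1 MiB chunks so multi-GB tar files don't need to fit in memory
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

with open("checklist.chk") as f:
    for line in f:
        expected, filename = line.split()
        status = "OK" if md5_of_file(filename) == expected else "MISMATCH"
        print(f"{filename}: {status}")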
I'm doing it. I will come back and paste my result when it's done. Could you please share your checksum with me?
Update: checklist.txt
@ElectronicElephant Sure, my download will finish in 30 min, and then I will calculate the checksum.
Here is the checklist.txt, which is consistent with the version provided by @ElectronicElephant.
@VicaYang Hi, I have checked the list and found that your tar files 7 and 17 are not correct (I re-downloaded these two files and checked their MD5). The others are good.
@ElectronicElephant Yes, I redownloaded the two files and got the same results. Thank you so much. I also updated my checklist.txt