Downloading the full dataset
Has anyone tried downloading the full dataset? https://ai.facebook.com/datasets/segment-anything-downloads/ Any idea how big the data is?
I started yesterday and have only completed about 10 files so far... File sizes are not the same, so there's no way of estimating the total size.
import os
import requests
from tqdm import tqdm
import pandas as pd
import boto3
def download_file(url, headers, output_filename):
    response = requests.get(url, headers=headers, stream=True)
    response.raise_for_status()
    total_size = int(response.headers.get('content-length', 0))
    progress_bar = tqdm(total=total_size, unit='iB', unit_scale=True)
    try:
        with open(output_filename, 'wb') as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)
                progress_bar.update(len(chunk))
        return True
    except Exception as e:
        print(e)
        return None
    finally:
        progress_bar.close()


def upload_file_to_s3(file_path, bucket, object_key):
    s3 = boto3.client('s3')
    with open(file_path, 'rb') as file:
        s3.upload_fileobj(
            file,
            bucket,
            object_key,
            ExtraArgs={
                'StorageClass': 'STANDARD_IA'  # S3 Standard-Infrequent Access is called "STANDARD_IA" in boto3
            }
        )


if __name__ == "__main__":
    headers = {
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "en-US,en;q=0.7",
        "cache-control": "no-cache",
        "pragma": "no-cache",
        "sec-fetch-dest": "document",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "none",
        "sec-fetch-user": "?1",
        "sec-gpc": "1",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
    }

    # "urls" is the tab-separated link list from the dataset download page
    # (columns: file_name, cdn_link).
    df = pd.read_csv("urls", sep="\t")
    for fname, url in zip(df["file_name"], df["cdn_link"]):
        res = download_file(url=url, headers=headers, output_filename=fname)
        if res:
            # Replace YOUR_BUCKET with the name of your S3 bucket.
            upload_file_to_s3(file_path=fname, bucket=YOUR_BUCKET, object_key=f"SA-1B/{fname}")
            os.remove(fname)  # free local disk space once the archive is in S3
Courtesy of GPT-4.
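Before kicking off a multi-day run, it may be worth a quick sanity check that the link list actually has the two columns the script reads. A minimal sketch, assuming the same "urls" file name as the script above:

import pandas as pd

df = pd.read_csv("urls", sep="\t")
# The download loop above reads exactly these two columns.
missing = {"file_name", "cdn_link"} - set(df.columns)
assert not missing, f"missing columns: {missing}"
print(f"{len(df)} archives listed")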
Thanks! Do you mind sharing your prompt? I want to customize it.
Any idea how big the data is?
The total size of all tar archives is 11,298,949,953,923 bytes (about 11.3 TB).
You can query the size using HTTP HEAD requests without downloading the file:
import requests
# File with URLs from https://ai.facebook.com/datasets/segment-anything-downloads/
urls = "An8MNcSV8eixKBYJ2kyw6sfPh-J9U4tH2BV7uPzibNa0pu4uHi6fyXdlbADVO4nfvsWpTwR8B0usCARHTz33cBQNrC0kWZsD1MbBWjw.txt"
with open(urls) as f:
    lines = f.read().splitlines()[1:]  # skip the header line

total_size = 0
for line in lines:
    filename, url = line.split()
    # Make a HEAD request to get the file size without downloading it
    with requests.head(url) as r:
        size = int(r.headers['Content-Length'])
    total_size += size
    print(f"{filename} size: {size} total: {total_size}")
I have less than 40 GB downloaded so far. This will probably take a month or two.
It would be nice to download in parallel. I will try splitting the URL list and running more instances.
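A minimal thread-pool sketch along those lines, assuming the download_file helper and headers dict from the script above are in scope and that "urls" is the same tab-separated link list (the worker count is only illustrative):

from concurrent.futures import ThreadPoolExecutor, as_completed

import pandas as pd

df = pd.read_csv("urls", sep="\t")

def fetch(fname, url):
    # download_file and headers are the helper and dict defined in the script above
    return fname, download_file(url=url, headers=headers, output_filename=fname)

# Four concurrent downloads; tune max_workers to your bandwidth
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(fetch, f, u) for f, u in zip(df["file_name"], df["cdn_link"])]
    for fut in as_completed(futures):
        fname, ok = fut.result()
        print(fname, "done" if ok else "failed")

Note that the per-file tqdm bars will interleave when several downloads run at once.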
I have another way, with a one-line shell command:
cat ~/fb-sam.txt | parallel -j 5 --colsep $'\t' wget -nc -c {2} -O {1}
fb-sam.txt is a copy of the dataset's .txt link list with the header line removed (for example, via tail -n +2).
Has anyone downloaded the full dataset? Can you share the checksum for validation?
md5sum *.tar > checklist.chk
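Once someone shares their checklist, md5sum -c checklist.chk will verify local files against it. A pure-Python alternative, as a sketch that assumes the standard md5sum output format (hex digest, whitespace, filename):

import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    # Stream in 1 MiB chunks so multi-GB tar files don't need to fit in memory
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

with open("checklist.chk") as f:
    for line in f:
        expected, filename = line.split()
        status = "OK" if md5_of_file(filename) == expected else "MISMATCH"
        print(f"{filename}: {status}")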
I'm doing it. I will come back and paste my result when it's done. Could you please share your checksum with me?
Update: checklist.txt
@ElectronicElephant Sure, my download will finish in 30 min, and then I will calculate the checksum.
Here is the checklist.txt, which is consistent with the version provided by @ElectronicElephant.
@VicaYang Hi, I have checked the list and found that your tar files 7 and 17 are not correct (I re-downloaded these two files and checked their MD5). The others are good.
@ElectronicElephant Yes, I redownloaded the two files and got the same results. Thank you so much. I also updated my checklist.txt