
Downloading the full dataset

huey2531 opened this issue 2 years ago · 10 comments

has anyone tried downloading the full dataset? https://ai.facebook.com/datasets/segment-anything-downloads/ Any idea how big the data is?

I started yesterday and have only completed about 10 files so far... File sizes are not all the same, so there's no way of estimating the total size.

huey2531 avatar Apr 06 '23 19:04 huey2531

import os
import requests
from tqdm import tqdm
import pandas as pd
import boto3

def download_file(url, headers, output_filename):
    response = requests.get(url, headers=headers, stream=True)
    response.raise_for_status()
    total_size = int(response.headers.get('content-length', 0))
    progress_bar = tqdm(total=total_size, unit='iB', unit_scale=True)
    try:
        with open(output_filename, 'wb') as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)
                progress_bar.update(len(chunk))
        return True
    except Exception as e:
        print(e)
        return None
    finally:
        progress_bar.close()  # always close the progress bar, even if the download fails


def upload_file_to_s3(file_path, bucket, object_key):
    s3 = boto3.client('s3')
    with open(file_path, 'rb') as file:
        s3.upload_fileobj(
            file,
            bucket,
            object_key,
            ExtraArgs={
                'StorageClass': 'STANDARD_IA'  # S3 Standard-Infrequent Access is "STANDARD_IA" in boto3
            }
        )


if __name__ == "__main__":

    headers = {
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "en-US,en;q=0.7",
        "cache-control": "no-cache",
        "pragma": "no-cache",
        "sec-fetch-dest": "document",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "none",
        "sec-fetch-user": "?1",
        "sec-gpc": "1",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
    }

    YOUR_BUCKET = "your-bucket-name"  # placeholder: replace with your own S3 bucket
    df = pd.read_csv("urls", sep="\t")  # tab-separated list with file_name and cdn_link columns
    for fname, url in zip(df["file_name"], df["cdn_link"]):
        res = download_file(url=url, headers=headers, output_filename=fname)
        if res:
            upload_file_to_s3(file_path=fname, bucket=YOUR_BUCKET, object_key=f"SA-1B/{fname}")
            os.remove(fname)

courtesy of GPT4

we-d3vs avatar Apr 06 '23 19:04 we-d3vs

Thanks! Do you mind sharing your prompt? I want to customize it.

huey2531 avatar Apr 06 '23 19:04 huey2531

Any idea how big the data is?

The total size of all tar archives is 11,298,949,953,923 bytes (about 11.3 TB).

You can query the size using HTTP HEAD requests without downloading the file:

import requests

# File with URLs from https://ai.facebook.com/datasets/segment-anything-downloads/
urls = "An8MNcSV8eixKBYJ2kyw6sfPh-J9U4tH2BV7uPzibNa0pu4uHi6fyXdlbADVO4nfvsWpTwR8B0usCARHTz33cBQNrC0kWZsD1MbBWjw.txt"

with open(urls) as f:
    lines = f.read().splitlines()[1:]  # skip the header line

total_size = 0

for line in lines:
    filename, url = line.split()

    # Make a HEAD request to get the file size
    with requests.head(url) as r:
        size = int(r.headers['Content-Length'])

    total_size += size

    print(f"{filename} size: {size} total: {total_size}")

99991 avatar Apr 06 '23 20:04 99991

I have less than 40 GB downloaded so far. This will probably take a month or two.

It would be nice to download in parallel. I will try splitting the URL list and running more instances.
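
One way to stay in Python instead of splitting the list by hand is a thread pool. Below is a minimal sketch using concurrent.futures; it assumes the same tab-separated "urls" file (file_name / cdn_link columns) as the script above, and the worker count is an arbitrary choice:

import requests
import pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch(fname, url):
    # Stream one archive straight to disk; retry/resume handling is left out for brevity.
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(fname, "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)
    return fname


df = pd.read_csv("urls", sep="\t")  # same tab-separated file_name / cdn_link list as above
with ThreadPoolExecutor(max_workers=5) as pool:  # five concurrent downloads (arbitrary)
    futures = [pool.submit(fetch, name, link)
               for name, link in zip(df["file_name"], df["cdn_link"])]
    for future in as_completed(futures):
        print(f"finished {future.result()}")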

huey2531 avatar Apr 06 '23 22:04 huey2531

I have another way with one line shell code:

cat ~/fb-sam.txt | parallel -j 5 --colsep $'\t' wget -nc -c {2} -O {1}

fb-sam.txt is a copy of the dataset's tab-separated URL list with the header line removed; parallel runs five wget downloads at a time.

drunkpig avatar Apr 07 '23 01:04 drunkpig

Has anyone downloaded the full dataset? Can you share the checksum for validation? md5sum *.tar > checklist.chk

VicaYang avatar Apr 15 '23 17:04 VicaYang

Has anyone downloaded the full dataset? Can you share the checksum for validation? md5sum *.tar > checklist.chk

I'm doing it. I will come back and paste my result when it's done. Could you please share your checksum with me?

Update: checklist.txt

ElectronicElephant avatar Apr 16 '23 06:04 ElectronicElephant

@ElectronicElephant Sure, my download will finish in 30 min, and then I will calculate the checksum.

Here is the checklist.txt, which is consistent with the version provided by @ElectronicElephant.

VicaYang avatar Apr 16 '23 06:04 VicaYang

@VicaYang Hi, I have checked the list and found that your tar files 7 and 17 are not correct (I re-downloaded these two files and checked their MD5 sums). The others are good.

ElectronicElephant avatar Apr 16 '23 16:04 ElectronicElephant

@ElectronicElephant Yes, I re-downloaded the two files and got the same results. Thank you so much. I have also updated my checklist.txt
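
For anyone repeating this check, here is a minimal sketch of comparing local tar files against a shared checklist with hashlib; it assumes the checklist follows the usual md5sum layout of one "<md5>  <filename>" pair per line:

import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    # Hash in chunks so multi-gigabyte tar files never have to fit in memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


# checklist.txt in md5sum layout: one "<md5>  <filename>" pair per line
for line in Path("checklist.txt").read_text().splitlines():
    expected, name = line.split()
    if not Path(name).exists():
        print(f"MISSING   {name}")
    elif md5_of(name) != expected:
        print(f"MISMATCH  {name} (re-download this archive)")
    else:
        print(f"OK        {name}")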

VicaYang avatar Apr 16 '23 18:04 VicaYang