
CDN link expired

[Open] crysoberil opened this issue 4 years ago • 11 comments

Thanks for releasing this useful dataset. I was trying to download the data following the CDN links found in the text file, but for the URLs I get a "URL signature expired" error from any browser and any machine I try. How do I solve this?

crysoberil avatar Sep 05 '21 15:09 crysoberil

+1

There is also a ZeroDivisionError in download_dataset.py (line 138) if the download fails like this.

pwais avatar Sep 05 '21 22:09 pwais

Since I have not heard back from anyone, I wrote a script that uses selenium to populate the downloads file from the webpage.

import argparse
import time
import requests
import selenium.webdriver.firefox.options
from selenium import webdriver


CO3D_WEBPAGE_URL = "https://ai.facebook.com/datasets/co3d-downloads/"


def fetch_url_by_span_text(driver, query_text):
    # Find the <span> whose text contains query_text and return the
    # href of its enclosing <a> element.
    text_elem = driver.find_element_by_xpath("//span[contains(text(),'{}')]".format(query_text))
    a_elm = text_elem.find_element_by_xpath("..")
    return a_elm.get_attribute("href")


def get_category_ids(driver):
    # The "Download all links" anchor points at the tab-separated
    # file list; its first column holds the per-category file names.
    cur_list_url = fetch_url_by_span_text(driver, "Download all links")
    response = requests.get(cur_list_url)
    lines = [ln for ln in response.text.split('\n')[1:] if ln.strip()]  # drop header and blank lines
    category_ids = [ln.split()[0].strip() for ln in lines]
    return category_ids


def get_co3d_urls(page_path):
    options = selenium.webdriver.firefox.options.Options()
    options.headless = True
    firefox_profile = webdriver.FirefoxProfile()
    firefox_profile.set_preference("browser.privatebrowsing.autostart", True)
    with webdriver.Firefox(options=options, firefox_profile=firefox_profile) as driver:
        driver.get(page_path)
        time.sleep(1)  # Some delay to let the webpage populate
        category_ids = get_category_ids(driver)
        item_path_pairs = []
        for category_id in category_ids:
            url = fetch_url_by_span_text(driver, category_id)
            item_path_pairs.append((category_id, url))
        return item_path_pairs


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--download_files_list", type=str, required=False, help="Where the downloadable list will be generated", default="./downloadpaths.txt")
    args = parser.parse_args()
    co3d_item_urls = get_co3d_urls(CO3D_WEBPAGE_URL)
    with open(args.download_files_list, 'w') as f_out:
        f_out.write("file_name\tcdn_link\n")
        for i, (item, url) in enumerate(co3d_item_urls):
            f_out.write(item)
            f_out.write('\t')
            f_out.write(url)
            if i < len(co3d_item_urls) - 1:
                f_out.write('\n')
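
Running it requires selenium and Firefox's geckodriver on the PATH. Assuming you save the script as fetch_co3d_links.py (the name is arbitrary), the invocation is:

python fetch_co3d_links.py --download_files_list downloadpaths.txt

The output follows the same tab-separated file_name / cdn_link format as the official list, so it should work as a drop-in replacement for the expired file.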
                

crysoberil avatar Sep 07 '21 17:09 crysoberil

> Thanks for releasing this useful dataset. I was trying to download the data following the CDN links found in the text file, but for the URLs I get a "URL signature expired" error from any browser and any machine I try. How do I solve this?

Hi, thanks for the interest in our dataset and sorry for being late with the response due to some of us being on summer holiday.

The CDN links expire once every few days and the link text file has to be re-downloaded. Make sure to download a fresh list of links whenever you start the download. This should fix the problem.

Indeed the solution using selenium seems to do the latter automatically. Thanks for the code!
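
If you want a quick sanity check that a freshly downloaded link file has not already gone stale before kicking off a long download, something along these lines could work (a rough sketch; it assumes the CDN returns an error status such as 403 for expired signatures, and that the file uses the tab-separated format of the official list):

import requests

def first_link_is_fresh(links_file):
    # Probe the first data row of the links file without
    # downloading the payload.
    with open(links_file) as f:
        next(f)  # skip the header line
        _, url = next(f).strip().split('\t')
    response = requests.get(url, stream=True)
    response.close()
    return response.ok

A False return value means the list should be re-downloaded before starting.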

davnov134 avatar Sep 10 '21 10:09 davnov134

Hi, I tried to use the script to download the dataset with the CDN links copied from your website, but also hit the ZeroDivisionError (screenshot of the traceback omitted). Do you know what could be the reason for this?

stalkerrush avatar Sep 10 '21 15:09 stalkerrush

> Thanks for releasing this useful dataset. I was trying to download the data following the CDN links found in the text file, but for the URLs I get a "URL signature expired" error from any browser and any machine I try. How do I solve this?
>
> Hi, thanks for the interest in our dataset and sorry for being late with the response due to some of us being on summer holiday.
>
> The CDN links expire once every few days and the link text file has to be re-downloaded. Make sure to download a fresh list of links whenever you start the download. This should fix the problem.
>
> Indeed the solution using selenium seems to do the latter automatically. Thanks for the code!

I just downloaded the text file again and retried the URLs, and I still get the "URL signature expired" error: the URLs within the text file are already expired, and the ZeroDivisionError in the Python script happens because of this. That's why I wrote the script above. It requires selenium to work, but it generates a fresh text file that should let one download the dataset without hitting these errors.

crysoberil avatar Sep 10 '21 15:09 crysoberil

@davnov134 Happy Summer Holiday! The bugs here are:

  1. The website generating the 51-line text links file appears to be broken: the URLs in it are all expired. Perhaps it is serving a static file and/or there is some caching problem. Multiple people have reproduced this here.
  2. The download script has a ZeroDivisionError bug that triggers when one or more of the files can't be downloaded. This also has several reproductions.

Edit: huh, it seems the manual download links may have also expired now (i.e. the 50 links on https://ai.facebook.com/datasets/co3d-downloads/). I haven't seen that happen before.

pwais avatar Sep 10 '21 16:09 pwais

> @davnov134 Happy Summer Holiday! The bugs here are: [...] Edit: huh, it seems the manual download links may have also expired now (i.e. the 50 links on https://ai.facebook.com/datasets/co3d-downloads/). I haven't seen that happen before.

@pwais The links on the page still work for me. Perhaps the problem on your end is due to webpage caching in your browser? Do the links work if you load the page in incognito?

crysoberil avatar Sep 10 '21 17:09 crysoberil

Hmm, looks like I had some connection issues. The download script still doesn't work for me, though. The selenium script does help a lot!!

pwais avatar Sep 10 '21 17:09 pwais

@pwais , I just downloaded a fresh link-list file and launched the download without issues. If you make sure that you are using a fresh set of links (i.e. do a no-cache refresh of the link page; in Chrome on Mac this is Cmd+Shift+R), do you still encounter the zero-division error?

The selenium solution is very nice, but it introduces too big a dependency to be supported officially, so I'd rather make sure the problem cannot be solved in a simpler manner.

davnov134 avatar Sep 11 '21 12:09 davnov134

Agreed that selenium is a heavy requirement, but the links seem to expire from time to time nonetheless. I was able to download using wget --continue, which was handy because the downloads did fail occasionally. The ZeroDivisionError remains: when a download returns zero bytes (i.e. it fails), the cited exception hides the real issue of the link being broken. The division in question only feeds the progress bar, and the progress bar not working is irrelevant if the file can't be downloaded at all. If the response is empty, perhaps just raise a ValueError?
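
Something along these lines, for instance (a hypothetical sketch of the guard, not the actual download_dataset.py internals, which I haven't traced):

import requests

def download_with_progress(url, out_path, chunk_size=1 << 20):
    response = requests.get(url, stream=True)
    response.raise_for_status()
    total = int(response.headers.get("Content-Length", 0))
    if total == 0:
        # A zero-byte response almost always means the CDN signature
        # expired; fail loudly instead of dividing by zero when
        # computing progress below.
        raise ValueError("Empty response for {}; the CDN link has likely expired.".format(url))
    downloaded = 0
    with open(out_path, "wb") as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            f.write(chunk)
            downloaded += len(chunk)
            print("\rdownloaded {:.1f}%".format(100.0 * downloaded / total), end="")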

@davnov134 The paper says "The CO3D collection effort still continues at a steady pace of ∼500 videos per week which we plan to release in the near future." Do you intend to version the dataset and/or provide the new videos? I think what most people would want here is an experience similar to rsync or aws s3 sync: partial data is not re-downloaded, and new data can be fetched easily too. (Note that the existing download script always starts from scratch, which doesn't scale well for a dataset this size; I had to resume multiple times due to network issues, and I never saw better than 50 MByte/sec download.) awscli is a healthy multi-platform client, but I can understand why Facebook might not want to depend on that and/or publish via an S3-compatible server-side solution.
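
For the resume part alone, something like the following would already help (a rough sketch mimicking wget --continue via an HTTP Range request; it assumes the CDN honors Range headers, which I haven't verified):

import os
import requests

def resume_download(url, out_path, chunk_size=1 << 20):
    # Continue from wherever a previous partial download stopped.
    start = os.path.getsize(out_path) if os.path.exists(out_path) else 0
    headers = {"Range": "bytes={}-".format(start)} if start else {}
    with requests.get(url, headers=headers, stream=True) as response:
        if response.status_code == 416:
            return  # range not satisfiable: the file is already complete
        response.raise_for_status()
        # 206 means the server honored the range; anything else restarts.
        mode = "ab" if response.status_code == 206 else "wb"
        with open(out_path, mode) as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                f.write(chunk)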

At any rate, thanks for this amazing dataset! I wish there were a more straightforward way of distributing things like this, COCO, ImageNet, etc.

pwais avatar Sep 11 '21 21:09 pwais

@davnov134 Thanks for the recent fix. I still can't download the whole dataset though :( The download eventually times out, and the script doesn't allow resuming (it tends to blank out everything downloaded so far). I have fiber internet, so I don't think the problem is that my connection is too slow.

  1. Will the dataset be available via one big download (e.g. the way ImageNet was), or via BitTorrent or something? The current distribution method doesn't seem to work. Once upon a time Facebook had Wirehog (https://en.wikipedia.org/wiki/Wirehog); maybe they can revive that?

  2. Again, the paper says "The CO3D collection effort still continues at a steady pace of ∼500 videos per week which we plan to release in the near future." Do you intend to version the dataset and/or provide the new videos?

pwais avatar Nov 09 '21 22:11 pwais