ffhq-dataset

Using pydrive with user credentials for authenticated download

Open jeremyfix opened this issue 3 years ago • 2 comments

Unfortunately, your code performs an anonymous download. I tried on several consecutive days and always got an exceeded-quota error, so I was unable to download the dataset.

This pull request, which uses code adapted from the FFHQ-Aging repo, uses user credentials to download the dataset.

The only requirement is to follow the pydrive quickstart to get the client_secrets.json file and place it in the same directory as download_ffhq.py. You can then request pydrive Google authentication by appending the --pydrive command-line option.

So, for example, to download the 1024x1024 images, you simply run:

python3 download_ffhq.py -i --pydrive
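For readers who have not used pydrive, here is a minimal sketch of what an authenticated download of a single file can look like. It is not the code from this PR: the function name `download_with_pydrive` and its `use_cmd_auth` parameter are made up for illustration, and the sketch assumes pydrive is installed and client_secrets.json is in the working directory.

```python
def download_with_pydrive(file_id, dest_path, use_cmd_auth=False):
    """Download one Google Drive file using the user's own credentials.

    Hypothetical helper for illustration, not the PR's implementation.
    """
    # Imported lazily so this module still loads where pydrive is missing.
    from pydrive.auth import GoogleAuth
    from pydrive.drive import GoogleDrive

    # By default GoogleAuth looks for client_secrets.json in the working directory.
    gauth = GoogleAuth()
    if use_cmd_auth:
        # Prints a URL and asks for the verification code on the terminal.
        gauth.CommandLineAuth()
    else:
        # Opens a browser and runs a temporary local web server for the OAuth flow.
        gauth.LocalWebserverAuth()

    drive = GoogleDrive(gauth)
    drive_file = drive.CreateFile({"id": file_id})
    drive_file.GetContentFile(dest_path)  # writes the file content to dest_path
    return dest_path
```

Since the request is made as the authenticated user rather than anonymously, it should not hit the anonymous quota error described above.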

The code makes several attempts to download each file. Without that retry logic, which is inspired by yours, I got httplib2.error.ServerNotFoundError: Unable to find the server at www.googleapis.com. Apparently, when the download is retried, the exception is not raised again.
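The retry idea can be sketched roughly as follows. This is a generic illustration, not the exact loop in the PR: `download_with_retries` and its parameters are hypothetical names, and in the real script the exception to retry on would be `httplib2.error.ServerNotFoundError`.

```python
import time


def download_with_retries(fetch, attempts=3, delay_seconds=0.0, retry_on=(Exception,)):
    """Call fetch() until it succeeds or the attempts run out.

    fetch is any zero-argument callable that performs one download attempt.
    The last error is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on as err:
            last_error = err
            if attempt < attempts:
                time.sleep(delay_seconds)  # short pause before the next attempt
    raise last_error
```

A call such as `download_with_retries(lambda: drive_file.GetContentFile(path), attempts=3)` would then absorb a transient failure on the first attempt.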

I only tested downloading the images (the command line above), but since the other downloads go through the same download_files function, I hope they work as well.

jeremyfix, Apr 21 '21 17:04

Note that, for some reason, after some time (a few hours), the script may try to reauthenticate and fail. Relaunching the script resumes the download.
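One way a relaunch can pick up where it left off is to skip files that are already on disk with the expected size. The sketch below shows that idea only; whether download_ffhq.py resumes in exactly this way is not confirmed here, and the `file_path` / `file_size` keys are assumed names for the per-file metadata.

```python
import os


def files_still_needed(file_specs):
    """Return the specs whose target file is missing or has the wrong size.

    Each spec is a dict with the assumed keys 'file_path' and 'file_size'.
    """
    pending = []
    for spec in file_specs:
        path = spec["file_path"]
        if os.path.isfile(path) and os.path.getsize(path) == spec["file_size"]:
            continue  # already downloaded, nothing to do
        pending.append(spec)
    return pending
```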

I successfully downloaded the 90 GB of the 1024x1024 images this way.

jeremyfix, Apr 22 '21 17:04

This was very helpful for me. I was able to download the 89GB of 1024x1024 images with a restart after a few hours. As an additional step, I had to replace

# Google Drive virus checker nag.
links = [html.unescape(link) for link in data_str.split('"') if 'export=download' in link]
if len(links) == 1:
    if attempts_left:
        file_url = requests.compat.urljoin(file_url, links[0])
        continue

with

# Google Drive virus checker nag.
# Needs `import re` at the top of the file; replace API_KEY with your own key.
file_id = re.findall(r'uc\?id=(.*)&amp', data_str)
if len(file_id) == 1:
    file_id = file_id[0]
    if attempts_left:
        file_url = 'https://www.googleapis.com/drive/v3/files/{}/?key=API_KEY&alt=media'.format(file_id)
        continue

This is because the virus checker page changed, so the code for handling it doesn't work anymore. To make this work, I had to follow the instructions in the pydrive quickstart link given above (i.e., use this PR and get a client_secrets.json from the Drive API). The new virus checker workaround uses an API key that you can create in a GCP API project, similar to how you get the client_secrets.json file. You can also use the OAuth key.
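To make the workaround easier to try in isolation, here is a standalone sketch of the same idea: pull the file ID out of the virus-checker page and build the Drive API URL from it. The function name `build_media_url` and the sample HTML are made up for illustration, and the pattern is a slightly stricter variant of the one above (it stops at the first `&` or quote instead of matching greedily).

```python
import re


def build_media_url(data_str, api_key):
    """Return a Drive API download URL for the file ID found in data_str.

    Returns None unless exactly one 'uc?id=...&amp' link is present.
    """
    # Stricter than the original pattern: the ID may not contain '&' or '"'.
    file_ids = re.findall(r'uc\?id=([^&"]+)&amp', data_str)
    if len(file_ids) != 1:
        return None
    return "https://www.googleapis.com/drive/v3/files/{}/?key={}&alt=media".format(
        file_ids[0], api_key
    )


# Hypothetical fragment of a virus-checker page, for illustration only.
sample_page = '<a href="/uc?id=FILE123&amp;export=download">Download anyway</a>'
print(build_media_url(sample_page, "API_KEY"))
```

The URL keeps the same shape as in the snippet above, with the key passed in rather than hard-coded.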

I had to run the download script with the --cmd_auth flag and use a "Desktop" instead of "Web application" setting in the Drive API to make it work. Here is a screenshot of my Drive API page. [Screenshot from 2022-04-12 18-39-19]

mmazeika, Apr 12 '22 23:04