
CelebA dataset cannot be downloaded

Open Isaac-Hirsch opened this issue 11 months ago • 3 comments

🐛 Describe the bug

36 CIFAR10 = datasets.CelebA(root="Datasets", download=True)

File /apps/python/3.10/3.10.9/lib/python3.10/site-packages/torchvision/datasets/celeba.py:80, in CelebA.__init__(self, root, split, target_type, transform, target_transform, download)
     77     raise RuntimeError("target_transform is specified but target_type is empty")
     79 if download:
---> 80     self.download()
     82 if not self._check_integrity():
     83     raise RuntimeError("Dataset not found or corrupted. You can use download=True to download it")

File /apps/python/3.10/3.10.9/lib/python3.10/site-packages/torchvision/datasets/celeba.py:150, in CelebA.download(self)
    147     return
    149 for (file_id, md5, filename) in self.file_list:
--> 150     download_file_from_google_drive(file_id, os.path.join(self.root, self.base_folder), filename, md5)
    152 extract_archive(os.path.join(self.root, self.base_folder, "img_align_celeba.zip"))

File /apps/python/3.10/3.10.9/lib/python3.10/site-packages/torchvision/datasets/utils.py:280, in download_file_from_google_drive(file_id, root, filename, md5)
    272     warnings.warn(
    273         f"We detected some HTML elements in the downloaded file. "
    274         f"This most likely means that the download triggered an unhandled API response by GDrive. "
    275         f"Please report this to torchvision at https://github.com/pytorch/vision/issues including "
    276         f"the response:\n\n{text}"
    277     )
    279 if md5 and not check_md5(fpath, md5):
--> 280     raise RuntimeError(
    281         f"The MD5 checksum of the download file {fpath} does not match the one on record."
    282         f"Please delete the file and try again. "
    283         f"If the issue persists, please report this to torchvision at https://github.com/pytorch/vision/issues."
    284     )

RuntimeError: The MD5 checksum of the download file Datasets/celeba/img_align_celeba.zip does not match the one on record.Please delete the file and try again. If the issue persists, please report this to torchvision at https://github.com/pytorch/vision/issues.
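
The traceback context above shows torchvision warning about "HTML elements in the downloaded file", so one thing worth checking before retrying is whether the file on disk is a real archive or an HTML page saved under the `.zip` name. Here is a minimal sketch using only the standard library; `describe_download` is a hypothetical helper name, not part of torchvision, and the path in the usage line is simply the one from the error message.

```python
import zipfile
from pathlib import Path


def describe_download(path):
    """Classify a downloaded file as 'zip', 'html', 'missing', or 'unknown'.

    Hypothetical helper for diagnosis only; it does not verify the checksum.
    """
    p = Path(path)
    if not p.is_file():
        return "missing"
    # is_zipfile looks for the ZIP end-of-central-directory record.
    if zipfile.is_zipfile(str(p)):
        return "zip"
    # Read only the first bytes so a multi-gigabyte file is not loaded whole.
    with p.open("rb") as fh:
        head = fh.read(512).lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    return "unknown"


if __name__ == "__main__":
    # Path taken from the error message in this issue.
    print(describe_download("Datasets/celeba/img_align_celeba.zip"))
```

If this reports `html`, the checksum mismatch is probably caused by the response from the host rather than by a changed dataset, although that is an assumption and not something this thread confirms.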

Versions

Collecting environment information...
PyTorch version: 1.12.1+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Red Hat Enterprise Linux release 8.10 (Ootpa) (x86_64)
GCC version: (GCC) 8.5.0 20210514 (Red Hat 8.5.0-22)
Clang version: 17.0.6 (Red Hat 17.0.6-1.module+el8.10.0+20808+e12784c0)
CMake version: version 3.26.5
Libc version: glibc-2.28

Python version: 3.10.9 (main, Feb 22 2023, 19:43:33) [GCC 8.5.0 20210514 (Red Hat 8.5.0-16)] (64-bit runtime)
Python platform: Linux-4.18.0-553.8.1.el8_10.x86_64-x86_64-with-glibc2.28
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 25
Model: 17
Model name: AMD EPYC 9334 32-Core Processor
Stepping: 1
CPU MHz: 3894.037
CPU max MHz: 3910.2529
CPU min MHz: 1500.0000
BogoMIPS: 5391.75
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 32768K
NUMA node0 CPU(s): 0-31
NUMA node1 CPU(s): 32-63
Flags: (removed)

Versions of relevant libraries:
[pip3] flake8==6.0.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.2
[pip3] nvidia-cublas-cu11==11.10.3.66
[pip3] nvidia-cuda-nvrtc-cu11==11.7.99
[pip3] nvidia-cuda-runtime-cu11==11.7.99
[pip3] nvidia-cudnn-cu11==8.5.0.96
[pip3] torch==1.12.1+cu113
[pip3] torchaudio==0.12.1+cu113
[pip3] torchvision==0.13.1+cu113
[conda] Could not collect

Isaac-Hirsch, Jan 23 '25 04:01

I also encountered the same issue. Does anyone know when it will be fixed?

fjxmlzn, Feb 11 '25 09:02

A workaround can be:

  1. manually downloading the files from the following links (the Google Drive links from the official source aren't working for me for some reason):
cd <data_dir>
mkdir celeba && cd celeba
wget https://cseweb.ucsd.edu/~weijian/static/datasets/celeba/img_align_celeba.zip
unzip img_align_celeba.zip
rm img_align_celeba.zip
wget https://raw.githubusercontent.com/KaiserW/bald-recognition/refs/heads/master/dataset/celeba/list_eval_partition.txt
wget https://raw.githubusercontent.com/KaiserW/bald-recognition/refs/heads/master/dataset/celeba/list_attr_celeba.txt
wget https://raw.githubusercontent.com/KaiserW/bald-recognition/refs/heads/master/dataset/celeba/identity_CelebA.txt
  2. overriding the _check_integrity method of CelebA (or, alternatively, updating the checksums for these files in the child class, but only if you are sure of their source):
import os
from torchvision import datasets

class WorkingCelebA(datasets.CelebA):
    # copied from https://pytorch.org/vision/main/_modules/torchvision/datasets/celeba.html#CelebA
    def _check_integrity(self) -> bool:
        for (_, md5, filename) in self.file_list:
            fpath = os.path.join(self.root, self.base_folder, filename)
            _, ext = os.path.splitext(filename)
            # Allow original archive to be deleted (zip and 7z)
            # Only need the extracted images
            if ext in [".zip", ".7z"]:
                continue
            if not datasets.utils.check_integrity(fpath, md5):
                # only printing, instead of returning False
                print("Failed to check integrity of", fpath)
                # return False

        # Should check a hash of the images
        return os.path.isdir(os.path.join(self.root, self.base_folder, "img_align_celeba"))
  3. constructing the dataset as usual with this class.
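
Before constructing the dataset, it may help to confirm that the folder holds everything the class expects. The wget commands above fetch three annotation files, while the download log later in this thread lists two more (list_bbox_celeba.txt and list_landmarks_align_celeba.txt). Whether your torchvision version reads those two at construction time is an assumption to verify, so the sketch below only reports what is absent. `missing_celeba_files` is a hypothetical helper, and `"celeba"` is assumed as the base folder because that is the directory name used in the paths shown in this issue.

```python
import os

# File names as they appear in the download log in this thread.
EXPECTED_TXT = (
    "list_attr_celeba.txt",
    "identity_CelebA.txt",
    "list_bbox_celeba.txt",
    "list_landmarks_align_celeba.txt",
    "list_eval_partition.txt",
)


def missing_celeba_files(root, base_folder="celeba"):
    """Return the expected CelebA files that are absent under root/base_folder.

    Hypothetical helper; it checks presence only, not content or checksums.
    """
    folder = os.path.join(root, base_folder)
    missing = [
        name for name in EXPECTED_TXT
        if not os.path.isfile(os.path.join(folder, name))
    ]
    # The extracted image directory is what the overridden check relies on.
    if not os.path.isdir(os.path.join(folder, "img_align_celeba")):
        missing.append("img_align_celeba/")
    return missing


if __name__ == "__main__":
    # "Datasets" is the root used in the original bug report.
    print(missing_celeba_files("Datasets"))
```

An empty list means all five annotation files and the image directory are in place. Anything listed would need to be fetched from a source you trust before passing the same root to the subclass.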

hh10, Mar 28 '25 12:03

Hey @Isaac-Hirsch, thanks for reporting the issue, and sorry for the delayed response.

I'm having trouble reproducing the problem. Please find below what I get using torchvision version 0.21.0+cu124 and torch version 2.6.0+cu124.

I also computed the MD5 checksums of all the ZIP files and compared them against the values in the file_list of the CelebA class; they match.

Are you still encountering the issue on your end?

CIFAR10 = torchvision.datasets.CelebA(root="Datasets", download=True)
Downloading...
From: https://drive.google.com/uc?id=0B7EVK8r0v71pZjFTYXZWM3FlRnM
To: /content/Datasets/celeba/img_align_celeba.zip
100%|██████████| 1.44G/1.44G [00:18<00:00, 79.6MB/s]
Downloading...
From: https://drive.google.com/uc?id=0B7EVK8r0v71pblRyaVFSWGxPY0U
To: /content/Datasets/celeba/list_attr_celeba.txt
100%|██████████| 26.7M/26.7M [00:00<00:00, 49.9MB/s]
Downloading...
From: https://drive.google.com/uc?id=1_ee_0u7vcNLOfNLegJRHmolfH5ICW-XS
To: /content/Datasets/celeba/identity_CelebA.txt
100%|██████████| 3.42M/3.42M [00:00<00:00, 37.4MB/s]
Downloading...
From: https://drive.google.com/uc?id=0B7EVK8r0v71pbThiMVRxWXZ4dU0
To: /content/Datasets/celeba/list_bbox_celeba.txt
100%|██████████| 6.08M/6.08M [00:00<00:00, 126MB/s]
Downloading...
From: https://drive.google.com/uc?id=0B7EVK8r0v71pd0FJY3Blby1HUTQ
To: /content/Datasets/celeba/list_landmarks_align_celeba.txt
100%|██████████| 12.2M/12.2M [00:00<00:00, 50.1MB/s]
Downloading...
From: https://drive.google.com/uc?id=0B7EVK8r0v71pY0NSMzRuSXJEVkk
To: /content/Datasets/celeba/list_eval_partition.txt
100%|██████████| 2.84M/2.84M [00:00<00:00, 163MB/s]
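
For anyone who wants to repeat the checksum comparison on their own machine, here is a rough sketch using only hashlib. `md5_of_file` and `find_mismatches` are hypothetical helper names. The `expected` mapping is something you would fill in yourself, for example from the `(file_id, md5, filename)` tuples in `CelebA.file_list` that the traceback in this issue iterates over; no real checksum values are reproduced here.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(folder, expected):
    """Return the file names whose MD5 differs from `expected` or that are absent.

    `expected` maps file name -> hex MD5 string supplied by the caller.
    """
    bad = []
    for name, want in expected.items():
        path = os.path.join(folder, name)
        if not os.path.isfile(path) or md5_of_file(path) != want.lower():
            bad.append(name)
    return bad
```

A call might look like `find_mismatches("Datasets/celeba", {name: md5 for _, md5, name in dataset_cls.file_list})`, where `dataset_cls` stands for whichever CelebA class you imported. An empty result would agree with what is reported in this comment.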

AntoineSimoulin, May 21 '25 19:05