CelebA dataset cannot be downloaded
🐛 Describe the bug
36 CIFAR10 = datasets.CelebA(root="Datasets", download=True)
File /apps/python/3.10/3.10.9/lib/python3.10/site-packages/torchvision/datasets/celeba.py:80, in CelebA.__init__(self, root, split, target_type, transform, target_transform, download)
     77     raise RuntimeError("target_transform is specified but target_type is empty")
     79 if download:
---> 80     self.download()
     82 if not self._check_integrity():
     83     raise RuntimeError("Dataset not found or corrupted. You can use download=True to download it")

File /apps/python/3.10/3.10.9/lib/python3.10/site-packages/torchvision/datasets/celeba.py:150, in CelebA.download(self)
    147     return
    149 for (file_id, md5, filename) in self.file_list:
--> 150     download_file_from_google_drive(file_id, os.path.join(self.root, self.base_folder), filename, md5)
    152 extract_archive(os.path.join(self.root, self.base_folder, "img_align_celeba.zip"))

File /apps/python/3.10/3.10.9/lib/python3.10/site-packages/torchvision/datasets/utils.py:280, in download_file_from_google_drive(file_id, root, filename, md5)
    272     warnings.warn(
    273         f"We detected some HTML elements in the downloaded file. "
    274         f"This most likely means that the download triggered an unhandled API response by GDrive. "
    275         f"Please report this to torchvision at https://github.com/pytorch/vision/issues including "
    276         f"the response:\n\n{text}"
    277     )
    279 if md5 and not check_md5(fpath, md5):
--> 280     raise RuntimeError(
    281         f"The MD5 checksum of the download file {fpath} does not match the one on record."
    282         f"Please delete the file and try again. "
    283         f"If the issue persists, please report this to torchvision at https://github.com/pytorch/vision/issues."
    284     )
RuntimeError: The MD5 checksum of the download file Datasets/celeba/img_align_celeba.zip does not match the one on record.Please delete the file and try again. If the issue persists, please report this to torchvision at https://github.com/pytorch/vision/issues.
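The warning text in the traceback suggests the saved file may be an HTML response from Google Drive rather than the archive itself. A minimal sketch, using only the standard library, for looking at what actually landed on disk before retrying. The function name `describe_download` is hypothetical (it is not part of torchvision), and the path is the one from the error message above.

```python
import zipfile
from pathlib import Path


def describe_download(path):
    """Return a rough guess at what a downloaded file really contains."""
    p = Path(path)
    if not p.is_file():
        return "missing"
    # A real archive has a readable end-of-central-directory record.
    if zipfile.is_zipfile(str(p)):
        return "zip"
    # Read only the first bytes; the real archive is over a gigabyte.
    with p.open("rb") as fh:
        head = fh.read(512).lstrip().lower()
    if head.startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"


# Path taken from the RuntimeError above.
print(describe_download("Datasets/celeba/img_align_celeba.zip"))
```

If this prints `html`, the checksum mismatch is probably not a corrupted archive but a Drive-side response (for example a quota or confirmation page) that was saved under the archive's name.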
Versions
Collecting environment information...
PyTorch version: 1.12.1+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Red Hat Enterprise Linux release 8.10 (Ootpa) (x86_64)
GCC version: (GCC) 8.5.0 20210514 (Red Hat 8.5.0-22)
Clang version: 17.0.6 (Red Hat 17.0.6-1.module+el8.10.0+20808+e12784c0)
CMake version: version 3.26.5
Libc version: glibc-2.28

Python version: 3.10.9 (main, Feb 22 2023, 19:43:33) [GCC 8.5.0 20210514 (Red Hat 8.5.0-16)] (64-bit runtime)
Python platform: Linux-4.18.0-553.8.1.el8_10.x86_64-x86_64-with-glibc2.28
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 25
Model: 17
Model name: AMD EPYC 9334 32-Core Processor
Stepping: 1
CPU MHz: 3894.037
CPU max MHz: 3910.2529
CPU min MHz: 1500.0000
BogoMIPS: 5391.75
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 32768K
NUMA node0 CPU(s): 0-31
NUMA node1 CPU(s): 32-63
Flags: (removed)

Versions of relevant libraries:
[pip3] flake8==6.0.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.2
[pip3] nvidia-cublas-cu11==11.10.3.66
[pip3] nvidia-cuda-nvrtc-cu11==11.7.99
[pip3] nvidia-cuda-runtime-cu11==11.7.99
[pip3] nvidia-cudnn-cu11==8.5.0.96
[pip3] torch==1.12.1+cu113
[pip3] torchaudio==0.12.1+cu113
[pip3] torchvision==0.13.1+cu113
[conda] Could not collect
I also encountered the same issue. Does anyone know when it will be fixed?
A workaround can be:
- manually downloading from the following links (because gdrive links from the official source aren't working for me for some reason):
cd <data_dir>
mkdir celeba && cd celeba
wget https://cseweb.ucsd.edu/~weijian/static/datasets/celeba/img_align_celeba.zip
unzip img_align_celeba.zip
rm img_align_celeba.zip
wget https://raw.githubusercontent.com/KaiserW/bald-recognition/refs/heads/master/dataset/celeba/list_eval_partition.txt
wget https://raw.githubusercontent.com/KaiserW/bald-recognition/refs/heads/master/dataset/celeba/list_attr_celeba.txt
wget https://raw.githubusercontent.com/KaiserW/bald-recognition/refs/heads/master/dataset/celeba/identity_CelebA.txt
- overriding the _check_integrity method of CelebA (or, alternatively, updating the checksums for these files in the child class, but only if you are sure of their source):
import os

from torchvision import datasets


class WorkingCelebA(datasets.CelebA):
    # copied from https://pytorch.org/vision/main/_modules/torchvision/datasets/celeba.html#CelebA
    def _check_integrity(self) -> bool:
        for (_, md5, filename) in self.file_list:
            fpath = os.path.join(self.root, self.base_folder, filename)
            _, ext = os.path.splitext(filename)
            # Allow original archive to be deleted (zip and 7z)
            # Only need the extracted images
            if ext in [".zip", ".7z"]:
                continue
            if not datasets.utils.check_integrity(fpath, md5):
                # only printing, instead of returning False
                print("Failed to check integrity of", fpath)
                # return False

        # Should check a hash of the images
        return os.path.isdir(os.path.join(self.root, self.base_folder, "img_align_celeba"))
- constructing dataset as usual with this class.
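Before constructing the dataset, the manual layout from the steps above can be sanity-checked. This is a minimal sketch with the standard library only. The helper name `missing_celeba_items` is hypothetical, and the second file list is an assumption: recent torchvision versions appear to also read `list_bbox_celeba.txt` and `list_landmarks_align_celeba.txt` at construction time, and the wget commands above do not fetch those two.

```python
from pathlib import Path

# Files the manual steps above place under <data_dir>/celeba.
DOWNLOADED_FILES = [
    "list_eval_partition.txt",
    "list_attr_celeba.txt",
    "identity_CelebA.txt",
]
# Assumption: also read by the CelebA constructor in recent torchvision
# versions; not covered by the wget commands above.
POSSIBLY_REQUIRED_FILES = [
    "list_bbox_celeba.txt",
    "list_landmarks_align_celeba.txt",
]


def missing_celeba_items(root, base_folder="celeba"):
    """List the expected files or folders that are absent under root/base_folder."""
    base = Path(root) / base_folder
    missing = []
    # unzip creates this folder next to the annotation files.
    if not (base / "img_align_celeba").is_dir():
        missing.append("img_align_celeba/")
    for name in DOWNLOADED_FILES + POSSIBLY_REQUIRED_FILES:
        if not (base / name).is_file():
            missing.append(name)
    return missing
```

Calling `missing_celeba_items("<data_dir>")` returns an empty list when everything is in place. If only the two bbox/landmark files are reported, whether that matters depends on the installed torchvision version.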
Hey @Isaac-Hirsch, Thanks for reporting the issue, and sorry for the delayed response.
I'm having trouble reproducing the problem. Please find below what I get using torchvision version 0.21.0+cu124 and torch version 2.6.0+cu124.
I also verified the MD5 checksums of all the downloaded files, including the ZIP archive, against the values in the file_list attribute of the CelebA class, and they match.
Are you still encountering the issue on your end?
CIFAR10 = torchvision.datasets.CelebA(root="Datasets", download=True)
Downloading...
From: https://drive.google.com/uc?id=0B7EVK8r0v71pZjFTYXZWM3FlRnM
To: /content/Datasets/celeba/img_align_celeba.zip
100%|██████████| 1.44G/1.44G [00:18<00:00, 79.6MB/s]
Downloading...
From: https://drive.google.com/uc?id=0B7EVK8r0v71pblRyaVFSWGxPY0U
To: /content/Datasets/celeba/list_attr_celeba.txt
100%|██████████| 26.7M/26.7M [00:00<00:00, 49.9MB/s]
Downloading...
From: https://drive.google.com/uc?id=1_ee_0u7vcNLOfNLegJRHmolfH5ICW-XS
To: /content/Datasets/celeba/identity_CelebA.txt
100%|██████████| 3.42M/3.42M [00:00<00:00, 37.4MB/s]
Downloading...
From: https://drive.google.com/uc?id=0B7EVK8r0v71pbThiMVRxWXZ4dU0
To: /content/Datasets/celeba/list_bbox_celeba.txt
100%|██████████| 6.08M/6.08M [00:00<00:00, 126MB/s]
Downloading...
From: https://drive.google.com/uc?id=0B7EVK8r0v71pd0FJY3Blby1HUTQ
To: /content/Datasets/celeba/list_landmarks_align_celeba.txt
100%|██████████| 12.2M/12.2M [00:00<00:00, 50.1MB/s]
Downloading...
From: https://drive.google.com/uc?id=0B7EVK8r0v71pY0NSMzRuSXJEVkk
To: /content/Datasets/celeba/list_eval_partition.txt
100%|██████████| 2.84M/2.84M [00:00<00:00, 163MB/s]
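For anyone who wants to repeat the checksum comparison locally, here is a rough sketch of one way to do it with `hashlib`. The helper names `md5_of_file` and `compare_with_file_list` are hypothetical; the only torchvision detail relied on is that `CelebA.file_list` holds `(file_id, md5, filename)` tuples, as the loop in the traceback above shows.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_with_file_list(folder, file_list):
    """Map each filename to 'ok', 'mismatch', or 'missing'.

    file_list is an iterable of (file_id, md5, filename) tuples,
    the same shape as CelebA.file_list.
    """
    results = {}
    for _file_id, expected_md5, filename in file_list:
        path = os.path.join(folder, filename)
        if not os.path.isfile(path):
            results[filename] = "missing"
        elif md5_of_file(path) == expected_md5:
            results[filename] = "ok"
        else:
            results[filename] = "mismatch"
    return results


# Example use (requires torchvision and a finished download):
# from torchvision.datasets import CelebA
# print(compare_with_file_list("Datasets/celeba", CelebA.file_list))
```

Any entry reported as `mismatch` is worth opening or inspecting directly, since a small file under the archive's name usually means the download did not return the dataset.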