# `load_dataset` using default cache on Windows causes PermissionError: [WinError 5] Access is denied
## Describe the bug

The standard process to download and load the `wiki_bio` dataset causes a `PermissionError` on Windows 10 and 11.
## Steps to reproduce the bug

```python
from datasets import load_dataset
ds = load_dataset('wiki_bio')
```
## Expected results

The dataset downloads and loads without any errors.
## Actual results

`PermissionError`; see the trace below:

```
Using custom data configuration default
Downloading and preparing dataset wiki_bio/default (download: 318.53 MiB, generated: 736.94 MiB, post-processed: Unknown size, total: 1.03 GiB) to C:\Users\username\.cache\huggingface\datasets\wiki_bio\default\1.1.0\5293ce565954ba965dada626f1e79684e98172d950371d266bf3caaf87e911c9...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\username\.conda\envs\hf\lib\site-packages\datasets\load.py", line 1112, in load_dataset
    builder_instance.download_and_prepare(
  File "C:\Users\username\.conda\envs\hf\lib\site-packages\datasets\builder.py", line 644, in download_and_prepare
    self._save_info()
  File "C:\Users\username\.conda\envs\hf\lib\contextlib.py", line 120, in __exit__
    next(self.gen)
  File "C:\Users\username\.conda\envs\hf\lib\site-packages\datasets\builder.py", line 598, in incomplete_dir
    os.rename(tmp_dir, dirname)
PermissionError: [WinError 5] Access is denied: 'C:\\Users\\username\\.cache\\huggingface\\datasets\\wiki_bio\\default\\1.1.0\\5293ce565954ba965dada626f1e79684e98172d950371d266bf3caaf87e911c9.incomplete' -> 'C:\\Users\\username\\.cache\\huggingface\\datasets\\wiki_bio\\default\\1.1.0\\5293ce565954ba965dada626f1e79684e98172d950371d266bf3caaf87e911c9'
```
By commenting out the `os.rename()` (L604) and `shutil.rmtree()` (L607) lines in my virtual environment, I was able to get the load process to complete, rename the directory manually, and then rerun `load_dataset('wiki_bio')` to get what I needed.

It seems that `os.rename()` in the `incomplete_dir` context manager is the culprit. Here's another project, Conan, with a similar issue with `os.rename()`, if it helps debug this issue.
## Environment info

- `datasets` version: 1.12.1
- Platform: Windows-10-10.0.22449-SP0
- Python version: 3.8.12
- PyArrow version: 5.0.0
---

Hi @daqieq, thanks for reporting.
Unfortunately, I was not able to reproduce this bug:
```
In [1]: from datasets import load_dataset
   ...: ds = load_dataset('wiki_bio')
Downloading: 7.58kB [00:00, 26.3kB/s]
Downloading: 2.71kB [00:00, ?B/s]
Using custom data configuration default
Downloading and preparing dataset wiki_bio/default (download: 318.53 MiB, generated: 736.94 MiB, post-processed: Unknown size, total: 1.03 GiB) to C:\Users\username\.cache\huggingface\datasets\wiki_bio\default\1.1.0\5293ce565954ba965dada626f1e79684e98172d950371d266bf3caaf87e911c9...
Downloading: 334MB [01:17, 4.32MB/s]
Dataset wiki_bio downloaded and prepared to C:\Users\username\.cache\huggingface\datasets\wiki_bio\default\1.1.0\5293ce565954ba965dada626f1e79684e98172d950371d266bf3caaf87e911c9. Subsequent calls will reuse this data.
```
This kind of error message usually happens because:

- Your running Python script doesn't have write access to that directory
- You have another program (the File Explorer?) already browsing inside that directory
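One quick way to rule out a problem with the default cache location is to pass an explicit `cache_dir` to `load_dataset` (a minimal sketch; `./hf_cache` is just an example path):

```python
from datasets import load_dataset

# Point the cache at a directory the current user definitely owns,
# to rule out permission problems with the default ~/.cache location.
ds = load_dataset('wiki_bio', cache_dir='./hf_cache')
```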
---

Thanks @albertvillanova for looking at it! I tried on my personal Windows machine and it downloaded just fine.

Running on my work machine and on a colleague's machine, it consistently hits this error. It's not a write-access issue, because the `.incomplete` directory is written just fine; it just won't rename, and then the directory is deleted in the `finally` step. Also, the zip file is written and extracted fine in the downloads directory.

That leaves another program that might be interfering, and there are plenty of those on my work machine ... (full antivirus, data loss prevention, etc.). So the question remains: why not extend the `try` block to catch the error and circle back to the rename after the unknown program has finished doing its 'stuff'? This is the approach I read about in the linked repo (see my comments above).

If it's not high priority, that's fine. However, if someone were to write a PR that solved this issue in our environment with an `except` clause, would it be reviewed for inclusion in a future release? Just wondering whether I should spend any more time on this issue.
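For illustration, a minimal sketch of what such a retry could look like, modeled loosely on the `incomplete_dir` context manager in `builder.py` (the attempt count and back-off here are assumptions, not the library's actual behavior):

```python
import os
import shutil
import time
from contextlib import contextmanager

@contextmanager
def incomplete_dir(dirname):
    """Create `dirname + '.incomplete'`, yield it, then rename it to `dirname`.

    On Windows, os.rename() can fail with PermissionError while another
    process (antivirus, indexer, File Explorer) holds a handle on the
    directory, so retry a few times before giving up.
    """
    tmp_dir = dirname + ".incomplete"
    os.makedirs(tmp_dir, exist_ok=True)
    try:
        yield tmp_dir
        for attempt in range(5):
            try:
                os.rename(tmp_dir, dirname)
                break
            except PermissionError:
                if attempt == 4:
                    raise  # give up after the last attempt
                time.sleep(2 ** attempt)  # back off and retry
    finally:
        # Clean up the temporary directory if the rename never succeeded.
        if os.path.exists(tmp_dir):
            shutil.rmtree(tmp_dir)
```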
---

Hi @albertvillanova, even I am facing the same issue on my work machine:

```
Downloading and preparing dataset json/c4-en-html-with-metadata to C:\Users\......\.cache\huggingface\datasets\json\c4-en-html-with-metadata-4635c2fd9249f62d\0.0.0\c90812beea906fcffe0d5e3bb9eba909a80a998b5f88e9f8acbd320aa91acfde...
100%|███████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 983.42it/s]
100%|███████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 209.01it/s]
Traceback (most recent call last):
  File "bsmetadata/preprocessing_utils.py", line 710, in <module>
    ds = load_dataset(
  File "C:\Users\.......\AppData\Roaming\Python\Python38\site-packages\datasets\load.py", line 1694, in load_dataset
    builder_instance.download_and_prepare(
  File "C:\Users\........\AppData\Roaming\Python\Python38\site-packages\datasets\builder.py", line 603, in download_and_prepare
    self._save_info()
  File "C:\Users\..........\AppData\Local\Programs\Python\Python38\lib\contextlib.py", line 120, in __exit__
    next(self.gen)
  File "C:\Users\.....\AppData\Roaming\Python\Python38\site-packages\datasets\builder.py", line 557, in incomplete_dir
    os.rename(tmp_dir, dirname)
PermissionError: [WinError 5] Access is denied: 'C:\\Users\\.........\\.cache\\huggingface\\datasets\\json\\c4-en-html-with-metadata-4635c2fd9249f62d\\0.0.0\\c90812beea906fcffe0d5e3bb9eba909a80a998b5f88e9f8acbd320aa91acfde.incomplete' -> 'C:\\Users\\I355109\\.cache\\huggingface\\datasets\\json\\c4-en-html-with-metadata-4635c2fd9249f62d\\0.0.0\\c90812beea906fcffe0d5e3bb9eba909a80a998b5f88e9f8acbd320aa91acfde'
```
---

I'm facing the same issue.

## System Information

- OS Edition: Windows 10 21H1
- OS build: 19043.1826
- Python version: 3.10.6 (installed using `choco install python`)
- datasets: 2.4.0
- PyArrow: 6.0.1
## Troubleshooting steps

- Restarted the computer; unfortunately, it doesn't work! 🌚
- Checked the permissions of `~/.cache/...`; they look fine.
- Tested a simple file operation using the `open()` function, writing a hello_world.txt; it works fine.
- Tested a different `cache_dir` value on `load_dataset()`, e.g. "./data".
- Tested different datasets: `conll2003`, `squad_v2`, and `wiki_bio`.
- Downgraded datasets from `2.4.0` to `2.1.0`; the issue persists.
- Tested it on WSL (Ubuntu 20.04), and it works!
- Reinstalled Python; the first time, downloading `conll2003` works fine, but `squad` or `squad_v2` raises Access Denied. After a system or VSCode restart, the issue comes back.
## Resolution

I fixed it by changing the following line:

https://github.com/huggingface/datasets/blob/68cffe30917a9abed68d28caf54b40c10f977602/src/datasets/builder.py#L666

to:

`shutil.move(tmp_dir, dirname)`
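This likely works because `shutil.move()` first tries `os.rename()` and, on an `OSError` (of which `PermissionError` is a subclass), falls back to copying the tree and removing the source. A minimal, self-contained sketch of the behavior (the directory names here are hypothetical stand-ins for the builder's paths):

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical stand-ins for the builder's temporary and final cache dirs.
base = Path(tempfile.mkdtemp())
tmp_dir = base / "wiki_bio.incomplete"
dirname = base / "wiki_bio"
tmp_dir.mkdir()

# shutil.move() first attempts os.rename(); if that raises OSError
# (PermissionError included), it copies the tree and deletes the source,
# which sidesteps handles that other processes hold on the source dir.
shutil.move(str(tmp_dir), str(dirname))
print(dirname.exists())  # True
```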