
load_dataset using default cache on Windows causes PermissionError: [WinError 5] Access is denied

daqieq opened this issue 3 years ago • 4 comments

Describe the bug

The standard process to download and load the wiki_bio dataset causes a PermissionError on Windows 10 and 11.

Steps to reproduce the bug

from datasets import load_dataset
ds = load_dataset('wiki_bio')

Expected results

It is expected that the dataset downloads without any errors.

Actual results

PermissionError; see the trace below:

Using custom data configuration default
Downloading and preparing dataset wiki_bio/default (download: 318.53 MiB, generated: 736.94 MiB, post-processed: Unknown size, total: 1.03 GiB) to C:\Users\username\.cache\huggingface\datasets\wiki_bio\default\1.1.0\5293ce565954ba965dada626f1e79684e98172d950371d266bf3caaf87e911c9...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\username\.conda\envs\hf\lib\site-packages\datasets\load.py", line 1112, in load_dataset
    builder_instance.download_and_prepare(
  File "C:\Users\username\.conda\envs\hf\lib\site-packages\datasets\builder.py", line 644, in download_and_prepare
    self._save_info()
  File "C:\Users\username\.conda\envs\hf\lib\contextlib.py", line 120, in __exit__
    next(self.gen)
  File "C:\Users\username\.conda\envs\hf\lib\site-packages\datasets\builder.py", line 598, in incomplete_dir
    os.rename(tmp_dir, dirname)
PermissionError: [WinError 5] Access is denied: 'C:\\Users\\username\\.cache\\huggingface\\datasets\\wiki_bio\\default\\1.1.0\\5293ce565954ba965dada626f1e79684e98172d950371d266bf3caaf87e911c9.incomplete' -> 'C:\\Users\\username\\.cache\\huggingface\\datasets\\wiki_bio\\default\\1.1.0\\5293ce565954ba965dada626f1e79684e98172d950371d266bf3caaf87e911c9'

By commenting out the os.rename() (L604) and shutil.rmtree() (L607) lines in my virtual environment, I was able to get the load process to complete. I then renamed the directory manually and reran load_dataset('wiki_bio') to get what I needed.

It seems that os.rename() in the incomplete_dir context manager is the culprit. Here's another project, Conan, that hit a similar issue with os.rename(), in case it helps debug this one.
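
For reference, the context manager in question has roughly this shape (a simplified paraphrase reconstructed from the traceback above, not the exact datasets source):

import contextlib
import os
import shutil

@contextlib.contextmanager
def incomplete_dir(dirname):
    # Build into "<dirname>.incomplete", then rename it on success.
    tmp_dir = dirname + ".incomplete"
    os.makedirs(tmp_dir)
    try:
        yield tmp_dir
        # This is the rename that raises [WinError 5] on the affected machines.
        os.rename(tmp_dir, dirname)
    finally:
        # On failure, the partially built directory is discarded, which is
        # why nothing is left behind after the PermissionError.
        if os.path.exists(tmp_dir):
            shutil.rmtree(tmp_dir)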

Environment info

  • datasets version: 1.12.1
  • Platform: Windows-10-10.0.22449-SP0
  • Python version: 3.8.12
  • PyArrow version: 5.0.0

daqieq commented on Sep 17 '21 16:09

Hi @daqieq, thanks for reporting.

Unfortunately, I was not able to reproduce this bug:

In [1]: from datasets import load_dataset
   ...: ds = load_dataset('wiki_bio')
Downloading: 7.58kB [00:00, 26.3kB/s]
Downloading: 2.71kB [00:00, ?B/s]
Using custom data configuration default
Downloading and preparing dataset wiki_bio/default (download: 318.53 MiB, generated: 736.94 MiB, post-processed: Unknown size, total: 1.03 GiB) to C:\Users\username\.cache\huggingface\datasets\wiki_bio\default\1.1.0\5293ce565954ba965dada626f1e79684e98172d950371d266bf3caaf87e911c9...
Downloading: 334MB [01:17, 4.32MB/s]
Dataset wiki_bio downloaded and prepared to C:\Users\username\.cache\huggingface\datasets\wiki_bio\default\1.1.0\5293ce565954ba965dada626f1e79684e98172d950371d266bf3caaf87e911c9. Subsequent calls will reuse this data.

This kind of error message usually happens because:

  • your running Python script doesn't have write access to that directory (a quick check for this is sketched below the list)
  • you have another program (the File Explorer?) already browsing inside that directory
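
A quick way to rule out the first cause is to attempt the same kind of rename by hand in the cache directory (a hypothetical sanity check, not something from this thread; the test directory names are placeholders):

import os

cache = os.path.join(os.path.expanduser("~"), ".cache", "huggingface", "datasets")
src = os.path.join(cache, "_rename_test.incomplete")
dst = os.path.join(cache, "_rename_test")
os.makedirs(src, exist_ok=True)
os.rename(src, dst)  # raises PermissionError if something is locking the directory
os.rmdir(dst)
print("rename works, so write access is not the problem")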

albertvillanova commented on Sep 20 '21 07:09

Thanks @albertvillanova for looking at it! I tried on my personal Windows machine and it downloaded just fine.

Running on my work machine and on a colleague's machine, it consistently hits this error. It's not a write-access issue, because the .incomplete directory is written just fine; it just won't rename, and then the directory is deleted in the finally step. Also, the zip file is written and extracted fine in the downloads directory.

That leaves another program that might be interfering, and there are plenty of those on my work machine ... (full antivirus, data loss prevention, etc.). So the question remains: why not extend the try block so the error can be caught, and then circle back to the rename after the unknown program has finished doing its 'stuff'? This is the approach I read about in the linked repo (see my comments above); a sketch of the idea follows.
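
Such an except clause could look roughly like this (a sketch of the retry idea from the linked Conan issue; rename_with_retry is a hypothetical helper, not part of datasets):

import os
import time

def rename_with_retry(src, dst, attempts=5, delay=0.5):
    # Retry the rename a few times, giving an antivirus or indexing
    # service time to release its transient lock on the directory.
    for attempt in range(attempts):
        try:
            os.rename(src, dst)
            return
        except PermissionError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)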

If it's not high priority, that's fine. However, if someone were to write a PR that solved this issue in our environment with an except clause, would it be reviewed for inclusion in a future release? Just wondering whether I should spend any more time on this issue.

daqieq commented on Sep 21 '21 02:09

Hi @albertvillanova, I am also facing the same issue on my work machine:

Downloading and preparing dataset json/c4-en-html-with-metadata to C:\Users\......\.cache\huggingface\datasets\json\c4-en-html-with-metadata-4635c2fd9249f62d\0.0.0\c90812beea906fcffe0d5e3bb9eba909a80a998b5f88e9f8acbd320aa91acfde...
100%|███████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 983.42it/s]
100%|███████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 209.01it/s]
Traceback (most recent call last):
  File "bsmetadata/preprocessing_utils.py", line 710, in <module>
    ds = load_dataset(
  File "C:\Users\.......\AppData\Roaming\Python\Python38\site-packages\datasets\load.py", line 1694, in load_dataset
    builder_instance.download_and_prepare(
  File "C:\Users\........\AppData\Roaming\Python\Python38\site-packages\datasets\builder.py", line 603, in download_and_prepare
    self._save_info()
  File "C:\Users\..........\AppData\Local\Programs\Python\Python38\lib\contextlib.py", line 120, in __exit__
    next(self.gen)
  File "C:\Users\.....\AppData\Roaming\Python\Python38\site-packages\datasets\builder.py", line 557, in incomplete_dir
    os.rename(tmp_dir, dirname)
PermissionError: [WinError 5] Access is denied: 'C:\\Users\\.........\\.cache\\huggingface\\datasets\\json\\c4-en-html-with-metadata-4635c2fd9249f62d\\0.0.0\\c90812beea906fcffe0d5e3bb9eba909a80a998b5f88e9f8acbd320aa91acfde.incomplete' -> 'C:\\Users\\I355109\\.cache\\huggingface\\datasets\\json\\c4-en-html-with-metadata-4635c2fd9249f62d\\0.0.0\\c90812beea906fcffe0d5e3bb9eba909a80a998b5f88e9f8acbd320aa91acfde'

manandey commented on Jan 29 '22 03:01

I'm facing the same issue.

System Information

  • OS Edition: Windows 10 21H1
  • OS build: 19043.1826
  • Python version: 3.10.6 (installed using choco install python)
  • datasets: 2.4.0
  • PyArrow: 6.0.1

Troubleshooting steps

  • Restarted the computer; unfortunately, that doesn't work! 🌚
  • Checked the permissions of ~/.cache/...; they look fine.
  • Tested a simple file operation using the open() function, writing a hello_world.txt; it works fine.
  • Tested a different cache_dir value in load_dataset(), e.g. "./data".
  • Tested different datasets: conll2003, squad_v2, and wiki_bio.
  • Downgraded datasets from 2.4.0 to 2.1.0; the issue persists.
  • Tested it on WSL (Ubuntu 20.04), and it works!
  • Reinstalled Python; the first time, downloading conll2003 works fine, but squad or squad_v2 raises Access Denied.
    • After a system or VSCode restart, the issue comes back.

Resolution

I fixed it by changing the following line:

https://github.com/huggingface/datasets/blob/68cffe30917a9abed68d28caf54b40c10f977602/src/datasets/builder.py#L666

to

shutil.move(tmp_dir, dirname)
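
This works because shutil.move() first tries os.rename() and, if that raises an OSError (PermissionError is a subclass), falls back to copying and then removing the source, which sidesteps the transient lock on the rename. For anyone who would rather not edit site-packages, a similar effect can be had by temporarily patching os.rename around the call. A hypothetical workaround, with the fallback written as copy-then-delete so it does not re-enter the patched name:

import os
import shutil
from unittest.mock import patch

from datasets import load_dataset

_orig_rename = os.rename

def _rename_or_copy(src, dst):
    # Try the normal fast rename first; if Windows denies it (e.g. an
    # antivirus scanner holds a transient lock), copy and delete instead.
    try:
        _orig_rename(src, dst)
    except PermissionError:
        if os.path.isdir(src):
            shutil.copytree(src, dst)
            shutil.rmtree(src)
        else:
            shutil.copy2(src, dst)
            os.remove(src)

with patch("os.rename", _rename_or_copy):
    ds = load_dataset("wiki_bio")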

DougTrajano commented on Aug 10 '22 02:08