nanoGPT
Downloading and splitting openwebtext never succeeds
Who has the same issue?
Downloading and preparing dataset openwebtext/plain_text to C:/Users/liux3790/Desktop/download/cache/openwebtext/plain_text/1.0.0/85b3ae7051d2d72e7c5fdf6dfb462603aaa26e9ed506202bf3a24d261c6c40a1...
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████████████| 12.9G/12.9G [46:33<00:00, 4.61MB/s]
Computing checksums of downloaded files. They can be used for integrity verification. You can disable this by passing ignore_verifications=True to load_dataset
Computing checksums: 100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:12<00:00, 12.14s/it]
C:\Users\liux3790\AppData\Local\miniconda3\lib\site-packages\datasets\download\download_manager.py:536: FutureWarning: 'num_proc' was deprecated in version 2.6.2 and will be removed in 3.0.0. Pass DownloadConfig(num_proc=<num_proc>)
to the initializer instead.
warnings.warn(
Extracting data files: 100%|█████████████████████████████████████████████████████████████████████████| 20610/20610 [3:05:13<00:00, 1.85it/s]
Generating train split: 0%|▏ | 30430/8013769 [02:58<14:42:20, 150.80 examples/s]
Traceback (most recent call last):
File "C:\Users\liux3790\AppData\Local\miniconda3\lib\site-packages\datasets\builder.py", line 1571, in _prepare_split_single
for key, record in generator:
File "C:\Users\liux3790.cache\huggingface\modules\datasets_modules\datasets\openwebtext\85b3ae7051d2d72e7c5fdf6dfb462603aaa26e9ed506202bf3a24d261c6c40a1\openwebtext.py", line 85, in _generate_examples
File "C:\Users\liux3790\AppData\Local\miniconda3\lib\site-packages\datasets\streaming.py", line 70, in wrapper
return function(*args, use_auth_token=use_auth_token, **kwargs)
File "C:\Users\liux3790\AppData\Local\miniconda3\lib\site-packages\datasets\download\streaming_download_manager.py", line 482, in xopen
return open(main_hop, mode, *args, **kwargs)
OSError: [Errno 22] Invalid argument: 'C:\Users\liux3790\Desktop\download\cache\downloads\extracted\e492dab86df08fc0fc3601798767dd3c6db41e5f8caeb583dc1a84560657ec00\0015896-b1054262f7da52a0518521e29c8e352c.txt'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "c:\Users\liux3790\Desktop\nanoGPT\data\openwebtext\prepare.py", line 16, in
In another issue it was mentioned that Windows Defender deletes suspicious files. Make sure to temporarily deactivate it. I think it's important to deactivate the real time protection that performs live scans on newly extracted files. I also had to delete the download folder in the huggingface .cache directory and had success when starting from scratch.
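If it helps, here is a rough sketch of that cleanup in Python. It first checks whether any extracted file can no longer be opened (which would be consistent with something having touched the files after extraction), then removes the downloads folder so the next run starts from scratch. The function names `find_unreadable_files` and `clear_downloads` are made up for this example, and the cache root is an assumption: in the log above it is the folder that contains `downloads\extracted`, but your setup may differ.

```python
import shutil
from pathlib import Path


def find_unreadable_files(root):
    """Return the files under `root` that cannot be opened and read."""
    bad = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            # Reading a single byte is enough to trigger an OSError
            # if the file is blocked or otherwise unreadable.
            with open(path, "rb") as f:
                f.read(1)
        except OSError:
            bad.append(path)
    return bad


def clear_downloads(cache_root):
    """Delete the 'downloads' folder under `cache_root`.

    Returns True if a folder was removed, False if there was none.
    """
    downloads = Path(cache_root) / "downloads"
    if downloads.is_dir():
        shutil.rmtree(downloads)
        return True
    return False


# Hypothetical usage; replace the path with your own cache directory:
# cache_root = Path("C:/Users/<you>/Desktop/download/cache")
# print(find_unreadable_files(cache_root / "downloads" / "extracted"))
# clear_downloads(cache_root)
```

Note that this deletes the downloaded archive as well, so the 12.9G download and the extraction will run again.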
Hi guys, I had the same problem. I have an Intel Core i9-12900K CPU and an RTX 3080. It worked once I used the following settings in data/openwebtext/prepare.py:
# number of workers in .map() call
# good number to use is ~order number of cpu cores // 2
num_proc = 1
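# the full openwebtext takes 54GB in huggingface .cache dir, about 8M documents (8,013,769);
# stas/openwebtext-10k is a much smaller 10k-document subset, so it avoids the big download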
dataset = load_dataset("stas/openwebtext-10k")
thank you guys ! it worked !
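For anyone who wants to go back up from `num_proc = 1` later, the comment in the snippet above ("~order number of cpu cores // 2") can be sketched like this. `suggested_num_proc` is a hypothetical helper, not something that exists in nanoGPT, and `os.cpu_count()` reports logical cores, so treat the result as a starting point only.

```python
import os


def suggested_num_proc(cpu_count=None):
    """Return roughly half the CPU count, but never less than 1."""
    if cpu_count is None:
        # os.cpu_count() may return None if the count cannot be determined.
        cpu_count = os.cpu_count() or 1
    return max(1, cpu_count // 2)


# Value to try in prepare.py if num_proc = 1 turns out to be too slow.
num_proc = suggested_num_proc()
```

If the run fails again with a higher value, dropping back to `num_proc = 1` as in the post above is the safer choice.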