nanoGPT
Dataset load
Hello, I have an issue while loading my dataset in prepare.py (for openwebtext). The download and the extraction complete successfully, but the generation of the train split raises an error.
I've already tried looking for the file 0180327-a95f1342cd685fb7d22805aa720870d2.txt in the archive and adding it manually to the extracted dataset, but it doesn't work. ignore_verifications is False.
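For reference, the call in prepare.py that fails is essentially the one below (a minimal sketch based on the traceback further down; the real script goes on to tokenize the dataset afterwards):

from datasets import load_dataset

# this is the call that dies during "Generating train split";
# ignore_verifications is left at its default (False)
dataset = load_dataset("openwebtext")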
If you need more information, I can provide whatever you need.
Thanks for your help.
Config:
- AMD Ryzen 5 5600X
- Nvidia 3060 Ti (CUDA 11.7)
- 32 GB RAM (3200 MHz / CAS 16)
- Windows 10, 64-bit
- Python 3.9.13 (virtualenv)
Computing checksums of downloaded files. They can be used for integrity verification. You can disable this by passing ignore_verifications=True to load_dataset
Computing checksums: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:08<00:00, 8.82s/it]
C:\Users\emili\Desktop\nanoGPT\venv\lib\site-packages\datasets\download\download_manager.py:431: FutureWarning: 'num_proc' was deprecated in version 2.6.2 and will be removed in 3.0.0. Pass `DownloadConfig(num_proc=<num_proc>)` to the initializer instead.
warnings.warn(
Extracting data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20610/20610 [05:27<00:00, 62.85it/s]
Generating train split: 0%|▋ | 35271/8013769 [01:43<2:24:57, 917.33 examples/s]
Traceback (most recent call last):
File "C:\Users\emili\Desktop\nanoGPT\venv\lib\site-packages\datasets\builder.py", line 1570, in _prepare_split_single
for key, record in generator:
File "C:\Users\emili\.cache\huggingface\modules\datasets_modules\datasets\openwebtext\85b3ae7051d2d72e7c5fdf6dfb462603aaa26e9ed506202bf3a24d261c6c40a1\openwebtext.py", line 85, in _generate_examples
with open(filepath, encoding="utf-8") as f:
File "C:\Users\emili\Desktop\nanoGPT\venv\lib\site-packages\datasets\streaming.py", line 69, in wrapper
return function(*args, use_auth_token=use_auth_token, **kwargs)
File "C:\Users\emili\Desktop\nanoGPT\venv\lib\site-packages\datasets\download\streaming_download_manager.py", line 445, in xopen
return open(main_hop, mode, *args, **kwargs)
OSError: [Errno 22] Invalid argument: 'C:\\Users\\emili\\.cache\\huggingface\\datasets\\downloads\\extracted\\85b7a70ee547a4372aa7cf8fab0e93cd8849e09e1cba8454c1d113746400e918\\0180327-a95f1342cd685fb7d22805aa720870d2.txt'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "c:\Users\emili\Desktop\nanoGPT\data\openwebtext\prepare.py", line 15, in <module>
dataset = load_dataset("openwebtext")
File "C:\Users\emili\Desktop\nanoGPT\venv\lib\site-packages\datasets\load.py", line 1757, in load_dataset
builder_instance.download_and_prepare(
File "C:\Users\emili\Desktop\nanoGPT\venv\lib\site-packages\datasets\builder.py", line 860, in download_and_prepare
self._download_and_prepare(
File "C:\Users\emili\Desktop\nanoGPT\venv\lib\site-packages\datasets\builder.py", line 1611, in _download_and_prepare
super()._download_and_prepare(
File "C:\Users\emili\Desktop\nanoGPT\venv\lib\site-packages\datasets\builder.py", line 953, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "C:\Users\emili\Desktop\nanoGPT\venv\lib\site-packages\datasets\builder.py", line 1449, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "C:\Users\emili\Desktop\nanoGPT\venv\lib\site-packages\datasets\builder.py", line 1606, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
Same problem, any ideas?
It looks like Windows is deleting files that contain JS shellcode exploits, which causes the load to fail.
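One way to confirm this (a hypothetical check, using the path copied from the OSError above; adjust it for your own cache) is to see whether the file is still on disk:

import os

# path copied from the OSError in the traceback above -- adjust for your own cache
path = (r"C:\Users\emili\.cache\huggingface\datasets\downloads\extracted"
        r"\85b7a70ee547a4372aa7cf8fab0e93cd8849e09e1cba8454c1d113746400e918"
        r"\0180327-a95f1342cd685fb7d22805aa720870d2.txt")

if os.path.exists(path):
    print("file exists,", os.path.getsize(path), "bytes")
else:
    print("file is missing - likely quarantined or removed by antivirus")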
Same problem, any idea?
Setting num_proc = 1 and turning off all Windows Virus & threat protection and Firewall & network protection solved the problem for me.
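In practice that means something like the sketch below in data/openwebtext/prepare.py (assuming your copy exposes a num_proc variable like the upstream script; set it to 1 wherever it is used), then re-running with real-time protection temporarily disabled:

from datasets import load_dataset

num_proc = 1  # a single worker avoids the Windows file-handling issues

# recent versions of datasets accept num_proc in load_dataset;
# if your prepare.py also calls dataset.map(..., num_proc=...), set that to 1 too
dataset = load_dataset("openwebtext", num_proc=num_proc)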
Same issue here.