nanoGPT: downloading and splitting openwebtext never succeeds

Who has the same issue?

Downloading and preparing dataset openwebtext/plain_text to C:/Users/liux3790/Desktop/download/cache/openwebtext/plain_text/1.0.0/85b3ae7051d2d72e7c5fdf6dfb462603aaa26e9ed506202bf3a24d261c6c40a1...
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████████████| 12.9G/12.9G [46:33<00:00, 4.61MB/s]
Computing checksums of downloaded files. They can be used for integrity verification. You can disable this by passing ignore_verifications=True to load_dataset
Computing checksums: 100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:12<00:00, 12.14s/it]
C:\Users\liux3790\AppData\Local\miniconda3\lib\site-packages\datasets\download\download_manager.py:536: FutureWarning: 'num_proc' was deprecated in version 2.6.2 and will be removed in 3.0.0. Pass DownloadConfig(num_proc=<num_proc>) to the initializer instead.
  warnings.warn(
Extracting data files: 100%|█████████████████████████████████████████████████████████████████████████| 20610/20610 [3:05:13<00:00, 1.85it/s]
Generating train split: 0%|▏ | 30430/8013769 [02:58<14:42:20, 150.80 examples/s]
Traceback (most recent call last):
  File "C:\Users\liux3790\AppData\Local\miniconda3\lib\site-packages\datasets\builder.py", line 1571, in _prepare_split_single
    for key, record in generator:
  File "C:\Users\liux3790.cache\huggingface\modules\datasets_modules\datasets\openwebtext\85b3ae7051d2d72e7c5fdf6dfb462603aaa26e9ed506202bf3a24d261c6c40a1\openwebtext.py", line 85, in _generate_examples
  File "C:\Users\liux3790\AppData\Local\miniconda3\lib\site-packages\datasets\streaming.py", line 70, in wrapper
    return function(*args, use_auth_token=use_auth_token, **kwargs)
  File "C:\Users\liux3790\AppData\Local\miniconda3\lib\site-packages\datasets\download\streaming_download_manager.py", line 482, in xopen
    return open(main_hop, mode, *args, **kwargs)
OSError: [Errno 22] Invalid argument: 'C:\Users\liux3790\Desktop\download\cache\downloads\extracted\e492dab86df08fc0fc3601798767dd3c6db41e5f8caeb583dc1a84560657ec00\0015896-b1054262f7da52a0518521e29c8e352c.txt'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\liux3790\Desktop\nanoGPT\data\openwebtext\prepare.py", line 16, in <module>
    dataset = load_dataset("openwebtext", cache_dir="C:/Users/liux3790/Desktop/download/cache")
  File "C:\Users\liux3790\AppData\Local\miniconda3\lib\site-packages\datasets\load.py", line 1758, in load_dataset
    builder_instance.download_and_prepare(
  File "C:\Users\liux3790\AppData\Local\miniconda3\lib\site-packages\datasets\builder.py", line 860, in download_and_prepare
    self._download_and_prepare(
  File "C:\Users\liux3790\AppData\Local\miniconda3\lib\site-packages\datasets\builder.py", line 1612, in _download_and_prepare
    super()._download_and_prepare(
  File "C:\Users\liux3790\AppData\Local\miniconda3\lib\site-packages\datasets\builder.py", line 953, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "C:\Users\liux3790\AppData\Local\miniconda3\lib\site-packages\datasets\builder.py", line 1450, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "C:\Users\liux3790\AppData\Local\miniconda3\lib\site-packages\datasets\builder.py", line 1607, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

Feb 01 '23 01:02 hanfluid

In another issue it was mentioned that Windows Defender deletes suspicious files. Make sure to deactivate it temporarily. I think the important part is to deactivate the real-time protection, which performs live scans on newly extracted files. I also had to delete the downloads folder in the huggingface .cache directory, and it worked once I started from scratch.
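For anyone who wants to check this before re-running the full download, here is a rough sketch using only the Python standard library. The helper names `find_unreadable_files` and `remove_downloads` are made up for illustration and are not part of nanoGPT or the datasets library. It assumes the cache layout visible in the traceback above, where the extracted files live under a `downloads` folder inside the cache directory.

```python
import os
import shutil


def find_unreadable_files(root):
    """Return (path, error) pairs for files under `root` that cannot be opened.

    Hypothetical helper: it only detects files that are still listed on disk
    but fail to open, which is what the OSError in the traceback looks like.
    """
    problems = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    handle.read(1)  # one byte is enough to prove the file opens
            except OSError as exc:
                problems.append((path, exc))
    return problems


def remove_downloads(cache_dir):
    """Delete `<cache_dir>/downloads` if present; return True if it was removed.

    Hypothetical helper: this throws away the downloaded archive as well, so
    the next run of prepare.py has to download everything again.
    """
    downloads = os.path.join(cache_dir, "downloads")
    if os.path.isdir(downloads):
        shutil.rmtree(downloads)
        return True
    return False
```

With the cache directory from the original post, one could first call `find_unreadable_files("C:/Users/liux3790/Desktop/download/cache/downloads/extracted")` to see whether any extracted files fail to open, and only call `remove_downloads("C:/Users/liux3790/Desktop/download/cache")` if starting from scratch really seems necessary.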

Feb 01 '23 17:02 lakaschus

Hi guys, I had the same problem. I have an Intel Core i9-12900K CPU and an RTX 3080. I used the following parameters and it worked!

in data/openwebtext/prepare.py

# number of workers in .map() call
# good number to use is ~order number of cpu cores // 2
num_proc = 1

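# the full "openwebtext" dataset takes 54GB in huggingface .cache dir, about 8M documents (8,013,769);
# "stas/openwebtext-10k" below is a small 10k-document subset of it, so it is much lighter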
dataset = load_dataset("stas/openwebtext-10k")
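The comment about the number of workers can be turned into a small helper. This is only a sketch of the "cpu cores // 2" rule of thumb from the comment; `suggested_num_proc` is a made-up name, not something from nanoGPT. Note that `os.cpu_count()` reports logical CPUs, and that the setting used in this post is simply `num_proc = 1`, which is lower than what the heuristic would give.

```python
import os


def suggested_num_proc(cpu_count=None):
    """Return roughly half the available CPUs, but never less than 1.

    Hypothetical helper illustrating the "cpu cores // 2" heuristic.
    """
    if cpu_count is None:
        # os.cpu_count() can return None on some platforms; fall back to 1
        cpu_count = os.cpu_count() or 1
    return max(1, cpu_count // 2)
```

If multiprocessing is the suspected cause of the failure, falling back to `num_proc = 1` as above is the more conservative choice.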

Thank you guys! It worked!

Feb 17 '23 12:02 TheCyber91
