
Downloading and splitting openwebtext never succeeds

Open hanfluid opened this issue 1 year ago • 2 comments

Has anyone else run into this issue?

Downloading and preparing dataset openwebtext/plain_text to C:/Users/liux3790/Desktop/download/cache/openwebtext/plain_text/1.0.0/85b3ae7051d2d72e7c5fdf6dfb462603aaa26e9ed506202bf3a24d261c6c40a1...
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████████████| 12.9G/12.9G [46:33<00:00, 4.61MB/s]
Computing checksums of downloaded files. They can be used for integrity verification. You can disable this by passing ignore_verifications=True to load_dataset
Computing checksums: 100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:12<00:00, 12.14s/it]
C:\Users\liux3790\AppData\Local\miniconda3\lib\site-packages\datasets\download\download_manager.py:536: FutureWarning: 'num_proc' was deprecated in version 2.6.2 and will be removed in 3.0.0. Pass DownloadConfig(num_proc=<num_proc>) to the initializer instead.
  warnings.warn(
Extracting data files: 100%|█████████████████████████████████████████████████████████████████████████| 20610/20610 [3:05:13<00:00, 1.85it/s]
Generating train split: 0%|▏ | 30430/8013769 [02:58<14:42:20, 150.80 examples/s]
Traceback (most recent call last):
  File "C:\Users\liux3790\AppData\Local\miniconda3\lib\site-packages\datasets\builder.py", line 1571, in _prepare_split_single
    for key, record in generator:
  File "C:\Users\liux3790.cache\huggingface\modules\datasets_modules\datasets\openwebtext\85b3ae7051d2d72e7c5fdf6dfb462603aaa26e9ed506202bf3a24d261c6c40a1\openwebtext.py", line 85, in _generate_examples
  File "C:\Users\liux3790\AppData\Local\miniconda3\lib\site-packages\datasets\streaming.py", line 70, in wrapper
    return function(*args, use_auth_token=use_auth_token, **kwargs)
  File "C:\Users\liux3790\AppData\Local\miniconda3\lib\site-packages\datasets\download\streaming_download_manager.py", line 482, in xopen
    return open(main_hop, mode, *args, **kwargs)
OSError: [Errno 22] Invalid argument: 'C:\Users\liux3790\Desktop\download\cache\downloads\extracted\e492dab86df08fc0fc3601798767dd3c6db41e5f8caeb583dc1a84560657ec00\0015896-b1054262f7da52a0518521e29c8e352c.txt'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\liux3790\Desktop\nanoGPT\data\openwebtext\prepare.py", line 16, in <module>
    dataset = load_dataset("openwebtext", cache_dir="C:/Users/liux3790/Desktop/download/cache")
  File "C:\Users\liux3790\AppData\Local\miniconda3\lib\site-packages\datasets\load.py", line 1758, in load_dataset
    builder_instance.download_and_prepare(
  File "C:\Users\liux3790\AppData\Local\miniconda3\lib\site-packages\datasets\builder.py", line 860, in download_and_prepare
    self._download_and_prepare(
  File "C:\Users\liux3790\AppData\Local\miniconda3\lib\site-packages\datasets\builder.py", line 1612, in _download_and_prepare
    super()._download_and_prepare(
  File "C:\Users\liux3790\AppData\Local\miniconda3\lib\site-packages\datasets\builder.py", line 953, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "C:\Users\liux3790\AppData\Local\miniconda3\lib\site-packages\datasets\builder.py", line 1450, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "C:\Users\liux3790\AppData\Local\miniconda3\lib\site-packages\datasets\builder.py", line 1607, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
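
One way to narrow this down, independent of `datasets`, might be to try opening the file named in the `OSError` with a plain `open()` call. The following is a minimal sketch only: `probe_file` is a hypothetical helper (not part of nanoGPT or `datasets`), and it simply reports what the operating system says about a path. It does not identify the root cause on its own.

```python
import os
import sys


def probe_file(path, nbytes=64):
    """Report basic facts about `path` and whether a plain open() succeeds.

    Hypothetical diagnostic helper; not part of nanoGPT or `datasets`.
    """
    report = {
        "path_length": len(path),        # number of characters in the path
        "exists": os.path.exists(path),  # does the OS see the file at all?
        "size": None,                    # size in bytes, if it can be read
        "open_ok": False,                # did open() + read() succeed?
        "error": None,                   # OS error text, if any
    }
    try:
        report["size"] = os.path.getsize(path)
        with open(path, "rb") as f:
            f.read(nbytes)  # read a few bytes to force real file access
        report["open_ok"] = True
    except OSError as e:
        report["error"] = f"[Errno {e.errno}] {e.strerror}"
    return report


if __name__ == "__main__":
    # Usage: python probe_file.py <path> [<path> ...]
    for p in sys.argv[1:]:
        print(probe_file(p))
```

If the same `[Errno 22]` shows up here when the script is given the `.txt` path from the traceback, the failure is reproducible outside the dataset loader, which would point at the extracted file or the filesystem rather than at `prepare.py`.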

hanfluid, Feb 01 '23 01:02