
Dataset load

Open thremilien opened this issue 2 years ago • 5 comments

Hello, I have an issue while loading my dataset in prepare.py (for openwebtext). The download and extraction complete successfully, but generating the train split raises an error.

I've already tried locating the file 0180327-a95f1342cd685fb7d22805aa720870d2.txt in the archive and adding it manually to the extracted dataset, but that doesn't work. ignore_verifications is False.

If you need more information, I can provide whatever you need.

Thanks for your help

Config:

  • AMD Ryzen 5 5600X
  • Nvidia 3060 Ti (CUDA 11.7)
  • 32 GB RAM (3200 MHz/CAS16)
  • Windows 10 64-bit
  • Python 3.9.13 (virtualenv)
Computing checksums of downloaded files. They can be used for integrity verification. You can disable this by passing ignore_verifications=True to load_dataset
Computing checksums: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:08<00:00,  8.82s/it]
C:\Users\emili\Desktop\nanoGPT\venv\lib\site-packages\datasets\download\download_manager.py:431: FutureWarning: 'num_proc' was deprecated in version 2.6.2 and will be removed in 3.0.0. Pass `DownloadConfig(num_proc=<num_proc>)` to the initializer instead.
  warnings.warn(
Extracting data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20610/20610 [05:27<00:00, 62.85it/s]
Generating train split:   0%|▋                                                                                                                                                | 35271/8013769 [01:43<2:24:57, 917.33 examples/s]Traceback (most recent call last):
  File "C:\Users\emili\Desktop\nanoGPT\venv\lib\site-packages\datasets\builder.py", line 1570, in _prepare_split_single
    for key, record in generator:
  File "C:\Users\emili\.cache\huggingface\modules\datasets_modules\datasets\openwebtext\85b3ae7051d2d72e7c5fdf6dfb462603aaa26e9ed506202bf3a24d261c6c40a1\openwebtext.py", line 85, in _generate_examples
    with open(filepath, encoding="utf-8") as f:
  File "C:\Users\emili\Desktop\nanoGPT\venv\lib\site-packages\datasets\streaming.py", line 69, in wrapper
    return function(*args, use_auth_token=use_auth_token, **kwargs)
  File "C:\Users\emili\Desktop\nanoGPT\venv\lib\site-packages\datasets\download\streaming_download_manager.py", line 445, in xopen
    return open(main_hop, mode, *args, **kwargs)
OSError: [Errno 22] Invalid argument: 'C:\\Users\\emili\\.cache\\huggingface\\datasets\\downloads\\extracted\\85b7a70ee547a4372aa7cf8fab0e93cd8849e09e1cba8454c1d113746400e918\\0180327-a95f1342cd685fb7d22805aa720870d2.txt'    

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\emili\Desktop\nanoGPT\data\openwebtext\prepare.py", line 15, in <module>
    dataset = load_dataset("openwebtext")
  File "C:\Users\emili\Desktop\nanoGPT\venv\lib\site-packages\datasets\load.py", line 1757, in load_dataset
    builder_instance.download_and_prepare(
  File "C:\Users\emili\Desktop\nanoGPT\venv\lib\site-packages\datasets\builder.py", line 860, in download_and_prepare
    self._download_and_prepare(
  File "C:\Users\emili\Desktop\nanoGPT\venv\lib\site-packages\datasets\builder.py", line 1611, in _download_and_prepare
    super()._download_and_prepare(
  File "C:\Users\emili\Desktop\nanoGPT\venv\lib\site-packages\datasets\builder.py", line 953, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "C:\Users\emili\Desktop\nanoGPT\venv\lib\site-packages\datasets\builder.py", line 1449, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "C:\Users\emili\Desktop\nanoGPT\venv\lib\site-packages\datasets\builder.py", line 1606, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

thremilien, Jan 25 '23 10:01

Same problem, any ideas?

zjsuper, Jan 26 '23 21:01

It looks like Windows is deleting files that contain JS shellcode exploits, which causes the load to fail.

Coriana, Jan 29 '23 11:01
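
One way to check the hypothesis above is to scan the extracted folder for files that can no longer be opened. The following is only a minimal sketch, not part of nanoGPT or the datasets library: the helper name find_unreadable_files is hypothetical, and it assumes that a file blocked or removed by antivirus software surfaces as an OSError on open, as in the traceback above.

```python
import os
from pathlib import Path


def find_unreadable_files(root):
    """Return (path, error) pairs for files under root that cannot be opened and read.

    Hypothetical diagnostic helper; it only reads one byte per file and
    never modifies anything.
    """
    failures = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                # Binary mode avoids decoding errors; we only care whether
                # the OS lets us open and read the file at all.
                with open(path, "rb") as f:
                    f.read(1)
            except OSError as exc:
                failures.append((path, exc))
    return failures
```

To use it, pass the extracted directory from the traceback (the folder under .cache\huggingface\datasets\downloads\extracted) and print each returned pair. If files such as the one named in the OSError show up, that would be consistent with something outside Python blocking access to them.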

Same problem, any idea?

patrobadri, Jan 30 '23 09:01

Setting num_proc = 1 and turning off all of Windows' "Virus & threat protection" and "Firewall & network protection" solved the problem.

zjsuper, Jan 30 '23 15:01

Same issue here.

hanfluid, Jan 31 '23 16:01