running data/openwebtext/prepare.py gives "enc is not defined" error

Open rfernand2 opened this issue 1 year ago • 1 comments

On Windows 11 with Python 3.9.0, running prepare.py fails while tokenizing the splits. The outer traceback points at line 50, but the error actually occurs inside the process() function (line 43 in the traceback below). The "enc" defined at module level on line 39 is not visible when process() is called in the worker processes.

An easy (and verified) workaround: copy line 39 into the first line of process().
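
For anyone wanting to see the shape of that workaround, here is a minimal sketch. It assumes line 39 is the usual `enc = tiktoken.get_encoding("gpt2")` assignment; that line is not quoted in this issue, so treat it as an assumption. The returned dict is illustrative only, and the real process() in prepare.py may do more than this.

```python
def process(example):
    # Workaround sketch: create the tokenizer inside the function so the
    # worker process does not rely on a module-level `enc` being defined.
    import tiktoken  # third-party; imported here so each worker loads it itself

    # Assumption: this mirrors line 39 of prepare.py (not shown in the issue).
    enc = tiktoken.get_encoding("gpt2")

    ids = enc.encode_ordinary(example['text'])  # encode_ordinary ignores any special tokens

    # Illustrative return value; keep whatever the original function returns.
    return {'ids': ids, 'len': len(ids)}
```

This trades a tokenizer lookup on every call for not depending on module-level state in the workers, which is the only thing the workaround needs to change.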

FYI, here's the full traceback:

(tpx) d:\github\nanoGPT>python data/openwebtext/prepare.py
tokenizing the splits (num_proc=8):   0%|                                                                                  | 0/8009762 [00:07<?, ? examples/s]
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\multiprocess\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\datasets\utils\py_utils.py", line 1354, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\datasets\arrow_dataset.py", line 3450, in _map_single
    example = apply_function_on_filtered_inputs(example, i, offset=offset)
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\datasets\arrow_dataset.py", line 3353, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "d:\github\nanoGPT\data\openwebtext\prepare.py", line 43, in process
    ids = enc.encode_ordinary(example['text']) # encode_ordinary ignores any special tokens
NameError: name 'enc' is not defined
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "d:\github\nanoGPT\data\openwebtext\prepare.py", line 50, in <module>
    tokenized = split_dataset.map(
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\datasets\dataset_dict.py", line 853, in map
    {
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\datasets\dataset_dict.py", line 854, in <dictcomp>
    k: dataset.map(
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\datasets\arrow_dataset.py", line 592, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\datasets\arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\datasets\arrow_dataset.py", line 3189, in map
    for rank, done, content in iflatmap_unordered(
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\datasets\utils\py_utils.py", line 1394, in iflatmap_unordered
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\datasets\utils\py_utils.py", line 1394, in <listcomp>
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\multiprocess\pool.py", line 771, in get
    raise self._value
NameError: name 'enc' is not defined

rfernand2 avatar Sep 06 '23 17:09 rfernand2

Did you ever sort this one out?

jdietzChina avatar Dec 04 '23 11:12 jdietzChina

The PR from @vinjn above ^^ worked for me.

calmitchell617 avatar Feb 09 '24 16:02 calmitchell617