OLMo
Tokenizer for `prepare_tulu_data.py` not found
🐛 Describe the bug
First of all, thanks a lot for open-sourcing OLMo!

I tried running `scripts/prepare_tulu_data.py` and hit the following error:
```
2024-02-02 05:36:05.619 5a0b0b9dc92e:0 olmo.util:152 CRITICAL Uncaught RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-65bc7f45-0e21e73019c220330516da6e;60158b38-c3b8-476f-9e0f-90513ca1b707)

Repository Not Found for url: https://huggingface.co/tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special.json/resolve/main/tokenizer.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py", line 270, in hf_raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.10/dist-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special.json/resolve/main/tokenizer.json
```
My understanding is that the error stems from the default `--tokenizer` argument in `prepare_tulu_data.py`, shown below: the script attempts to download the tokenizer from the Hugging Face Hub, but the default value is a local file path rather than a Hub repository ID.
```python
parser.add_argument(
    "-t",
    "--tokenizer",
    type=str,
    help="""Tokenizer path or identifier.""",
    default="tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special.json",  # this line
)
```
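For context, here is a minimal sketch of the resolution behavior the traceback implies (a hypothetical reconstruction, not the actual `olmo/tokenizer.py` code; `load_tokenizer` is an invented name): an identifier that does not exist on disk gets forwarded to the Hub as a repo ID, which is why a relative default path produces a `RepositoryNotFoundError` when the script runs from outside the repo root.

```python
import os

from tokenizers import Tokenizer as BaseTokenizer


def load_tokenizer(identifier: str) -> BaseTokenizer:
    """Hypothetical sketch of the loading logic implied by the traceback."""
    if os.path.isfile(identifier):
        # A relative path only resolves when the current working directory
        # is the OLMo repo root, where tokenizers/ lives.
        return BaseTokenizer.from_file(identifier)
    # Anything that is not a local file is treated as a Hub repo ID, so
    # "tokenizers/allenai_...-special.json" becomes a (nonexistent) repo URL.
    return BaseTokenizer.from_pretrained(identifier)
```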
I tried removing the `.json` extension from the end of the path, but to no avail.

Am I doing something wrong, or is there a known fix?
TIA!
Versions
Python 3.10.12
`python scripts/prepare_tulu_data.py ./train_data` is okay.
Hey @tanaymeh, can you post the full traceback and the exact command you ran? Thanks!
I ran into a similar problem.
@epwalsh Thanks for responding! I ran the command below in Google Colab to run the dataset-preparation script:

```
%%sh
python /content/OLMo/scripts/prepare_tulu_data.py output_dir=/content/
```

Below is the complete trace:
```
2024-02-02 05:56:02.520 5a0b0b9dc92e:0 olmo.util:152 CRITICAL Uncaught RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-65bc83f2-4c6b53e471ebcee444c9fb45;edb992e6-37ff-4be5-a071-04ac76629e6a)

Repository Not Found for url: https://huggingface.co/tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special/resolve/main/tokenizer.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py", line 270, in hf_raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.10/dist-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special/resolve/main/tokenizer.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/content/OLMo/scripts/prepare_tulu_data.py", line 131, in <module>
    main(opts)
  File "/content/OLMo/scripts/prepare_tulu_data.py", line 25, in main
    tokenizer = Tokenizer.from_pretrained(opts.tokenizer, eos_token_id=opts.eos, pad_token_id=opts.pad)
  File "/content/OLMo/olmo/tokenizer.py", line 83, in from_pretrained
    base_tokenizer = BaseTokenizer.from_pretrained(identifier)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1374, in hf_hub_download
    raise head_call_error
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1247, in hf_hub_download
    metadata = get_hf_file_metadata(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1624, in get_hf_file_metadata
    r = _request_wrapper(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 402, in _request_wrapper
    response = _request_wrapper(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 426, in _request_wrapper
    hf_raise_for_status(response)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py", line 320, in hf_raise_for_status
    raise RepositoryNotFoundError(message, response) from e
huggingface_hub.utils._errors.RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-65bc83f2-4c6b53e471ebcee444c9fb45;edb992e6-37ff-4be5-a071-04ac76629e6a)

Repository Not Found for url: https://huggingface.co/tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special/resolve/main/tokenizer.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.

---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
<ipython-input-11-39809f057efe> in <cell line: 1>()
----> 1 get_ipython().run_cell_magic('sh', '', 'python /content/OLMo/scripts/prepare_tulu_data.py output_dir=/content/\n')

4 frames
/usr/local/lib/python3.10/dist-packages/google/colab/_shell.py in run_cell_magic(self, magic_name, line, cell)
    332     if line and not cell:
    333       cell = ' '
--> 334     return super().run_cell_magic(magic_name, line, cell)
    335
    336

/usr/local/lib/python3.10/dist-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
   2471         with self.builtin_trap:
   2472             args = (magic_arg_s, cell)
-> 2473             result = fn(*args, **kwargs)
   2474         return result
   2475

/usr/local/lib/python3.10/dist-packages/IPython/core/magics/script.py in named_script_magic(line, cell)
    140         else:
    141             line = script
--> 142         return self.shebang(line, cell)
    143
    144     # write a basic docstring:

<decorator-gen-103> in shebang(self, line, cell)

/usr/local/lib/python3.10/dist-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
    185     # but it's overkill for just that one bit of state.
    186     def magic_deco(arg):
--> 187         call = lambda f, *a, **k: f(*a, **k)
    188
    189         if callable(arg):

/usr/local/lib/python3.10/dist-packages/IPython/core/magics/script.py in shebang(self, line, cell)
    243                 sys.stderr.flush()
    244         if args.raise_error and p.returncode!=0:
--> 245             raise CalledProcessError(p.returncode, cell, output=out, stderr=err)
    246
    247     def _run_script(self, p, cell, to_close):

CalledProcessError: Command 'b'python /content/OLMo/scripts/prepare_tulu_data.py output_dir=/content/\n'' returned non-zero exit status 1.
```
Edit: the traceback above says `Repository Not Found` for https://huggingface.co/tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special/resolve/main/tokenizer.json, but the same error persisted for `allenai_eleuther-ai-gpt-neox-20b-pii-special.json`, which is the URL present in the original source code.
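A quick check (assuming the fallback logic sketched earlier) shows why both variants fail when the working directory is `/content` rather than the repo root: neither resolves as a local file, so both are sent to the Hub as repo IDs.

```python
import os

# With cwd=/content (not /content/OLMo), neither candidate exists on
# disk, so both fall through to a Hub lookup and fail with a 401.
for path in (
    "tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special.json",
    "tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special",
):
    print(path, os.path.isfile(path))  # -> False for both
```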
@tanaymeh, oh I think the issue is that the default value for `--tokenizer` is a relative path, relative to the root of the OLMo repo. So if you run the script from inside that directory it should work; otherwise you can set `--tokenizer=/content/OLMo/tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special.json`.
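In the Colab setup above, either of the following should work (a sketch assuming the repo is cloned at `/content/OLMo`; the `./train_data` output path mirrors the command used earlier in this thread):

```
%%sh
# Option 1: run from the repo root so the relative default resolves.
cd /content/OLMo
python scripts/prepare_tulu_data.py ./train_data

# Option 2: point --tokenizer at the absolute path of the local file.
python /content/OLMo/scripts/prepare_tulu_data.py ./train_data \
    --tokenizer=/content/OLMo/tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special.json
```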
Fixed in `80db5e3d`.