OLMo
Tokenizer for `prepare_tulu_data.py` not found
🐛 Describe the bug
First of all, thanks a lot for open-sourcing OLMo!

I tried running `scripts/prepare_tulu_data.py` and hit the following error:
```
2024-02-02 05:36:05.619 5a0b0b9dc92e:0 olmo.util:152 CRITICAL Uncaught RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-65bc7f45-0e21e73019c220330516da6e;60158b38-c3b8-476f-9e0f-90513ca1b707)

Repository Not Found for url: https://huggingface.co/tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special.json/resolve/main/tokenizer.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py", line 270, in hf_raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.10/dist-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special.json/resolve/main/tokenizer.json
```
My understanding is that the error stems from the default `--tokenizer` argument in `prepare_tulu_data.py`, shown below: the script attempts to download the tokenizer from the Hugging Face Hub, but the default value is a local file path rather than a Hub repository ID.
```python
parser.add_argument(
    "-t",
    "--tokenizer",
    type=str,
    help="""Tokenizer path or identifier.""",
    default="tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special.json",  # this line
)
```
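For context, here is a minimal sketch of the resolution behavior the traceback implies (a hypothetical reconstruction, not the actual `olmo/tokenizer.py` code; `load_tokenizer` is an invented name): an identifier that does not exist on disk gets forwarded to the Hub as a repo ID, which is why a relative default path produces a `RepositoryNotFoundError` when the script runs from outside the repo root.

```python
import os

from tokenizers import Tokenizer as BaseTokenizer


def load_tokenizer(identifier: str) -> BaseTokenizer:
    """Hypothetical sketch of the loading logic implied by the traceback."""
    if os.path.isfile(identifier):
        # A relative path only resolves when the current working directory
        # is the OLMo repo root, where tokenizers/ lives.
        return BaseTokenizer.from_file(identifier)
    # Anything that is not a local file is treated as a Hub repo ID, so
    # "tokenizers/allenai_...-special.json" becomes a (nonexistent) repo URL.
    return BaseTokenizer.from_pretrained(identifier)
```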
I tried removing the `.json` extension from the end of the path, but to no avail.

Am I doing something wrong, or is there a known fix?
TIA!
Versions
Python 3.10.12
`python scripts/prepare_tulu_data.py ./train_data` is okay.
Hey @tanaymeh, can you post the full traceback and the exact command you ran? Thanks!
I ran into a similar problem.
@epwalsh Thanks for responding! I ran the command below in Google Colab to run the dataset-preparation script:

```
%%sh
python /content/OLMo/scripts/prepare_tulu_data.py output_dir=/content/
```

Below is the complete trace:
```
2024-02-02 05:56:02.520 5a0b0b9dc92e:0 olmo.util:152 CRITICAL Uncaught RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-65bc83f2-4c6b53e471ebcee444c9fb45;edb992e6-37ff-4be5-a071-04ac76629e6a)

Repository Not Found for url: https://huggingface.co/tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special/resolve/main/tokenizer.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py", line 270, in hf_raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.10/dist-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special/resolve/main/tokenizer.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/content/OLMo/scripts/prepare_tulu_data.py", line 131, in <module>
    main(opts)
  File "/content/OLMo/scripts/prepare_tulu_data.py", line 25, in main
    tokenizer = Tokenizer.from_pretrained(opts.tokenizer, eos_token_id=opts.eos, pad_token_id=opts.pad)
  File "/content/OLMo/olmo/tokenizer.py", line 83, in from_pretrained
    base_tokenizer = BaseTokenizer.from_pretrained(identifier)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1374, in hf_hub_download
    raise head_call_error
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1247, in hf_hub_download
    metadata = get_hf_file_metadata(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1624, in get_hf_file_metadata
    r = _request_wrapper(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 402, in _request_wrapper
    response = _request_wrapper(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 426, in _request_wrapper
    hf_raise_for_status(response)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py", line 320, in hf_raise_for_status
    raise RepositoryNotFoundError(message, response) from e
huggingface_hub.utils._errors.RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-65bc83f2-4c6b53e471ebcee444c9fb45;edb992e6-37ff-4be5-a071-04ac76629e6a)

Repository Not Found for url: https://huggingface.co/tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special/resolve/main/tokenizer.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.

---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
<ipython-input-11-39809f057efe> in <cell line: 1>()
----> 1 get_ipython().run_cell_magic('sh', '', 'python /content/OLMo/scripts/prepare_tulu_data.py output_dir=/content/\n')

4 frames
/usr/local/lib/python3.10/dist-packages/google/colab/_shell.py in run_cell_magic(self, magic_name, line, cell)
    332     if line and not cell:
    333       cell = ' '
--> 334     return super().run_cell_magic(magic_name, line, cell)
    335
    336

/usr/local/lib/python3.10/dist-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
   2471         with self.builtin_trap:
   2472             args = (magic_arg_s, cell)
-> 2473             result = fn(*args, **kwargs)
   2474         return result
   2475

/usr/local/lib/python3.10/dist-packages/IPython/core/magics/script.py in named_script_magic(line, cell)
    140         else:
    141             line = script
--> 142         return self.shebang(line, cell)
    143
    144     # write a basic docstring:

<decorator-gen-103> in shebang(self, line, cell)

/usr/local/lib/python3.10/dist-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
    185     # but it's overkill for just that one bit of state.
    186     def magic_deco(arg):
--> 187         call = lambda f, *a, **k: f(*a, **k)
    188
    189         if callable(arg):

/usr/local/lib/python3.10/dist-packages/IPython/core/magics/script.py in shebang(self, line, cell)
    243                 sys.stderr.flush()
    244         if args.raise_error and p.returncode!=0:
--> 245             raise CalledProcessError(p.returncode, cell, output=out, stderr=err)
    246
    247     def _run_script(self, p, cell, to_close):

CalledProcessError: Command 'b'python /content/OLMo/scripts/prepare_tulu_data.py output_dir=/content/\n'' returned non-zero exit status 1.
```
Edit: the traceback above says `Repository Not Found` for https://huggingface.co/tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special/resolve/main/tokenizer.json, but the same error persisted for `allenai_eleuther-ai-gpt-neox-20b-pii-special.json`, which is the URL present in the original source code.
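A quick check (assuming the fallback logic sketched earlier) shows why both variants fail when the working directory is `/content` rather than the repo root: neither resolves as a local file, so both are sent to the Hub as repo IDs.

```python
import os

# With cwd=/content (not /content/OLMo), neither candidate exists on
# disk, so both fall through to a Hub lookup and fail with a 401.
for path in (
    "tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special.json",
    "tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special",
):
    print(path, os.path.isfile(path))  # -> False for both
```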
@tanaymeh, oh I think the issue is that the default value for `--tokenizer` is a relative path, relative to the root of the OLMo repo. So if you run the script from inside that directory it should work; otherwise you can set `--tokenizer=/content/OLMo/tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special.json`.
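In the Colab setup above, either of the following should work (a sketch assuming the repo is cloned at `/content/OLMo`; the `./train_data` output path mirrors the command used earlier in this thread):

```
%%sh
# Option 1: run from the repo root so the relative default resolves.
cd /content/OLMo
python scripts/prepare_tulu_data.py ./train_data

# Option 2: point --tokenizer at the absolute path of the local file.
python /content/OLMo/scripts/prepare_tulu_data.py ./train_data \
    --tokenizer=/content/OLMo/tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special.json
```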
Fixed in `80db5e3d`.