TinyLlama
TinyLlama copied to clipboard
Unable to pretrain: tokenizer raises NotImplementedError
When following PRETRAIN.md and running one of the data prep scripts:
python scripts/prepare_slimpajama.py --source_path datasets/SlimPajama-627B/ --tokenizer_path data/llama --destination_path data/slim_star_combined --split validation --percentage 1.0
The tokenizer throws this. It seems a checkpoint is first needed, data/llama
? How do you get this?
Process Process-1:
Traceback (most recent call last):
File "/home/ubuntu/.asdf/installs/python/3.8.18/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/ubuntu/.asdf/installs/python/3.8.18/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "scripts/prepare_slimpajama.py", line 39, in prepare_full
tokenizer = Tokenizer(tokenizer_path)
File "/home/ubuntu/TinyLlama/lit_gpt/tokenizer.py", line 29, in __init__
raise NotImplementedError
NotImplementedError
Process Process-2:
Traceback (most recent call last):
File "/home/ubuntu/.asdf/installs/python/3.8.18/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/ubuntu/.asdf/installs/python/3.8.18/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "scripts/prepare_slimpajama.py", line 39, in prepare_full
tokenizer = Tokenizer(tokenizer_path)
File "/home/ubuntu/TinyLlama/lit_gpt/tokenizer.py", line 29, in __init__
raise NotImplementedError
NotImplementedError
Process Process-3:
Traceback (most recent call last):
File "/home/ubuntu/.asdf/installs/python/3.8.18/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/ubuntu/.asdf/installs/python/3.8.18/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "scripts/prepare_slimpajama.py", line 39, in prepare_full
tokenizer = Tokenizer(tokenizer_path)
File "/home/ubuntu/TinyLlama/lit_gpt/tokenizer.py", line 29, in __init__
raise NotImplementedError
NotImplementedError
Process Process-4:
Traceback (most recent call last):
File "/home/ubuntu/.asdf/installs/python/3.8.18/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/ubuntu/.asdf/installs/python/3.8.18/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "scripts/prepare_slimpajama.py", line 39, in prepare_full
tokenizer = Tokenizer(tokenizer_path)
File "/home/ubuntu/TinyLlama/lit_gpt/tokenizer.py", line 29, in __init__
raise NotImplementedError
NotImplementedError
Time taken: 0.02 seconds
I've met the same error. If you fixed it, let me know please
Hi, you can download the tokenizer with mkdir data && cd data && mkdir llama && cd llama && wget https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-480k-1T/blob/main/tokenizer.model && cd ../..
That URL will serve you a redirect, so wget will download an html file and name it tokenizer.model.