
Unable to pretrain: tokenizer raises NotImplementedError

Open zxti opened this issue 1 year ago • 1 comments

When following PRETRAIN.md and running one of the data prep scripts:

python scripts/prepare_slimpajama.py --source_path datasets/SlimPajama-627B/ --tokenizer_path data/llama --destination_path data/slim_star_combined --split validation --percentage 1.0

The tokenizer raises the NotImplementedError shown below. It seems a tokenizer checkpoint is needed at data/llama first. How do you get it?

Process Process-1:
Traceback (most recent call last):
  File "/home/ubuntu/.asdf/installs/python/3.8.18/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ubuntu/.asdf/installs/python/3.8.18/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "scripts/prepare_slimpajama.py", line 39, in prepare_full
    tokenizer = Tokenizer(tokenizer_path)
  File "/home/ubuntu/TinyLlama/lit_gpt/tokenizer.py", line 29, in __init__
    raise NotImplementedError
NotImplementedError
Process Process-2:
Traceback (most recent call last):
  File "/home/ubuntu/.asdf/installs/python/3.8.18/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ubuntu/.asdf/installs/python/3.8.18/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "scripts/prepare_slimpajama.py", line 39, in prepare_full
    tokenizer = Tokenizer(tokenizer_path)
  File "/home/ubuntu/TinyLlama/lit_gpt/tokenizer.py", line 29, in __init__
    raise NotImplementedError
NotImplementedError
Process Process-3:
Traceback (most recent call last):
  File "/home/ubuntu/.asdf/installs/python/3.8.18/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ubuntu/.asdf/installs/python/3.8.18/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "scripts/prepare_slimpajama.py", line 39, in prepare_full
    tokenizer = Tokenizer(tokenizer_path)
  File "/home/ubuntu/TinyLlama/lit_gpt/tokenizer.py", line 29, in __init__
    raise NotImplementedError
NotImplementedError
Process Process-4:
Traceback (most recent call last):
  File "/home/ubuntu/.asdf/installs/python/3.8.18/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ubuntu/.asdf/installs/python/3.8.18/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "scripts/prepare_slimpajama.py", line 39, in prepare_full
    tokenizer = Tokenizer(tokenizer_path)
  File "/home/ubuntu/TinyLlama/lit_gpt/tokenizer.py", line 29, in __init__
    raise NotImplementedError
NotImplementedError
Time taken: 0.02 seconds

zxti avatar Jan 19 '24 00:01 zxti
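For anyone hitting the same traceback, a quick check before launching the prep script can confirm whether the directory passed as --tokenizer_path actually contains a tokenizer file. This is only a sketch: the two file names below are an assumption about what lit_gpt/tokenizer.py looks for, and find_tokenizer_file is a hypothetical helper, not part of the repo.

```python
from pathlib import Path

# Assumption: the loader looks for one of these files in the tokenizer directory.
EXPECTED_FILES = ("tokenizer.model", "tokenizer.json")


def find_tokenizer_file(tokenizer_dir):
    """Return the first expected tokenizer file found in tokenizer_dir, or None."""
    directory = Path(tokenizer_dir)
    for name in EXPECTED_FILES:
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    # "data/llama" is the --tokenizer_path value from the command above.
    found = find_tokenizer_file("data/llama")
    print(found if found else "no tokenizer file found in data/llama")
```

If this prints that no file was found, the directory is empty or missing, which would be consistent with the error above.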

I've hit the same error. If you fixed it, please let me know.

m0Nst3r873 avatar Jan 21 '24 18:01 m0Nst3r873

Hi, you can download the tokenizer with mkdir data && cd data && mkdir llama && cd llama && wget https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-480k-1T/blob/main/tokenizer.model && cd ../..

ChaosCodes avatar Feb 08 '24 14:02 ChaosCodes

That URL will serve you a redirect, so wget will download an HTML file and name it tokenizer.model.

awgr avatar Mar 09 '24 09:03 awgr
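A possible workaround, sketched under the assumption that Hugging Face serves the raw file under /resolve/ instead of /blob/ in the URL path: rewrite the URL before downloading, then check that the saved file is not an HTML page. Both helper names are hypothetical and the HTML check is only a heuristic on the first bytes of the file.

```python
def to_resolve_url(url):
    """Swap the first '/blob/' path segment for '/resolve/' (assumed raw-file path)."""
    return url.replace("/blob/", "/resolve/", 1)


def looks_like_html(path):
    """Return True if the file starts like an HTML page rather than binary model data."""
    with open(path, "rb") as handle:
        head = handle.read(512).lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


if __name__ == "__main__":
    blob_url = (
        "https://huggingface.co/TinyLlama/"
        "TinyLlama-1.1B-intermediate-step-480k-1T/blob/main/tokenizer.model"
    )
    # Pass the printed URL to wget in place of the /blob/ one.
    print(to_resolve_url(blob_url))
```

After downloading, looks_like_html("data/llama/tokenizer.model") returning True would suggest the page was saved instead of the tokenizer file.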