
Weight conversion of llama-13b fails with: RuntimeError: Internal: unk is not defined.

Ahtesham00 opened this issue Apr 19, 2023 · 2 comments

System Info

OS: Ubuntu

Virtual env (pip freeze):

    accelerate==0.18.0
    certifi==2022.12.7
    charset-normalizer==3.1.0
    cmake==3.26.3
    filelock==3.12.0
    huggingface-hub==0.13.4
    idna==3.4
    Jinja2==3.1.2
    lit==16.0.1
    MarkupSafe==2.1.2
    mpmath==1.3.0
    networkx==3.1
    numpy==1.24.2
    nvidia-cublas-cu11==11.10.3.66
    nvidia-cuda-cupti-cu11==11.7.101
    nvidia-cuda-nvrtc-cu11==11.7.99
    nvidia-cuda-runtime-cu11==11.7.99
    nvidia-cudnn-cu11==8.5.0.96
    nvidia-cufft-cu11==10.9.0.58
    nvidia-curand-cu11==10.2.10.91
    nvidia-cusolver-cu11==11.4.0.1
    nvidia-cusparse-cu11==11.7.4.91
    nvidia-nccl-cu11==2.14.3
    nvidia-nvtx-cu11==11.7.91
    packaging==23.1
    psutil==5.9.5
    PyYAML==6.0
    regex==2023.3.23
    requests==2.28.2
    sentencepiece==0.1.98
    sympy==1.11.1
    tokenizers==0.13.3
    torch==2.0.0
    tqdm==4.65.0
    transformers==4.28.1
    triton==2.0.0
    typing_extensions==4.5.0
    urllib3==1.26.15

Who can help?

@ArthurZucker @younesbelkada

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

Used the following command to convert the llama-13B weights to the HF format:

python src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir /home/unconveretd-weights --model_size 13B --output_dir /home/test-converted
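For readers hitting this, a quick pre-flight check can save a run: the conversion script expects `tokenizer.model` at the top of `--input_dir`, next to a `<model_size>` folder holding `params.json` and the `consolidated.*.pth` shards, and a missing or wrong `tokenizer.model` is what later surfaces as `unk is not defined`. A minimal sketch of such a check (a hypothetical helper, not part of transformers; it only probes for the first shard):

```python
# Hypothetical pre-flight check before running convert_llama_weights_to_hf.py.
# Verifies the directory layout the script expects; it does NOT validate that
# tokenizer.model is a well-formed sentencepiece model.
from pathlib import Path


def missing_llama_files(input_dir: str, model_size: str = "13B") -> list[str]:
    """Return the expected files that are absent from input_dir."""
    root = Path(input_dir)
    expected = [
        root / "tokenizer.model",
        root / model_size / "params.json",
        root / model_size / "consolidated.00.pth",  # first shard at minimum
    ]
    return [str(p) for p in expected if not p.exists()]
```

An empty return means the layout looks right; anything listed should be put in place before re-running the conversion.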

Expected behavior

It should generate the converted weights, but instead it produces this error:

    Loading the checkpoint in a Llama model.
    Loading checkpoint shards: 100%|██████████| 41/41 [00:17<00:00, 2.35it/s]
    Saving in the Transformers format.
    Saving a LlamaTokenizerFast to /home/test-converted.
    Traceback (most recent call last):
      File "/home/transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py", line 278, in <module>
        main()
      File "/home/transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py", line 274, in main
        write_tokenizer(args.output_dir, spm_path)
      File "/home/transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py", line 248, in write_tokenizer
        tokenizer = tokenizer_class(input_tokenizer_path)
      File "/home/myenv/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama_fast.py", line 89, in __init__
        super().__init__(
      File "/home/myenv/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 117, in __init__
        slow_tokenizer = self.slow_tokenizer_class(*args, **kwargs)
      File "/home/myenv/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 96, in __init__
        self.sp_model.Load(vocab_file)
      File "/home/myenv/lib/python3.10/site-packages/sentencepiece/__init__.py", line 905, in Load
        return self.LoadFromFile(model_file)
      File "/home/myenv/lib/python3.10/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
        return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
    RuntimeError: Internal: unk is not defined.

Ahtesham00 avatar Apr 19 '23 19:04 Ahtesham00

facing the same issue.

Rachneet avatar Apr 19 '23 22:04 Rachneet

Hey! Thanks for reporting I'll investigate this!

ArthurZucker avatar Apr 24 '23 13:04 ArthurZucker

I have the same issue when I use the latest version of torch.

ChongWu-Biostat avatar May 18 '23 22:05 ChongWu-Biostat

I did not find a solution, but if someone wants to download already-converted weights, the following link has all the versions:

https://huggingface.co/elinas

Ahtesham00 avatar May 18 '23 22:05 Ahtesham00

Okay, we updated the conversion script, which should have fixed most issues. I downloaded the tokenizer model and re-tried the conversion without any issue. Make sure you are using the latest transformers version.

ArthurZucker avatar May 25 '23 07:05 ArthurZucker

I tried with the latest code from the main branch, but I am still getting the same issue:

[screenshot: same traceback as above]

dittops avatar May 27 '23 19:05 dittops

I am getting the same error message when running the conversion for the 7B model. Tried installing the latest version (4.29.2) but the error persists. Same traceback as @dittops but mine has a nicer formatting.

egoetz avatar Jun 06 '23 15:06 egoetz

Again, the issue is most probably with the tokenizer file that you are using, which is outdated. Yes you need to upgrade to the latest transformers version, but you also need to use the original sentencepiece model in order for the conversion to properly work!

ArthurZucker avatar Jun 06 '23 15:06 ArthurZucker

Thanks for following up. I have the llama weights/tokenizer that were updated on 3/26/23. Isn't that the latest version of the tokenizer?

Also I'm not sure what you mean by the original sentencepiece model (unless you mean the model from prior to the 3/26 update).

egoetz avatar Jun 06 '23 15:06 egoetz

When you say:

I have the llama weights/tokenizer that were updated on 3/26/23

do you mean the META weights and tokenizer? Otherwise can you share a notebook with a reproducer? The issue with llama is that a PR was made too early and thus lots of checkpoints and previous tokenizers (meaning hf tokenizers json) are incorrect.

ArthurZucker avatar Jun 06 '23 15:06 ArthurZucker

@ArthurZucker I have the META weights and tokenizer. The issue I shared is with those. For sentencepiece, is there a specific version to be used?

dittops avatar Jun 06 '23 15:06 dittops

I have the llama weights/tokenizer that were updated on 3/26/23

do you mean the META weights and tokenizer? Otherwise can you share a notebook with a reproducer? The issue with llama is that a PR was made too early and thus lots of checkpoints and previous tokenizers (meaning hf tokenizers json) are incorrect.

Ah I see. The llama weights I have come from Meta's torrent PR. I did not get them from HuggingFace, if you are referring to this PR.

egoetz avatar Jun 06 '23 17:06 egoetz

Ok 👍🏻 I'll give it another go, but I remember trying with those exact weights and getting a correct conversion. Will get back to you soon!

ArthurZucker avatar Jun 06 '23 18:06 ArthurZucker

Would you mind sending me the file via Google Drive? The torrent link seems to be down.

ArthurZucker avatar Jun 07 '23 20:06 ArthurZucker

The torrent is showing as up for me right now, but if it isn't working for you I am happy to send you a copy of the 7B folder I am using. The entire folder for the 7B model is ~13-14GB. I'm trying to compress it right now but it will take a little bit to finish.

egoetz avatar Jun 07 '23 20:06 egoetz

Just the tokenizer files are enough!

ArthurZucker avatar Jun 07 '23 20:06 ArthurZucker

Email sent!

egoetz avatar Jun 08 '23 15:06 egoetz

@egoetz were you able to solve this issue?

dittops avatar Jun 16 '23 15:06 dittops

@egoetz told me that installing Git LFS + using the tokenizer at huggyllama/llama-7b worked. I received the email but could not access the files, as they were shared not via Drive but a private mail provider 😅. If you are trying to convert the original model (by that I mean going from the spm model to transformers), make sure you have the latest version of transformers.

ArthurZucker avatar Jun 22 '23 12:06 ArthurZucker
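For anyone landing on this thread, a minimal sketch of fetching the known-good tokenizer mentioned above. It assumes the `huggingface_hub` package is installed (it ships as a transformers dependency); the function name and `dest_dir` parameter are illustrative, and actually running it requires network access:

```python
# Sketch only: fetch tokenizer.model from the huggyllama/llama-7b repo
# mentioned above, then re-run the conversion script with --input_dir
# pointing at the directory that now holds this file.
import shutil

from huggingface_hub import hf_hub_download


def fetch_known_good_tokenizer(dest_dir: str) -> str:
    """Download tokenizer.model into the HF cache, copy it to dest_dir,
    and return the destination path."""
    cached = hf_hub_download(
        repo_id="huggyllama/llama-7b",
        filename="tokenizer.model",
    )
    return shutil.copy(cached, dest_dir)


# Example (downloads the file on first call):
# path = fetch_known_good_tokenizer("/home/unconveretd-weights")
```

Note this is the workaround route; as discussed below, converting the original META checkpoint should not require an already-converted tokenizer.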

I was able to resolve it by replacing tokenizer.model with the one from Hugging Face. Thank you!

dittops avatar Jun 23 '23 15:06 dittops

I'm not sure I understand. If you are trying to convert a checkpoint/tokenizer, then you don't need to use an already converted one. The script is to go from the original tokenizer to the HF format.

ArthurZucker avatar Jun 26 '23 03:06 ArthurZucker

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jul 20 '23 15:07 github-actions[bot]