transformers
While converting llama-13b weights, getting this error: RuntimeError: Internal: unk is not defined.
System Info
OS : Ubuntu
Virtual Env :
accelerate==0.18.0 certifi==2022.12.7 charset-normalizer==3.1.0 cmake==3.26.3 filelock==3.12.0 huggingface-hub==0.13.4 idna==3.4 Jinja2==3.1.2 lit==16.0.1 MarkupSafe==2.1.2 mpmath==1.3.0 networkx==3.1 numpy==1.24.2 nvidia-cublas-cu11==11.10.3.66 nvidia-cuda-cupti-cu11==11.7.101 nvidia-cuda-nvrtc-cu11==11.7.99 nvidia-cuda-runtime-cu11==11.7.99 nvidia-cudnn-cu11==8.5.0.96 nvidia-cufft-cu11==10.9.0.58 nvidia-curand-cu11==10.2.10.91 nvidia-cusolver-cu11==11.4.0.1 nvidia-cusparse-cu11==11.7.4.91 nvidia-nccl-cu11==2.14.3 nvidia-nvtx-cu11==11.7.91 packaging==23.1 psutil==5.9.5 PyYAML==6.0 regex==2023.3.23 requests==2.28.2 sentencepiece==0.1.98 sympy==1.11.1 tokenizers==0.13.3 torch==2.0.0 tqdm==4.65.0 transformers==4.28.1 triton==2.0.0 typing_extensions==4.5.0 urllib3==1.26.15
Who can help?
@ArthurZucker @younesbelkada
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Used the following command to convert the llama-13B weights into the HF format.
python src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir /home/unconveretd-weights --model_size 13B --output_dir /home/test-converted
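For context, the conversion script reads from the original Meta release layout. A quick pre-flight check of the input directory can rule out a missing or misplaced tokenizer file before running the script. This helper is hypothetical, and the expected filenames (`tokenizer.model` at the top level, `params.json` inside the per-size folder) are assumptions based on the original release layout, not something the script documents:

```python
import os

def check_llama_dir(input_dir, model_size="13B"):
    """Sanity-check the input layout before running the conversion script.

    Assumes the original Meta release layout: tokenizer.model at the top
    level and a per-size folder containing params.json. Returns a list of
    missing files (empty if the layout looks complete).
    """
    missing = []
    if not os.path.isfile(os.path.join(input_dir, "tokenizer.model")):
        missing.append("tokenizer.model")
    if not os.path.isfile(os.path.join(input_dir, model_size, "params.json")):
        missing.append(os.path.join(model_size, "params.json"))
    return missing
```

Running this against the `--input_dir` you pass to the script would surface an absent `tokenizer.model` immediately, which is the file the sentencepiece error points at.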
Expected behavior
It should generate the converted weights. Instead, it produces this error:
Loading the checkpoint in a Llama model.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 41/41 [00:17<00:00, 2.35it/s]
Saving in the Transformers format.
Saving a LlamaTokenizerFast to /home/test-converted.
Traceback (most recent call last):
File "/home/transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py", line 278, in
facing the same issue.
Hey! Thanks for reporting I'll investigate this!
I have the same issue when I use the latest version of torch.
I did not find the solution, but if someone wants to download the weights, the following link has all the versions:
https://huggingface.co/elinas
Okay, we updated the conversion script, which should fix most issues. I downloaded the tokenizer model, re-tried the conversion, and did not have any issue. Make sure you are using the latest transformers version.
I tried with the latest code from the main branch, but I am still getting the same issue.
I am getting the same error message when running the conversion for the 7B model. I tried installing the latest version (4.29.2), but the error persists. Same traceback as @dittops, but mine has nicer formatting.
Again, the issue is most probably with the tokenizer file that you are using, which is outdated. Yes, you need to upgrade to the latest transformers version, but you also need to use the original sentencepiece model for the conversion to work properly!
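One way to check whether a local tokenizer.model matches a known-good copy (e.g. a freshly re-downloaded one) is to compare file checksums. This is a minimal sketch; the reference digest you compare against is something you would have to obtain yourself from a copy you trust:

```python
import hashlib

def sha256sum(path, chunk_size=8192):
    """Return the SHA-256 hex digest of a file, read in chunks so large
    files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Example usage (the digest to compare against is not provided here):
# sha256sum("/path/to/tokenizer.model") == "<known-good digest>"
```

If the digests differ, you are converting with a stale or truncated tokenizer file, which would explain the sentencepiece load failure.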
Thanks for following up. I have the llama weights/tokenizer that were updated on 3/26/23. Isn't that the latest version of the tokenizer?
Also I'm not sure what you mean by the original sentencepiece model (unless you mean the model from prior to the 3/26 update).
When you say:
I have the llama weights/tokenizer that were updated on 3/26/23
do you mean the META weights and tokenizer? Otherwise can you share a notebook with a reproducer? The issue with llama is that a PR was made too early and thus lots of checkpoints and previous tokenizers (meaning hf tokenizers json) are incorrect.
@ArthurZucker I have the META weights and tokenizer. The issue I shared is with those. For sentencepiece, is there a specific version that should be used?
I have the llama weights/tokenizer that were updated on 3/26/23
do you mean the META weights and tokenizer? Otherwise can you share a notebook with a reproducer? The issue with llama is that a PR was made too early and thus lots of checkpoints and previous tokenizers (meaning hf tokenizers json) are incorrect.
Ah I see. The llama weights I have come from Meta's torrent PR. I did not get them from HuggingFace, if you are referring to this PR.
Ok 👍🏻 I'll give it another go, but I remember trying with those exact weights and getting a correct conversion. Will get back to you soon!
Would you mind sending me the file via google drive? The torrent link seems down
The torrent is showing as up for me right now, but if it isn't working for you I am happy to send you a copy of the 7B folder I am using. The entire folder for the 7B model is ~13-14GB. I'm trying to compress it right now but it will take a little bit to finish.
Just the tokenizer files are enough!
Email sent!
@egoetz were you able to solve this issue?
@egoetz told me that installing Git LFS and using the tokenizer at huggyllama/llama-7b worked.
I received the email but could not access the files, as they were shared via a private mail provider rather than Google Drive 😅
If you are trying to convert the original model (by that I mean going from the spm model to transformers), make sure you have the latest version of transformers.
I was able to resolve it by replacing tokenizer.model with one from Hugging Face. Thank you!
I'm not sure I understand. If you are trying to convert a checkpoint/tokenizer, then you don't need to use an already converted one. The script is to go from the original tokenizer to the HF format.
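Part of the confusion in this thread is between the original sentencepiece tokenizer.model (a binary protobuf) and an already-converted Hugging Face tokenizer.json (plain JSON text). A crude heuristic for telling the two apart is that JSON files start with `{` while the protobuf is binary; the helper below is a hypothetical sketch, not part of transformers:

```python
def looks_like_hf_json(path):
    """Rough heuristic: a converted HF tokenizer.json is JSON text that
    starts with '{', while a sentencepiece tokenizer.model is a binary
    protobuf and starts with other bytes."""
    with open(path, "rb") as f:
        first = f.read(1)
    return first == b"{"
```

If the file you feed to the conversion script turns out to be JSON, you have an already-converted tokenizer, and the script expects the original sentencepiece model instead.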
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.