llama-recipes
llama 3 multilingual recipe
🚀 The feature, motivation and pitch
The current multilingual recipes are for Llama 2. I would like to see Llama 3 multilingual recipes added.
Thank you.
Alternatives
No response
Additional context
Adding multilingual tokens via the Hugging Face tokenizer does not work.
I followed the documentation here: https://huggingface.co/learn/nlp-course/chapter6/2
@woohwan thanks for the feature request. Just note that the e2e recipe is geared more toward showing the process. Would you be interested in contributing a Llama 3 case study?
Sorry, I'm a newbie in the LLM field.
@HamidShojanazeri I am also interested in merging the Llama 3 tokenizer with a new custom tokenizer that I trained from scratch. I understand that the Llama 1 and 2 tokenizers are based on sentencepiece, and the current llama-recipes also provide code to merge two sentencepiece tokenizers. However, the Llama 3 tokenizer is based on tiktoken, and there is no official training script available to train a tiktoken tokenizer, let alone merge two of them. Can you help with the code or point me in the right direction on how to merge two tiktoken-based tokenizers? Thanks in advance.
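For anyone reading along, here is a rough sketch of one way the merge could work. It is not an official method. It assumes each tokenizer's vocabulary is available as a plain dict mapping token bytes to an integer rank, which is how tiktoken-style vocabularies are usually represented. The function name merge_ranks and the toy vocabularies are made up for illustration.

```python
def merge_ranks(base_ranks, extra_ranks):
    """Append tokens from extra_ranks that base_ranks lacks.

    Hypothetical helper: both arguments are assumed to be dicts mapping
    token bytes to an integer rank. Existing base IDs are never changed.
    """
    merged = dict(base_ranks)
    next_id = max(merged.values(), default=-1) + 1
    # Walk the extra vocabulary in its own rank order so the new tokens
    # keep their relative merge priority among themselves.
    for token, _ in sorted(extra_ranks.items(), key=lambda kv: kv[1]):
        if token not in merged:
            merged[token] = next_id
            next_id += 1
    return merged


# Toy vocabularies, invented for this example.
base = {b"a": 0, b"b": 1, b"ab": 2}
extra = {b"a": 0, b"c": 1, b"ac": 2, b"ab": 3}
merged = merge_ranks(base, extra)
```

Some caveats. In a rank-based BPE the rank also acts as merge priority, so appended tokens always merge after every base token, and a long new token may never be produced if its intermediate pieces are missing. Any special tokens that sit right after the regular vocabulary would need to be renumbered, and the model's embedding table would have to be resized and trained for the new IDs.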
Hi! Here is the recipe for multilingual! Please take a look and let me know if there are any questions!