
llama 3 multilingual recipe

Open woohwan opened this issue 9 months ago • 4 comments

🚀 The feature, motivation and pitch

The current multilingual recipes are for Llama 2. I would like to see Llama 3 multilingual recipes added.

Thank you.

Alternatives

No response

Additional context

Adding multilingual tokens via the Hugging Face tokenizer does not work.

I followed the documentation below. https://huggingface.co/learn/nlp-course/chapter6/2

woohwan, May 13 '24 10:05

@woohwan thanks for the feature request. Just note that the e2e recipe is geared more toward showing the process. I wonder if you would be interested in contributing a Llama 3 case study?

HamidShojanazeri, May 15 '24 20:05

Sorry, I'm a newbie in the LLM field.

woohwan, May 15 '24 23:05

@HamidShojanazeri I am also interested in merging the Llama 3 tokenizer with a new custom tokenizer that I trained from scratch. I understand that the Llama 1 and 2 tokenizers are based on SentencePiece, and that the current llama-recipes also provide code to merge two SentencePiece tokenizers. However, the Llama 3 tokenizer is based on tiktoken, and there is no official script available to train a tiktoken tokenizer, let alone merge two of them. Can you help with the code or point me in the right direction on how to merge two tiktoken-based tokenizers? Thanks in advance.
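As a rough illustration of what such a merge could involve at the vocabulary level (not an official method, and not taken from llama-recipes): tiktoken describes a BPE vocabulary as a table mapping token bytes to an integer rank. The sketch below appends tokens from a second table to a base table without changing any existing id. The function name `extend_ranks`, the `first_new_id` parameter, and the toy tables are hypothetical; real vocabularies would be loaded from the tokenizer files.

```python
def extend_ranks(base_ranks, extra_ranks, first_new_id=None):
    """Append tokens from extra_ranks that are missing in base_ranks.

    Both arguments map token bytes -> integer rank. Existing ids are kept
    unchanged so pretrained embeddings still line up; new tokens get fresh ids.
    first_new_id is a hypothetical knob for skipping over reserved ids
    (for example, ids already used by special tokens).
    """
    merged = dict(base_ranks)
    next_id = max(merged.values(), default=-1) + 1
    if first_new_id is not None:
        next_id = max(next_id, first_new_id)
    # Walk the extra vocabulary in its own rank order so the relative
    # priority among the newly added tokens is preserved.
    for token, _rank in sorted(extra_ranks.items(), key=lambda kv: kv[1]):
        if token not in merged:
            merged[token] = next_id
            next_id += 1
    return merged


# Toy tables standing in for real vocabularies (made-up data).
base = {b"a": 0, b"b": 1, b"ab": 2}
extra = {b"a": 0, b"\xea\xb0": 1, b"\xea\xb0\x80": 2}
merged = extend_ranks(base, extra)
print(len(merged), merged[b"\xea\xb0\x80"])
```

This only covers id assignment. Whether the combined table encodes text as intended depends on BPE merge order and would need to be checked on real data, and the model's embedding matrix would still have to be resized and trained for the new ids.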

savanth14, May 28 '24 09:05

Hi! Here is the recipe for multilingual! Please take a look and let me know if there are any questions!

wukaixingxp, Jun 03 '24 21:06