Meta-Llama-3-8B-Instruct does not appear to have a file named tokenizer.model

Open THUchenzhou opened this issue 10 months ago • 7 comments

Meta-Llama-3-8B does not appear to have a file named tokenizer.model. How can I generate the tokenizer.model file?

THUchenzhou, Apr 19 '24 05:04

It's in the original folder, because the transformers-compatible version only needs tokenizer.json 🤗

ArthurZucker, Apr 19 '24 06:04

Thanks!

THUchenzhou, Apr 19 '24 13:04

It is in the original folder, but it does not seem to be valid. Any idea?

dejankocic, Apr 19 '24 13:04

@dejankocic The Llama 3 tokenizer is different than the one used by Llama 2. It's a BPE tokenizer built with the tiktoken library, whereas Llama 2 used sentencepiece.
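
To illustrate the difference, here is a minimal sketch of how a tiktoken-style BPE ranks file can be read with only the standard library. It assumes the usual tiktoken layout, where each line holds a base64-encoded token followed by its integer rank. The function name `parse_tiktoken_ranks` and the sample data are hypothetical, made up for this example, and are not real Llama 3 vocabulary.

```python
import base64


def parse_tiktoken_ranks(text: str) -> dict[bytes, int]:
    """Parse 'base64_token rank' lines into a {token_bytes: rank} mapping."""
    ranks = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        token_b64, rank = line.split()
        ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks


# Tiny hand-made sample in the assumed layout (not real Llama 3 data).
sample = "\n".join(
    f"{base64.b64encode(tok).decode('ascii')} {i}"
    for i, tok in enumerate([b"a", b"b", b"ab"])
)
ranks = parse_tiktoken_ranks(sample)
print(ranks)
```

A sentencepiece model, by contrast, is a binary file, so a loader that expects one will typically reject a text file in this layout.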

pcuenca, Apr 19 '24 16:04

> @dejankocic The Llama 3 tokenizer is different than the one used by Llama 2. It's a BPE tokenizer built with the tiktoken library, whereas Llama 2 used sentencepiece.

I am fine with everything that is inside the repo I downloaded. The file found in the original repo does not look valid on the first start, and I haven't changed anything.

dejankocic, Apr 19 '24 17:04

> It's in the original folder. Because the transformers compatible version only needs tokenizer.json 🤗

It seems the tokenizer.model within the provided directory is encountering issues and fails to load properly. I'm encountering this challenge while attempting to utilize it for training with Megatron-LM. Could you kindly offer a resolution or guidance on how to address this predicament?

SDsly, May 10 '24 05:05

I have no idea what Megatron-LM uses to load the tokenizer, but if Megatron-LM relies on sentencepiece, there is nothing I can do to help, as converting anything to a sentencepiece format is pretty much impossible.
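
For anyone who wants to check which kind of file they have before handing it to a sentencepiece-based loader, a rough heuristic could look like the sketch below. It assumes a tiktoken-style file is ASCII text with one base64 token and one integer rank per line, and it treats anything else as "not tiktoken". The helper name `looks_like_tiktoken_ranks` is hypothetical, and the check says nothing about whether a file is a valid sentencepiece model.

```python
import base64
import binascii


def looks_like_tiktoken_ranks(data: bytes, max_lines: int = 20) -> bool:
    """Heuristic: True if the first non-blank lines are 'base64 integer' pairs."""
    try:
        text = data.decode("ascii")
    except UnicodeDecodeError:
        return False  # non-ASCII bytes, so probably a binary file
    lines = [ln for ln in text.splitlines() if ln.strip()][:max_lines]
    if not lines:
        return False
    for ln in lines:
        parts = ln.split()
        if len(parts) != 2 or not parts[1].isdigit():
            return False
        try:
            base64.b64decode(parts[0], validate=True)
        except (binascii.Error, ValueError):
            return False
    return True


print(looks_like_tiktoken_ranks(b"YQ== 0\nYg== 1\n"))
print(looks_like_tiktoken_ranks(b"\x0a\x0e\xff\x00binary"))
```

To apply it to a downloaded file, read the file in binary mode and pass the bytes to the function.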

ArthurZucker, May 10 '24 07:05