
Is download.sh providing the correct tokenizer.model files?

Open: dveb8886 opened this issue 1 year ago

When I try to run a model ..

torchrun example_js.py \
    --ckpt_dir CodeLlama-13b-Instruct \
    --tokenizer_path CodeLlama-13b-Instruct/tokenizer.model \
    --max_seq_len 1024 --max_batch_size 4 --nproc_per_node 2

example_js.py is the same as the provided example_completion.py, but with different prompts.

... I get this error:

RuntimeError: Error(s) in loading state_dict for Transformer:
	size mismatch for tok_embeddings.weight: copying a param with shape torch.Size([32016, 2560]) from checkpoint, the shape in current model is torch.Size([32000, 5120]).
	size mismatch for layers.0.attention.wq.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([5120, 5120]).
	size mismatch for layers.0.attention.wk.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([5120, 5120]).
	size mismatch for layers.0.attention.wv.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([5120, 5120]).
	size mismatch for layers.0.attention.wo.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([5120, 5120]).
...

This code works perfectly fine if I use the 7b model and tokenizer

Investigating a bit further, I noticed this:

md5sum CodeLlama-7b-Instruct/tokenizer.model
9e597e72392fd4005529a33f2bf708ba  CodeLlama-7b-Instruct/tokenizer.model
md5sum CodeLlama-13b-Instruct/tokenizer.model
9e597e72392fd4005529a33f2bf708ba  CodeLlama-13b-Instruct/tokenizer.model
md5sum CodeLlama-34b-Instruct/tokenizer.model
eeec4125e9c7560836b4873b6f8e3025  CodeLlama-34b-Instruct/tokenizer.model

The tokenizers for 7b and 13b are identical? That seems unlikely.
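
For what it's worth, a quick way to sanity-check this, assuming the files are ordinary SentencePiece models, is to compare their vocab sizes rather than just their checksums. A minimal sketch using the local download paths above (requires `pip install sentencepiece`):

from sentencepiece import SentencePieceProcessor

# Sketch: print the vocab size of each downloaded tokenizer
# (assumption: they are standard SentencePiece model files).
for path in [
    "CodeLlama-7b-Instruct/tokenizer.model",
    "CodeLlama-13b-Instruct/tokenizer.model",
    "CodeLlama-34b-Instruct/tokenizer.model",
]:
    sp = SentencePieceProcessor(model_file=path)
    print(path, sp.vocab_size())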

I also attempted these variants of torchrun, just to see what would happen:

torchrun --ckpt_dir CodeLlama-13b-Instruct --tokenizer_path CodeLlama-34b-Instruct/tokenizer.model 
torchrun --ckpt_dir CodeLlama-34b-Instruct --tokenizer_path CodeLlama-34b-Instruct/tokenizer.model --nproc_per_node 4
  • These produced the same errors, but with different numbers

On another note, the --nproc_per_node value is provided to the commands just in case (as the docs say it's needed), but in practice I find it has no effect. I was forced to modify the code that builds the model like so:

generator = Llama.build(
        ckpt_dir=ckpt_dir,
        tokenizer_path=tokenizer_path,
        max_seq_len=max_seq_len,
        max_batch_size=max_batch_size,
        # Added this, value is 2 for 13b and 4 for 34b
        model_parallel_size=2,
    )
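
As a cross-check, the value appears to match the number of checkpoint shards in each model directory. A minimal sketch, assuming the shards are the *.pth files that download.sh places in the download folder:

from pathlib import Path

# Count the sharded checkpoint files in a model directory
# (assumption: the shards are the *.pth files in the download folder).
ckpt_dir = "CodeLlama-13b-Instruct"
num_shards = len(list(Path(ckpt_dir).glob("*.pth")))
# e.g. 2 for 13b and 4 for 34b, matching the values above
print(f"{ckpt_dir}: {num_shards} shard(s) -> model_parallel_size={num_shards}")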

I'm on an M1 MacBook Pro with 64 GB of RAM.

dveb8886 avatar Nov 13 '23 23:11 dveb8886

I encountered the same problem: CodeLlama-7b-Instruct works, but CodeLlama-13b-Instruct and CodeLlama-34b-Instruct fail. I manually set model_parallel_size=3 for 13b and 4 for 34b, but I still get the size mismatch error.

JoshuaChou2018 avatar Dec 12 '23 09:12 JoshuaChou2018

Sorry for replying so late, but just to clarify, the 34b model uses a different tokenizer as it was not trained with fill-in-the-middle capabilities.

For the commands you provided, --nproc_per_node needs to be passed to torchrun, but by appending it to the rest of the command it gets passed to example_js.py instead. The current version of the code will warn you about any model parallel mismatches at runtime. This command works for me:

torchrun --nproc_per_node=2 example_instructions.py \
    --ckpt_dir CodeLlama-13b-Instruct \
    --tokenizer_path CodeLlama-13b-Instruct/tokenizer.model 

jgehring avatar Feb 28 '24 07:02 jgehring