RuntimeError: Error(s) in loading state_dict for MistralForCausalLM (Deepspeed Zero 3)
Please check that this issue hasn't been reported before.
- [X] I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
I fine-tune a Mistral model with the default zero3.json, and training finishes without error. Afterwards, I expect to be able to load the fine-tuned model using
model = transformers.AutoModelForCausalLM.from_pretrained('test')
My accelerate config is
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: deepspeed/zero3.json
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Current behaviour
model = transformers.AutoModelForCausalLM.from_pretrained('test')
yields the error
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model initialized on CPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Traceback (most recent call last):
File "/lustre/home/rolmedo/lllm/longcontext/evaluate_winning_single.py", line 159, in <module>
tokenizer, model = load_tokenizer_model(args.model_dir, use_flash_attention_2=True)
File "/lustre/home/rolmedo/lllm/longcontext/evaluate_winning_single.py", line 21, in load_tokenizer_model
model = transformers.AutoModelForCausalLM.from_pretrained(model_name,
File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
return model_class.from_pretrained(
File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3307, in from_pretrained
) = cls._load_pretrained_model(
File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3756, in _load_pretrained_model
raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for MistralForCausalLM:
size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([32002, 4096]).
You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
and
model = transformers.AutoModelForCausalLM.from_pretrained('test',
device_map='auto',
torch_dtype=torch.bfloat16,
trust_remote_code=True,
low_cpu_mem_usage=True)
yields the error
Traceback (most recent call last):
File "/lustre/home/rolmedo/lllm/longcontext/evaluate_winning_single.py", line 159, in <module>
tokenizer, model = load_tokenizer_model(args.model_dir, use_flash_attention_2=True)
File "/lustre/home/rolmedo/lllm/longcontext/evaluate_winning_single.py", line 21, in load_tokenizer_model
model = transformers.AutoModelForCausalLM.from_pretrained(model_name,
File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
return model_class.from_pretrained(
File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3307, in from_pretrained
) = cls._load_pretrained_model(
File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3695, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/modeling_utils.py", line 741, in _load_state_dict_into_meta_model
set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 285, in set_module_tensor_to_device
raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([0]) in "weight" (which has shape torch.Size([32002, 4096])), this look incorrect.
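Both failures point at the same symptom: the checkpoint on disk contains a zero-element model.embed_tokens.weight, which typically means the weights were written while still partitioned under ZeRO-3. A quick way to check is to list the tensor shapes stored in the output folder; this is a sketch assuming the weights were saved as .safetensors files in test/ (the helper loop below is illustrative, not part of axolotl or transformers):

```python
# List every tensor stored in the saved shards and flag zero-element ones.
# Assumes the fine-tuned model was written to "test/" as .safetensors files.
import glob

from safetensors import safe_open

for shard in sorted(glob.glob("test/*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        for name in f.keys():
            shape = tuple(f.get_slice(name).get_shape())
            if 0 in shape or not shape:
                print(f"{shard}: {name} has shape {shape}")
```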
Steps to reproduce
accelerate launch -m axolotl.cli.train mistral_config.yml --deepspeed deepspeed/zero3.json
and thereafter
model = transformers.AutoModelForCausalLM.from_pretrained('test')
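For reference, the loading step as a self-contained script; the path is the output_dir from the config below, and torch_dtype is optional here:

```python
# Self-contained version of the loading step that triggers the RuntimeError.
# "test" is the output_dir from the training config; torch_dtype is optional.
import torch
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained(
    "test",
    torch_dtype=torch.bfloat16,
)
tokenizer = transformers.AutoTokenizer.from_pretrained("test")
```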
Config yaml
base_model: model_dir/mistral-7b-v0.1/
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true
load_in_8bit: false
load_in_4bit: false
strict: false
datasets:
  - path: dset_dir/slim-orca/slim-orca.jsonl
    type: sharegpt
    ds_type: json
    conversation: chatml
dataset_prepared_path: prep-datasets/
val_set_size: 0
output_dir: test/
sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true
wandb_project: orca
wandb_entity:
wandb_watch:
wandb_run_id: mistral-slimorca
wandb_log_model:
gradient_accumulation_steps: 1
micro_batch_size: 6
num_epochs: 4
optimizer: adamw_torch_fused
adam_beta1: 0.9
adam_beta2: 0.95
adam_epsilon: 0.00001
max_grad_norm: 1.0 # gradient clipping max norm
lr_scheduler: cosine
learning_rate: 0.00002
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
warmup_steps: 100
eval_steps: 0
eval_table_size:
eval_table_max_new_tokens:
save_steps: 0.9999
debug:
deepspeed:
weight_decay: 0.1
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "<|im_end|>"
  unk_token: "<unk>"
tokens:
  - "<|im_start|>"
  - "<|im_end|>"
Possible solution
Seems related to #705 and #709
Which Operating Systems are you using?
- [X] Linux
- [ ] macOS
- [ ] Windows
Python Version
3.10
axolotl branch-commit
main/3e3229e2d99bb509784ac72e6589f8a8e406247f
Acknowledgements
- [X] My issue title is concise, descriptive, and in title casing.
- [X] I have searched the existing issues to make sure this bug has not been reported yet.
- [X] I am using the latest version of axolotl.
- [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.
Are you using a model from a checkpoint folder or the output folder?
From the output folder
File "<stdin>", line 1, in <module>
File "/lustre/home/rolmedo/axo/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
return model_class.from_pretrained(
File "/lustre/home/rolmedo/axo/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3480, in from_pretrained
) = cls._load_pretrained_model(
File "/lustre/home/rolmedo/axo/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3931, in _load_pretrained_model
raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for MistralForCausalLM:
size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([32002, 4096]).
You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
I can confirm that I only experience this issue when using Zero3, and Zero 2 works fine.
I just ran into the same error; I can confirm that switching from zero3 to zero2 "solved" the issue.
Using transformers @ git+https://github.com/huggingface/transformers.git@3cefac1d974db5e2825a0cb2b842883a628be7a0 seems to work.
@mgoulao is this a transformers regression then? That particular commit works with zero3?
Yes, it does work with ZeRO 3; however, you will get this problem: #1035
I had the same error; the transformers commit above fixes it, but now I get this one:
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 813, in _load_state_dict_into_meta_model
set_module_quantized_tensor_to_device(model, param_name, param_device, value=param)
File "/usr/local/lib/python3.10/dist-packages/transformers/integrations/bitsandbytes.py", line 128, in set_module_quantized_tensor_to_device
new_value = value.to(device)
NotImplementedError: Cannot copy out of meta tensor; no data!
I can confirm the same error when finetuning Mistral with chatml format and deepspeed3.
loading model
Traceback (most recent call last):
File "/home/ubuntu/llm_recipes/scripts/push2hub.py", line 33, in <module>
model = AutoModelForCausalLM.from_pretrained(config.model_path, torch_dtype=getattr(torch, config.torch_dtype))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniforge3/envs/pt/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniforge3/envs/pt/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3502, in from_pretrained
) = cls._load_pretrained_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniforge3/envs/pt/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3977, in _load_pretrained_model
raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for MistralForCausalLM:
size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([32002, 4096]).
You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
This post is old; I think there is no solution: you simply cannot use QLoRA with DeepSpeed ZeRO 3. Fortunately, there is now a quite good alternative that was recently implemented in Axolotl, which involves FSDP (full shard) + QLoRA. Link
The most viable solution I found was to use a non-quantized LoRA with DeepSpeed ZeRO 3.
Apart from that, I believe that as of today there is no way to load QLoRA adapters with DeepSpeed Stage 3.
I hope I'm wrong, but all the definitive answers I found online were basically these.
This issue is about a full fine-tune; no LoRA is involved.
I am doing a full fine-tune, no QLoRA.
+1 Zero3_bf16 + Full-finetune
RuntimeError: Error(s) in loading state_dict for MistralModel:
size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([32006, 4096]).
You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
EDIT - Can confirm zero2 works
I encountered this too, although mine was with Llama 3 + zero3. The model safetensors were being output as shards, but there was also a model.safetensors file that HF seems to load by default, even though it's not included in the index.json. Once I (re)moved the model.safetensors file, the model seems to have loaded successfully.
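A small sketch of that workaround for anyone else hitting it; it assumes the sharded files plus model.safetensors.index.json are the ones you actually want loaded, and the folder name is illustrative:

```python
# Workaround sketch for the situation described above: the output folder has
# sharded weights plus an index, but also a stray single-file model.safetensors
# that gets picked up instead. Move the stray file aside so loading uses the index.
from pathlib import Path

out_dir = Path("test")  # illustrative; use your output_dir
stray = out_dir / "model.safetensors"
index = out_dir / "model.safetensors.index.json"
shards = list(out_dir.glob("model-*-of-*.safetensors"))

if stray.exists() and index.exists() and shards:
    stray.rename(out_dir / "model.safetensors.bak")  # rename rather than delete
    print(f"Moved {stray} aside; {len(shards)} shard(s) remain indexed.")
```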