RuntimeError: Error(s) in loading state_dict for MistralForCausalLM (Deepspeed Zero 3)
Please check that this issue hasn't been reported before.
- [X] I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
I fine-tune a Mistral model with the default zero3.json, and training finishes without error. Afterwards, I expect to be able to load the fine-tuned model using
model = transformers.AutoModelForCausalLM.from_pretrained('test')
My accelerate config is
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: deepspeed/zero3.json
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Current behaviour
model = transformers.AutoModelForCausalLM.from_pretrained('test')
yields the error
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model initialized on CPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Traceback (most recent call last):
File "/lustre/home/rolmedo/lllm/longcontext/evaluate_winning_single.py", line 159, in <module>
tokenizer, model = load_tokenizer_model(args.model_dir, use_flash_attention_2=True)
File "/lustre/home/rolmedo/lllm/longcontext/evaluate_winning_single.py", line 21, in load_tokenizer_model
model = transformers.AutoModelForCausalLM.from_pretrained(model_name,
File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
return model_class.from_pretrained(
File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3307, in from_pretrained
) = cls._load_pretrained_model(
File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3756, in _load_pretrained_model
raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for MistralForCausalLM:
size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([32002, 4096]).
You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
and
model = transformers.AutoModelForCausalLM.from_pretrained('test',
device_map='auto',
torch_dtype=torch.bfloat16,
trust_remote_code=True,
low_cpu_mem_usage=True)
yields the error
Traceback (most recent call last):
File "/lustre/home/rolmedo/lllm/longcontext/evaluate_winning_single.py", line 159, in <module>
tokenizer, model = load_tokenizer_model(args.model_dir, use_flash_attention_2=True)
File "/lustre/home/rolmedo/lllm/longcontext/evaluate_winning_single.py", line 21, in load_tokenizer_model
model = transformers.AutoModelForCausalLM.from_pretrained(model_name,
File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
return model_class.from_pretrained(
File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3307, in from_pretrained
) = cls._load_pretrained_model(
File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3695, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/modeling_utils.py", line 741, in _load_state_dict_into_meta_model
set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 285, in set_module_tensor_to_device
raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([0]) in "weight" (which has shape torch.Size([32002, 4096])), this look incorrect.
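Both failures point at the same symptom: the checkpoint on disk contains a zero-element model.embed_tokens.weight, which typically means the weights were written while still partitioned under ZeRO-3. A quick way to check is to list the tensor shapes stored in the output folder; this is a sketch assuming the weights were saved as .safetensors files in test/ (the helper loop below is illustrative, not part of axolotl or transformers):

```python
# List every tensor stored in the saved shards and flag zero-element ones.
# Assumes the fine-tuned model was written to "test/" as .safetensors files.
import glob

from safetensors import safe_open

for shard in sorted(glob.glob("test/*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        for name in f.keys():
            shape = tuple(f.get_slice(name).get_shape())
            if 0 in shape or not shape:
                print(f"{shard}: {name} has shape {shape}")
```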
Steps to reproduce
accelerate launch -m axolotl.cli.train mistral_config.yml --deepspeed deepspeed/zero3.json
and thereafter
model = transformers.AutoModelForCausalLM.from_pretrained('test')
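For reference, the loading step as a self-contained script; the path is the output_dir from the config below, and torch_dtype is optional here:

```python
# Self-contained version of the loading step that triggers the RuntimeError.
# "test" is the output_dir from the training config; torch_dtype is optional.
import torch
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained(
    "test",
    torch_dtype=torch.bfloat16,
)
tokenizer = transformers.AutoTokenizer.from_pretrained("test")
```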
Config yaml
base_model: model_dir/mistral-7b-v0.1/
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true
load_in_8bit: false
load_in_4bit: false
strict: false
datasets:
  - path: dset_dir/slim-orca/slim-orca.jsonl
    type: sharegpt
    ds_type: json
    conversation: chatml
dataset_prepared_path: prep-datasets/
val_set_size: 0
output_dir: test/
sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true
wandb_project: orca
wandb_entity:
wandb_watch:
wandb_run_id: mistral-slimorca
wandb_log_model:
gradient_accumulation_steps: 1
micro_batch_size: 6
num_epochs: 4
optimizer: adamw_torch_fused
adam_beta1: 0.9
adam_beta2: 0.95
adam_epsilon: 0.00001
max_grad_norm: 1.0 # gradient clipping max norm
lr_scheduler: cosine
learning_rate: 0.00002
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
warmup_steps: 100
eval_steps: 0
eval_table_size:
eval_table_max_new_tokens:
save_steps: 0.9999
debug:
deepspeed:
weight_decay: 0.1
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "<|im_end|>"
  unk_token: "<unk>"
tokens:
  - "<|im_start|>"
  - "<|im_end|>"
Possible solution
Seems related to #705 and #709
Which Operating Systems are you using?
- [X] Linux
- [ ] macOS
- [ ] Windows
Python Version
3.10
axolotl branch-commit
main/3e3229e2d99bb509784ac72e6589f8a8e406247f
Acknowledgements
- [X] My issue title is concise, descriptive, and in title casing.
- [X] I have searched the existing issues to make sure this bug has not been reported yet.
- [X] I am using the latest version of axolotl.
- [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.
Are you using a model from a checkpoint folder or the output folder?
From the output folder
File "<stdin>", line 1, in <module>
File "/lustre/home/rolmedo/axo/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
return model_class.from_pretrained(
File "/lustre/home/rolmedo/axo/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3480, in from_pretrained
) = cls._load_pretrained_model(
File "/lustre/home/rolmedo/axo/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3931, in _load_pretrained_model
raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for MistralForCausalLM:
size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([32002, 4096]).
You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
I can confirm that I only experience this issue when using Zero3, and Zero 2 works fine.
I just ran into the same error; I can confirm that switching from zero3 to zero2 "solved" the issue.
Using transformers @ git+https://github.com/huggingface/transformers.git@3cefac1d974db5e2825a0cb2b842883a628be7a0 seems to work.
@mgoulao is this a transformers regression then? That particular commit works with zero3?
Yes, it does work with ZeRO 3; however, you will get this problem: #1035
I had the same error; the transformers commit above fixes it, but now I get this one:
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 813, in _load_state_dict_into_meta_model
set_module_quantized_tensor_to_device(model, param_name, param_device, value=param)
File "/usr/local/lib/python3.10/dist-packages/transformers/integrations/bitsandbytes.py", line 128, in set_module_quantized_tensor_to_device
new_value = value.to(device)
NotImplementedError: Cannot copy out of meta tensor; no data!
I can confirm the same error when finetuning Mistral with chatml format and deepspeed3.
loading model
Traceback (most recent call last):
File "/home/ubuntu/llm_recipes/scripts/push2hub.py", line 33, in <module>
model = AutoModelForCausalLM.from_pretrained(config.model_path, torch_dtype=getattr(torch, config.torch_dtype))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniforge3/envs/pt/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniforge3/envs/pt/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3502, in from_pretrained
) = cls._load_pretrained_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/miniforge3/envs/pt/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3977, in _load_pretrained_model
raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for MistralForCausalLM:
size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([32002, 4096]).
You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
This post is old; I think there is no solution: you simply cannot use QLoRA with DeepSpeed ZeRO 3. Fortunately, there is now a quite good alternative that was recently implemented in Axolotl, which involves FSDP (full shard) + QLoRA. Link
The most viable solution I found was to use a non-quantized LoRA with DeepSpeed ZeRO 3.
Apart from that, I believe that as of today there is no way to load QLoRA adapters with DeepSpeed Stage 3.
I hope I'm wrong, but all the definitive answers I found online were basically these.
This issue is about a full fine-tune; no LoRA is involved.
I am doing a full fine-tune, no QLoRA.
+1 Zero3_bf16 + Full-finetune
RuntimeError: Error(s) in loading state_dict for MistralModel:
size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([32006, 4096]).
You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
EDIT - Can confirm zero2 works
I encountered this too, although mine was with Llama 3 + zero3. The model safetensors were being output as shards, but there was also a model.safetensors file that HF seems to load by default, even though it's not included in the index.json. Once I (re)moved the model.safetensors file, the model seems to have loaded successfully.
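A small sketch of that workaround for anyone else hitting it; it assumes the sharded files plus model.safetensors.index.json are the ones you actually want loaded, and the folder name is illustrative:

```python
# Workaround sketch for the situation described above: the output folder has
# sharded weights plus an index, but also a stray single-file model.safetensors
# that gets picked up instead. Move the stray file aside so loading uses the index.
from pathlib import Path

out_dir = Path("test")  # illustrative; use your output_dir
stray = out_dir / "model.safetensors"
index = out_dir / "model.safetensors.index.json"
shards = list(out_dir.glob("model-*-of-*.safetensors"))

if stray.exists() and index.exists() and shards:
    stray.rename(out_dir / "model.safetensors.bak")  # rename rather than delete
    print(f"Moved {stray} aside; {len(shards)} shard(s) remain indexed.")
```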