DeepSpeed RuntimeError: Error(s) in loading state

RuntimeError: Error(s) in loading state_dict

Open lxd551326 opened this issue 1 year ago • 5 comments

Describe the bug

I can train Qwen1.5-7B with plain PyTorch, but when I use DeepSpeed I get a CUDA out-of-memory error. My ZeRO-2 config is:

{
"fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
},

"bf16": {
    "enabled": "auto"
},

"optimizer": {
    "type": "AdamW",
    "params": {
        "lr": "auto",
        "betas": "auto",
        "eps": "auto",
        "weight_decay": "auto"
    }
},

"scheduler": {
    "type": "WarmupLR",
    "params": {
        "warmup_min_lr": "auto",
        "warmup_max_lr": "auto",
        "warmup_num_steps": "auto"
    }
},

"zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
        "device": "none",
        "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true
},

"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 100,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false

}

I then changed the offload setting to

"offload_optimizer": {
    "device": "cpu",
    "pin_memory": true
},

after which I hit a different problem:

[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/ai/mydata/models/Qwen1.5-main/examples/sft/finetune.py", line 378, in <module>
[rank1]:     train()
[rank1]:   File "/home/ai/mydata/models/Qwen1.5-main/examples/sft/finetune.py", line 367, in train
[rank1]:     trainer.train(resume_from_checkpoint=True)
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
[rank1]:     return inner_training_loop(
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2063, in _inner_training_loop
[rank1]:     deepspeed_load_checkpoint(
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/transformers/integrations/deepspeed.py", line 432, in deepspeed_load_checkpoint
[rank1]:     load_path, _ = deepspeed_engine.load_checkpoint(
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2764, in load_checkpoint
[rank1]:     load_path, client_states = self._load_checkpoint(load_dir,
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2847, in _load_checkpoint
[rank1]:     self.load_module_state_dict(checkpoint=checkpoint,
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2627, in load_module_state_dict
[rank1]:     self.module.load_state_dict(
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2189, in load_state_dict
[rank1]:     raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
[rank1]: RuntimeError: Error(s) in loading state_dict for Qwen2ForCausalLM:
[rank1]:        Missing key(s) in state_dict: "model.layers.24.self_attn.q_proj.weight", "model.layers.24.self_attn.q_proj.bias", "model.layers.24.self_attn.k_proj.weight", "model.layers.24.self_attn.k_proj.bias", "model.layers.24.self_attn.v_proj.weight", "model.layers.24.self_attn.v_proj.bias", "model.layers.24.self_attn.o_proj.weight", "model.layers.24.mlp.gate_proj.weight", "model.layers.24.mlp.up_proj.weight", "model.layers.24.mlp.down_proj.weight", "model.layers.24.input_layernorm.weight", "model.layers.24.post_attention_layernorm.weight", "model.layers.25.self_attn.q_proj.weight", "model.layers.25.self_attn.q_proj.bias", "model.layers.25.self_attn.k_proj.weight", "model.layers.25.self_attn.k_proj.bias", "model.layers.25.self_attn.v_proj.weight", "model.layers.25.self_attn.v_proj.bias", "model.layers.25.self_attn.o_proj.weight", "model.layers.25.mlp.gate_proj.weight", "model.layers.25.mlp.up_proj.weight", "model.layers.25.mlp.down_proj.weight", "model.layers.25.input_layernorm.weight", "model.layers.25.post_attention_layernorm.weight", "model.layers.26.self_attn.q_proj.weight", "model.layers.26.self_attn.q_proj.bias", "model.layers.26.self_attn.k_proj.weight", "model.layers.26.self_attn.k_proj.bias", "model.layers.26.self_attn.v_proj.weight", "model.layers.26.self_attn.v_proj.bias", "model.layers.26.self_attn.o_proj.weight", "model.layers.26.mlp.gate_proj.weight", "model.layers.26.mlp.up_proj.weight", "model.layers.26.mlp.down_proj.weight", "model.layers.26.input_layernorm.weight", "model.layers.26.post_attention_layernorm.weight", "model.layers.27.self_attn.q_proj.weight", "model.layers.27.self_attn.q_proj.bias", "model.layers.27.self_attn.k_proj.weight", "model.layers.27.self_attn.k_proj.bias", "model.layers.27.self_attn.v_proj.weight", "model.layers.27.self_attn.v_proj.bias", "model.layers.27.self_attn.o_proj.weight", "model.layers.27.mlp.gate_proj.weight", "model.layers.27.mlp.up_proj.weight", "model.layers.27.mlp.down_proj.weight", 
"model.layers.27.input_layernorm.weight", "model.layers.27.post_attention_layernorm.weight", "model.layers.28.self_attn.q_proj.weight", "model.layers.28.self_attn.q_proj.bias", "model.layers.28.self_attn.k_proj.weight", "model.layers.28.self_attn.k_proj.bias", "model.layers.28.self_attn.v_proj.weight", "model.layers.28.self_attn.v_proj.bias", "model.layers.28.self_attn.o_proj.weight", "model.layers.28.mlp.gate_proj.weight", "model.layers.28.mlp.up_proj.weight", "model.layers.28.mlp.down_proj.weight", "model.layers.28.input_layernorm.weight", "model.layers.28.post_attention_layernorm.weight", "model.layers.29.self_attn.q_proj.weight", "model.layers.29.self_attn.q_proj.bias", "model.layers.29.self_attn.k_proj.weight", "model.layers.29.self_attn.k_proj.bias", "model.layers.29.self_attn.v_proj.weight", "model.layers.29.self_attn.v_proj.bias", "model.layers.29.self_attn.o_proj.weight", "model.layers.29.mlp.gate_proj.weight", "model.layers.29.mlp.up_proj.weight", "model.layers.29.mlp.down_proj.weight", "model.layers.29.input_layernorm.weight", "model.layers.29.post_attention_layernorm.weight", "model.layers.30.self_attn.q_proj.weight", "model.layers.30.self_attn.q_proj.bias", "model.layers.30.self_attn.k_proj.weight", "model.layers.30.self_attn.k_proj.bias", "model.layers.30.self_attn.v_proj.weight", "model.layers.30.self_attn.v_proj.bias", "model.layers.30.self_attn.o_proj.weight", "model.layers.30.mlp.gate_proj.weight", "model.layers.30.mlp.up_proj.weight", "model.layers.30.mlp.down_proj.weight", "model.layers.30.input_layernorm.weight", "model.layers.30.post_attention_layernorm.weight", "model.layers.31.self_attn.q_proj.weight", "model.layers.31.self_attn.q_proj.bias", "model.layers.31.self_attn.k_proj.weight", "model.layers.31.self_attn.k_proj.bias", "model.layers.31.self_attn.v_proj.weight", "model.layers.31.self_attn.v_proj.bias", "model.layers.31.self_attn.o_proj.weight", "model.layers.31.mlp.gate_proj.weight", "model.layers.31.mlp.up_proj.weight", 
"model.layers.31.mlp.down_proj.weight", "model.layers.31.input_layernorm.weight", "model.layers.31.post_attention_layernorm.weight". 
[rank1]:        size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([151936, 1024]) from checkpoint, the shape in current model is torch.Size([151936, 4096]).
[rank1]:        size mismatch for model.layers.0.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.0.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.0.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.0.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.0.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.0.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.0.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.0.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.0.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.0.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.0.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.1.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.1.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.1.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.1.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.1.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.1.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.1.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.1.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.1.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.1.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.1.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.1.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.2.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.2.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.2.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.2.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.2.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.2.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.2.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.2.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.2.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.2.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.2.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.2.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.3.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.3.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.3.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.3.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.3.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.3.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.3.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.3.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.3.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.3.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.3.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.3.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.4.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.4.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.4.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.4.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.4.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.4.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.4.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.4.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.4.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.4.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.4.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.4.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.5.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.5.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.5.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.5.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.5.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.5.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.5.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.5.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.5.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.5.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.5.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.5.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.6.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.6.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.6.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.6.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.6.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.6.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.6.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.6.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.6.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.6.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.6.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.6.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.7.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.7.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.7.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.7.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.7.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.7.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.7.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.7.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.7.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.7.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.7.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.7.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.8.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.8.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.8.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.8.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.8.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.8.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.8.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.8.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.8.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.8.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.8.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.8.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.9.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.9.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.9.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.9.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.9.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.9.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.9.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.9.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.9.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.9.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.9.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.9.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.10.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.10.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.10.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.10.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.10.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.10.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.10.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.10.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.10.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.10.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.10.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.10.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.11.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.11.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.11.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.11.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.11.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.11.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.11.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.11.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.11.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.11.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.11.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.11.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.12.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.12.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.12.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.12.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.12.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.12.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.12.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.12.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.12.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.12.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.12.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.12.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.13.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.13.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.13.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.13.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.13.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.13.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.13.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.13.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.13.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.13.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.13.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.13.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.14.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.14.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.14.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.14.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.14.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.14.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.14.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.14.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.14.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.14.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.14.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.14.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.15.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.15.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.15.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.15.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.15.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.15.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.15.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.15.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.15.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.15.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.15.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.15.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.16.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.16.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.16.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.16.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.16.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.16.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.16.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.16.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.16.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.16.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.16.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.16.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.17.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.17.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.17.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.17.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.17.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.17.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.17.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.17.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.17.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.17.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.17.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.17.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.18.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.18.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.18.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.18.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.18.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.18.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.18.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.18.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.18.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.18.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.18.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.18.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.19.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.19.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.19.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.19.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.19.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.19.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.19.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.19.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.19.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.19.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.19.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.19.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.20.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.20.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.20.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.20.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.20.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.20.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.20.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.20.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.20.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.20.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.20.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.20.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.21.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.21.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.21.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.21.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.21.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.21.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.21.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.21.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.21.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.21.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.21.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.21.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.22.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.22.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.22.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.22.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.22.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.22.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.22.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.22.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.22.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.22.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.22.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.22.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.23.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.23.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.23.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.23.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.23.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.23.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.23.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.23.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.23.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.23.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.23.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.23.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.norm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for lm_head.weight: copying a param with shape torch.Size([151936, 1024]) from checkpoint, the shape in current model is torch.Size([151936, 4096]).
W0527 01:21:26.844000 140472356030272 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1056309 closing signal SIGTERM
W0527 01:21:26.844000 140472356030272 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1056310 closing signal SIGTERM

May 27 '24 01:05 lxd551326

I have the same problem with DeepSpeed stage 3, except that the shape in the current model is reported as torch.Size([0]). Could someone please help us? T_T

Jun 15 '24 08:06 DavidYanAnDe

@lxd551326, it seems you are seeing two different issues.

CUDA OOM using DeepSpeed for a model that works with pure pytorch is very strange and should be investigated. Can you provide more repro details for that?
The checkpoint loading problem seems to be due to a mismatch between checkpoint and model definition. Can you check that it works with pytorch only?

For both of the above cases, it would be very helpful if you could provide repro steps.
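
As a quick way to look at the checkpoint/model mismatch from the log alone, the "size mismatch" lines can be grouped by shape pair. This is a minimal sketch, assuming the log lines follow the format shown in the traceback above; `summarize_mismatches` and `parse_shape` are hypothetical helper names, not part of PyTorch or DeepSpeed.

```python
import re
from collections import Counter

# Matches one "size mismatch" line in the format printed in the traceback above.
MISMATCH_RE = re.compile(
    r"size mismatch for (?P<name>\S+): copying a param with shape "
    r"torch\.Size\(\[(?P<ckpt>[\d, ]*)\]\) from checkpoint, "
    r"the shape in current model is torch\.Size\(\[(?P<model>[\d, ]*)\]\)"
)


def parse_shape(text):
    """Turn '1024, 2816' into (1024, 2816); an empty string becomes ()."""
    return tuple(int(part) for part in text.split(",") if part.strip())


def summarize_mismatches(log_text):
    """Count how often each (checkpoint shape, model shape) pair occurs."""
    pairs = Counter()
    for match in MISMATCH_RE.finditer(log_text):
        key = (parse_shape(match["ckpt"]), parse_shape(match["model"]))
        pairs[key] += 1
    return pairs


if __name__ == "__main__":
    # Two lines copied from the traceback in this issue.
    sample = (
        "[rank1]: size mismatch for model.norm.weight: copying a param with "
        "shape torch.Size([1024]) from checkpoint, the shape in current model "
        "is torch.Size([4096]).\n"
        "[rank1]: size mismatch for lm_head.weight: copying a param with "
        "shape torch.Size([151936, 1024]) from checkpoint, the shape in "
        "current model is torch.Size([151936, 4096]).\n"
    )
    for (ckpt_shape, model_shape), count in summarize_mismatches(sample).items():
        print(f"{ckpt_shape} -> {model_shape}: {count} parameter(s)")
```

In the log above, every mismatch pairs a checkpoint dimension of 1024 or 2816 with a model dimension of 4096 or 11008. That pattern suggests the checkpoint picked up by trainer.train(resume_from_checkpoint=True) may have been saved from a model with a smaller hidden size than the one currently being trained.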

Jun 15 '24 19:06 tjruwase

CUDA OOM using DeepSpeed for a model that works with pure pytorch is very strange

Have you solved the problem? I ran into it too. The shapes are correct in my program when I don't use DeepSpeed.

Jul 16 '24 14:07 lhyscau

@lhyscau, @DavidYanAnDe, and @lxd551326 are you able to provide repro steps?

Aug 03 '24 17:08 tjruwase

@lhyscau, @DavidYanAnDe, and @lxd551326 are you able to provide repro steps?

I commented out the zero_optimizer param in the ds.config file, and the error no longer happens.
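
For anyone who wants to try the same thing, here is a rough sketch of that change in Python. It assumes "zero_optimizer" refers to the "zero_optimization" section of the config posted at the top of this issue; `without_zero` is a hypothetical helper name, and the config below is a trimmed-down copy of that one.

```python
import copy
import json


def without_zero(config):
    """Return a copy of a DeepSpeed config dict with the ZeRO section removed."""
    trimmed = copy.deepcopy(config)
    # Assumption: the section meant in the comment is "zero_optimization".
    trimmed.pop("zero_optimization", None)
    return trimmed


# Trimmed-down version of the config from the original post.
ds_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

if __name__ == "__main__":
    # Print the config that would be written to the JSON file for the test run.
    print(json.dumps(without_zero(ds_config), indent=4))
```

Removing this section turns ZeRO off, so it gives up the memory savings. It is only useful for narrowing down where the error comes from.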

Aug 04 '24 05:08 lhyscau

@lhyscau, @DavidYanAnDe, and @lxd551326 are you able to provide repro steps?

I commented out the zero_optimizer param in the ds.config file, and the error no longer happens.

May I see your ds.config file? Thx

Nov 30 '24 16:11 PingchengDong
