
torchtune generate function error when model used Int4WeightOnlyQATQuantizer

Open · elfisworking opened this issue 1 year ago · 1 comment

Today I tried to use Int4WeightOnlyQATQuantizer to quantize Llama3-8B. When I run generation with the resulting checkpoint, I get the error below.
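For context, the checkpoint was produced with roughly the QAT prepare/convert flow sketched below. This is a hedged reconstruction rather than my exact invocation: the groupsize and output filename are assumptions taken from the config further down, and the actual run used the torchtune QAT recipe rather than a hand-rolled loop.

```python
# Hedged sketch of the QAT flow that produced the checkpoint; the groupsize
# and filename are assumptions, and the finetuning step is elided.
import torch
from torchtune.models.llama3 import llama3_8b
from torchtune.training.quantization import Int4WeightOnlyQATQuantizer

model = llama3_8b()
quantizer = Int4WeightOnlyQATQuantizer(groupsize=256)

model = quantizer.prepare(model)   # swap in fake-quantized linears for QAT
# ... finetune the prepared model here ...
model = quantizer.convert(model)   # swap in int4 weights + scales_and_zeros
torch.save(model.state_dict(), "meta_model_0-4w-qat-module-swap.pt")
```

Running the generate recipe against that checkpoint then fails: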

Running InferenceRecipe with resolved config:

chat_format: null
checkpointer:
  _component_: torchtune.training.FullModelTorchTuneCheckpointer
  checkpoint_dir: /QAT/output/llama3-8B/
  checkpoint_files:
  - meta_model_0-4w-qat-module-swap.pt
  model_type: LLAMA3
  output_dir: /QAT/output/llama3-8B/
device: cuda
dtype: bf16
enable_kv_cache: true
instruct_template: null
max_new_tokens: 300
model:
  _component_: torchtune.models.llama3.llama3_8b
prompt: Tell me a joke?
quantizer:
  _component_: torchtune.training.quantization.Int4WeightOnlyQuantizer
  groupsize: 256
seed: 42
temperature: 0.6
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  max_seq_len: null
  path: /QAT/Meta-Llama-3-8B/original/tokenizer.model
top_k: 1

Setting manual seed to local seed 42. Local seed is seed + rank = 42 + 0
Traceback (most recent call last):
  File "/usr/local/bin/tune", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torchtune/_cli/tune.py", line 49, in main
    parser.run(args)
  File "/usr/local/lib/python3.10/dist-packages/torchtune/_cli/tune.py", line 43, in run
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/torchtune/_cli/run.py", line 208, in _run_cmd
    self._run_single_device(args, is_builtin=is_builtin)
  File "/usr/local/lib/python3.10/dist-packages/torchtune/_cli/run.py", line 102, in _run_single_device
    runpy.run_path(str(args.recipe), run_name="__main__")
  File "/usr/lib/python3.10/runpy.py", line 289, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/recipes/generate.py", line 229, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torchtune/config/_parse.py", line 99, in wrapper
    sys.exit(recipe_main(conf))
  File "/usr/local/lib/python3.10/dist-packages/recipes/generate.py", line 224, in main
    recipe.setup(cfg=cfg)
  File "/usr/local/lib/python3.10/dist-packages/recipes/generate.py", line 70, in setup
    self._model = self._setup_model(
  File "/usr/local/lib/python3.10/dist-packages/recipes/generate.py", line 89, in _setup_model
    model.load_state_dict(model_state_dict, assign=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2584, in load_state_dict
    raise RuntimeError(
RuntimeError: Error(s) in loading state_dict for TransformerDecoder:
        Unexpected key(s) in state_dict: "layers.0.attn.q_proj.scales_and_zeros", "layers.0.attn.k_proj.scales_and_zeros", "layers.0.attn.v_proj.scales_and_zeros", "layers.0.attn.output_proj.scales_and_zeros", "layers.0.mlp.w1.scales_and_zeros", "layers.0.mlp.w2.scales_and_zeros", "layers.0.mlp.w3.scales_and_zeros", [... the same seven scales_and_zeros keys repeated for layers 1-31 ...], "output.scales_and_zeros".
        size mismatch for layers.0.attn.q_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
        size mismatch for layers.0.attn.k_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
        size mismatch for layers.0.attn.v_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
        size mismatch for layers.0.attn.output_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
        size mismatch for layers.0.mlp.w1.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
        size mismatch for layers.0.mlp.w2.weight: copying a param with shape torch.Size([512, 112, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 14336]).
        size mismatch for layers.0.mlp.w3.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
        [... identical size mismatches repeated for layers 1 through 31 ...]
        size mismatch for output.weight: copying a param with shape torch.Size([16032, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([128256, 4096]).
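
The checkpoint shapes line up with torchao's packed int4 (tinygemm) weight layout rather than plain bf16 matrices, while the "shape in current model" values are the unquantized [out_features, in_features] shapes, which suggests load_state_dict ran against a model the quantizer had not been applied to. Below is a small sketch of the shape arithmetic; the layout formula [out/8, in/(inner_k_tiles*16), 32, inner_k_tiles/2] is my reading of the packing code and should be treated as an assumption, but with inner_k_tiles=8 it reproduces every shape in the error above.

```python
# Hedged sketch: reproduce the checkpoint shapes from the bf16 weight shapes,
# assuming torchao's tinygemm int4 packing with inner_k_tiles=8.
def packed_int4_shape(out_features: int, in_features: int, inner_k_tiles: int = 8):
    return (
        out_features // 8,                    # rows packed 8-wide
        in_features // (inner_k_tiles * 16),  # columns grouped into k-tiles
        32,
        inner_k_tiles // 2,
    )

assert packed_int4_shape(4096, 4096) == (512, 32, 32, 4)      # q_proj, output_proj
assert packed_int4_shape(1024, 4096) == (128, 32, 32, 4)      # k_proj, v_proj
assert packed_int4_shape(14336, 4096) == (1792, 32, 32, 4)    # mlp.w1, mlp.w3
assert packed_int4_shape(4096, 14336) == (512, 112, 32, 4)    # mlp.w2
assert packed_int4_shape(128256, 4096) == (16032, 32, 32, 4)  # output
```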

Package versions:

torch                             2.6.0.dev20241009+cu121
torchao                           0.7.0.dev20241010+cu121
torchtune                         0.4.0.dev20241010+cpu
torchvision                       0.20.0.dev20241009+cu121
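
To double-check that the saved file really contains packed int4 tensors plus the scales_and_zeros buffers (rather than bf16 weights the recipe could load directly), the state dict can be inspected offline. A minimal sketch, assuming the checkpoint is a flat state dict saved with torch.save:

```python
import torch

# Inspect the quantized checkpoint without instantiating the model;
# mmap avoids materializing the whole 8B checkpoint in RAM.
sd = torch.load(
    "/QAT/output/llama3-8B/meta_model_0-4w-qat-module-swap.pt",
    map_location="cpu",
    mmap=True,
)
for name in list(sd)[:8]:
    print(name, tuple(sd[name].shape), sd[name].dtype)
# Expected (assumption): packed weights such as layers.0.attn.q_proj.weight
# -> (512, 32, 32, 4), plus companion "*.scales_and_zeros" tensors that the
# freshly built bf16 llama3_8b() module does not define.
```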

elfisworking · Oct 11 '24 11:10

@elfisworking thanks for creating the issue. I intended to look at this today but unfortunately ran out of time before I could get to it. I'm going to tag this as hi-pri to ensure someone takes a closer look ASAP.

ebsmothers · Oct 12 '24 03:10