torchtune
torchtune generate function error when model used Int4WeightOnlyQATQuantizer
Today I tried to use Int4WeightOnlyQATQuantizer to quantize Llama3-8B. When I run the model generate recipe, I get the error below:
Running InferenceRecipe with resolved config:

chat_format: null
checkpointer:
  _component_: torchtune.training.FullModelTorchTuneCheckpointer
  checkpoint_dir: /QAT/output/llama3-8B/
  checkpoint_files:
  - meta_model_0-4w-qat-module-swap.pt
  model_type: LLAMA3
  output_dir: /QAT/output/llama3-8B/
device: cuda
dtype: bf16
enable_kv_cache: true
instruct_template: null
max_new_tokens: 300
model:
  _component_: torchtune.models.llama3.llama3_8b
prompt: Tell me a joke?
quantizer:
  _component_: torchtune.training.quantization.Int4WeightOnlyQuantizer
  groupsize: 256
seed: 42
temperature: 0.6
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  max_seq_len: null
  path: /QAT/Meta-Llama-3-8B/original/tokenizer.model
top_k: 1
Setting manual seed to local seed 42. Local seed is seed + rank = 42 + 0
Traceback (most recent call last):
File "/usr/local/bin/tune", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torchtune/_cli/tune.py", line 49, in main
parser.run(args)
File "/usr/local/lib/python3.10/dist-packages/torchtune/_cli/tune.py", line 43, in run
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/torchtune/_cli/run.py", line 208, in _run_cmd
self._run_single_device(args, is_builtin=is_builtin)
File "/usr/local/lib/python3.10/dist-packages/torchtune/_cli/run.py", line 102, in _run_single_device
runpy.run_path(str(args.recipe), run_name="__main__")
File "/usr/lib/python3.10/runpy.py", line 289, in run_path
return _run_module_code(code, init_globals, run_name,
File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/dist-packages/recipes/generate.py", line 229, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torchtune/config/_parse.py", line 99, in wrapper
sys.exit(recipe_main(conf))
File "/usr/local/lib/python3.10/dist-packages/recipes/generate.py", line 224, in main
recipe.setup(cfg=cfg)
File "/usr/local/lib/python3.10/dist-packages/recipes/generate.py", line 70, in setup
self._model = self._setup_model(
File "/usr/local/lib/python3.10/dist-packages/recipes/generate.py", line 89, in _setup_model
model.load_state_dict(model_state_dict, assign=True)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2584, in load_state_dict
raise RuntimeError(
RuntimeError: Error(s) in loading state_dict for TransformerDecoder:
Unexpected key(s) in state_dict: "layers.0.attn.q_proj.scales_and_zeros", "layers.0.attn.k_proj.scales_and_zeros", "layers.0.attn.v_proj.scales_and_zeros", "layers.0.attn.output_proj.scales_and_zeros", "layers.0.mlp.w1.scales_and_zeros", "layers.0.mlp.w2.scales_and_zeros", "layers.0.mlp.w3.scales_and_zeros",
[... the same seven scales_and_zeros keys repeated for layers 1 through 31 ...], "output.scales_and_zeros".
size mismatch for layers.0.attn.q_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.0.attn.k_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.0.attn.v_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.0.attn.output_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.0.mlp.w1.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for layers.0.mlp.w2.weight: copying a param with shape torch.Size([512, 112, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 14336]).
size mismatch for layers.0.mlp.w3.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for layers.1.attn.q_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.1.attn.k_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.1.attn.v_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.1.attn.output_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.1.mlp.w1.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for layers.1.mlp.w2.weight: copying a param with shape torch.Size([512, 112, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 14336]).
size mismatch for layers.1.mlp.w3.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for layers.2.attn.q_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.2.attn.k_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.2.attn.v_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.2.attn.output_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.2.mlp.w1.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for layers.2.mlp.w2.weight: copying a param with shape torch.Size([512, 112, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 14336]).
size mismatch for layers.2.mlp.w3.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for layers.3.attn.q_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.3.attn.k_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.3.attn.v_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.3.attn.output_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.3.mlp.w1.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for layers.3.mlp.w2.weight: copying a param with shape torch.Size([512, 112, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 14336]).
size mismatch for layers.3.mlp.w3.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for layers.4.attn.q_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.4.attn.k_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.4.attn.v_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.4.attn.output_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.4.mlp.w1.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for layers.4.mlp.w2.weight: copying a param with shape torch.Size([512, 112, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 14336]).
size mismatch for layers.4.mlp.w3.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for layers.5.attn.q_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.5.attn.k_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.5.attn.v_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.5.attn.output_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.5.mlp.w1.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for layers.5.mlp.w2.weight: copying a param with shape torch.Size([512, 112, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 14336]).
size mismatch for layers.5.mlp.w3.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for layers.6.attn.q_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.6.attn.k_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.6.attn.v_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.6.attn.output_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.6.mlp.w1.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for layers.6.mlp.w2.weight: copying a param with shape torch.Size([512, 112, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 14336]).
size mismatch for layers.6.mlp.w3.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for layers.7.attn.q_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.7.attn.k_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.7.attn.v_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.7.attn.output_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.7.mlp.w1.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for layers.7.mlp.w2.weight: copying a param with shape torch.Size([512, 112, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 14336]).
size mismatch for layers.7.mlp.w3.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for layers.8.attn.q_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.8.attn.k_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.8.attn.v_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.8.attn.output_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.8.mlp.w1.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for layers.8.mlp.w2.weight: copying a param with shape torch.Size([512, 112, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 14336]).
size mismatch for layers.8.mlp.w3.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for layers.9.attn.q_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.9.attn.k_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.9.attn.v_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.9.attn.output_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.9.mlp.w1.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for layers.9.mlp.w2.weight: copying a param with shape torch.Size([512, 112, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 14336]).
size mismatch for layers.9.mlp.w3.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for layers.10.attn.q_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.10.attn.k_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.10.attn.v_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.10.attn.output_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.10.mlp.w1.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for layers.10.mlp.w2.weight: copying a param with shape torch.Size([512, 112, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 14336]).
size mismatch for layers.10.mlp.w3.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for layers.11.attn.q_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.11.attn.k_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.11.attn.v_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.11.attn.output_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.11.mlp.w1.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for layers.11.mlp.w2.weight: copying a param with shape torch.Size([512, 112, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 14336]).
size mismatch for layers.11.mlp.w3.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for layers.12.attn.q_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.12.attn.k_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.12.attn.v_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.12.attn.output_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.12.mlp.w1.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for layers.12.mlp.w2.weight: copying a param with shape torch.Size([512, 112, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 14336]).
size mismatch for layers.12.mlp.w3.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for layers.13.attn.q_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.13.attn.k_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.13.attn.v_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.13.attn.output_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.13.mlp.w1.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for layers.13.mlp.w2.weight: copying a param with shape torch.Size([512, 112, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 14336]).
size mismatch for layers.13.mlp.w3.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for layers.14.attn.q_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.14.attn.k_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.14.attn.v_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.14.attn.output_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.14.mlp.w1.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for layers.14.mlp.w2.weight: copying a param with shape torch.Size([512, 112, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 14336]).
size mismatch for layers.14.mlp.w3.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for layers.15.attn.q_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.15.attn.k_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.15.attn.v_proj.weight: copying a param with shape torch.Size([128, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for layers.15.attn.output_proj.weight: copying a param with shape torch.Size([512, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for layers.15.mlp.w1.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for layers.15.mlp.w2.weight: copying a param with shape torch.Size([512, 112, 32, 4]) from checkpoint, the shape in current model is torch.Size([4096, 14336]).
size mismatch for layers.15.mlp.w3.weight: copying a param with shape torch.Size([1792, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
(…identical size mismatch errors repeated for layers.16 through layers.31: every attn.q_proj/k_proj/v_proj/output_proj and mlp.w1/w2/w3 weight shows the same packed checkpoint shape vs. 2-D model shape pattern as layers.14–15 above…)
size mismatch for output.weight: copying a param with shape torch.Size([16032, 32, 32, 4]) from checkpoint, the shape in current model is torch.Size([128256, 4096]).
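A likely reading of this trace: every checkpoint tensor matches torchao's packed int4 "tinygemm" weight layout, while the freshly instantiated `llama3_8b` model still expects plain 2-D bf16 tensors — i.e. the recipe is loading an already-quantized checkpoint into a not-yet-quantized model, so the quantizer needs to be applied to the model before `load_state_dict`. As a sanity check, the shapes above can be reproduced from the layout formula `[n/8, k/(inner_k_tiles*16), 32, inner_k_tiles/2]` (this formula is an assumption based on torchao's `_convert_weight_to_int4pack` op, with `inner_k_tiles=8`):

```python
# Compute the packed int4 weight shape for an [n, k] linear weight.
# Layout formula is an assumption based on torchao's
# _convert_weight_to_int4pack op with inner_k_tiles=8.
def int4_packed_shape(n, k, inner_k_tiles=8):
    return (n // 8, k // (inner_k_tiles * 16), 32, inner_k_tiles // 2)

# Every shape reported in the traceback matches this layout:
print(int4_packed_shape(4096, 4096))    # attn.q_proj  -> (512, 32, 32, 4)
print(int4_packed_shape(1024, 4096))    # attn.k_proj  -> (128, 32, 32, 4)
print(int4_packed_shape(14336, 4096))   # mlp.w1       -> (1792, 32, 32, 4)
print(int4_packed_shape(4096, 14336))   # mlp.w2       -> (512, 112, 32, 4)
print(int4_packed_shape(128256, 4096))  # output       -> (16032, 32, 32, 4)
```

If that reading is right, the mismatch is between the checkpoint format (quantized, from the QAT run) and the model state at load time, not a corrupted checkpoint.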
Package versions:
torch 2.6.0.dev20241009+cu121
torchao 0.7.0.dev20241010+cu121
torchtune 0.4.0.dev20241010+cpu
torchvision 0.20.0.dev20241009+cu121
@elfisworking thanks for creating the issue. I intended to look at this today but unfortunately ran out of time before I could get to it. I'm going to tag this as hi-pri to ensure someone takes a closer look ASAP.