convert_gpu_weights.py crashes with CUDA out of memory, even with --force_cpu
Reminder
- [x] I have read the above rules and searched the existing issues.
System Info
Intel(R) Xeon(R) Platinum 8461V + RTX 3090 (24 GB VRAM) + 384 GB RAM
Reproduction
Starting one-shot quantization...
2025-11-21T01:12:06.210574+0800 | reset | INFO - Compression lifecycle reset
2025-11-21T01:12:06.365277+0800 | _create_default_logger | INFO - Logging all LLM Compressor modifier-level logs to sparse_logs/21-11-2025_01.12.06.log
2025-11-21T01:12:06.365710+0800 | from_modifiers | INFO - Creating recipe from modifiers
2025-11-21T01:12:16.006866+0800 | initialize | INFO - Compression lifecycle initialized for 1 modifiers
2025-11-21T01:12:16.006997+0800 | IndependentPipeline | INFO - Inferred `SequentialPipeline` for `GPTQModifier`
Preparing cache: 100%|██████████| 1024/1024 [00:02<00:00, 471.54it/s]
(1/93): Calibrating: 100%|██████████| 1024/1024 [00:11<00:00, 88.18it/s]
(1/93): Propagating: 100%|██████████| 1024/1024 [00:14<00:00, 68.51it/s]
(2/93): Calibrating: 100%|██████████| 1024/1024 [00:11<00:00, 87.54it/s]
(2/93): Propagating: 100%|██████████| 1024/1024 [00:14<00:00, 72.97it/s]
(3/93): Calibrating: 100%|██████████| 1024/1024 [00:11<00:00, 86.87it/s]
(3/93): Propagating: 100%|██████████| 1024/1024 [00:12<00:00, 84.10it/s]
(4/93): Calibrating: 0%| | 0/1024 [00:02<?, ?it/s]
Traceback (most recent call last):
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/pipelines/sequential/helpers.py", line 73, in forward
outputs = forward_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<string>", line 5, in forward
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/transformers/modeling_layers.py", line 94, in __call__
return super().__call__(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 395, in forward
hidden_states = self.mlp(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 345, in forward
hidden_states = self.moe(hidden_states, topk_indices, topk_weights).view(*orig_shape)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 331, in moe
expert_output = expert(expert_input)
^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 223, in forward
down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1879, in _call_impl
return inner()
^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1840, in inner
hook_result = hook(self, args, result)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/modifiers/utils/hooks.py", line 93, in wrapped_hook
return hook(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/modifiers/quantization/gptq/base.py", line 230, in calibrate_module
self._hessians[module] = make_empty_hessian(module, device=init_device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/modifiers/quantization/gptq/gptq_quantize.py", line 30, in make_empty_hessian
return torch.zeros((num_columns, num_columns), device=device, dtype=GPTQ_PRECISION)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 100.00 MiB. GPU 0 has a total capacity of 23.57 GiB of which 65.19 MiB is free. Process 3964 has 254.00 MiB memory in use. Including non-PyTorch memory, this process has 23.23 GiB memory in use. Of the allocated memory 22.33 GiB is allocated by PyTorch, and 607.47 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/work/ktransformers/ktransformers/kt-kernel/scripts/convert_gpu_weights.py", line 376, in <module>
main()
File "/work/ktransformers/ktransformers/kt-kernel/scripts/convert_gpu_weights.py", line 360, in main
oneshot(
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/entrypoints/oneshot.py", line 330, in oneshot
one_shot()
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/entrypoints/oneshot.py", line 158, in __call__
self.apply_recipe_modifiers(
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/entrypoints/oneshot.py", line 201, in apply_recipe_modifiers
pipeline(
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/pipelines/independent/pipeline.py", line 45, in __call__
pipeline(model, dataloader, dataset_args)
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/pipelines/sequential/pipeline.py", line 104, in __call__
subgraph.forward(model, **inputs)
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/pipelines/sequential/helpers.py", line 75, in forward
raise RuntimeError(
RuntimeError: Raised an exception during execution of the following code:
def forward(self, model_layers_2, model_rotary_emb, wrapped_5, getitem_3, getitem_1):
    model_layers_3 = getattr(self.model.layers, "3")(model_layers_2, attention_mask = wrapped_5, position_ids = getitem_3, past_key_values = None, cache_position = getitem_1, position_embeddings = model_rotary_emb); model_layers_2 = wrapped_5 = getitem_3 = getitem_1 = model_rotary_emb = None
    return {'model_layers_3': model_layers_3}
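For context on the failing allocation: per the traceback, make_empty_hessian creates one fp32 matrix of shape (num_columns, num_columns) per quantized Linear. The "100.00 MiB" in the error corresponds to num_columns = 5120 (5120² × 4 bytes). A back-of-the-envelope estimate of why these Hessians can overflow 24 GiB on a MoE layer; the expert count and projection sizes here are illustrative assumptions, not read from the model config:

```python
# Back-of-the-envelope GPTQ Hessian memory for one MoE block.
# in_features=5120 is an assumption chosen because it matches the
# failed 100 MiB allocation in the log (5120 * 5120 * 4 bytes);
# the expert count of 160 is likewise illustrative.
def hessian_bytes(in_features: int, dtype_bytes: int = 4) -> int:
    """fp32 Hessian of shape (in_features, in_features)."""
    return in_features * in_features * dtype_bytes

per_proj = hessian_bytes(5120)
print(per_proj / 2**20)   # 100.0 MiB per gate/up projection

# If Hessians for ~160 experts x 3 projections all land on one GPU:
total = 160 * 3 * per_proj
print(round(total / 2**30, 1))  # ~46.9 GiB, far beyond a 24 GiB 3090
```

Even if each Hessian is small, keeping them all on the calibration device accumulates past a single consumer GPU's capacity.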
Others
command: python scripts/convert_gpu_weights.py --model_id /media/data/models/GLM-4.6/ --output_dir /models/ZhipuAI/GLM-4.6-GPTQ8 --force_cpu --trust_remote_code --max_sequence_length 1024 --num_calibration_samples 1024 --quant_type W4A16
I saw that the script contains "# Force all modules to CPU for quantization" under "if args.force_cpu:". Does enabling this flag mean the quantization process uses only system RAM, independent of GPU memory? If it still needs enough GPU memory to hold the full weights, I wouldn't need this conversion in the first place.
I also encountered this bug. Setting CUDA_VISIBLE_DEVICES="" made the process literally "force CPU". I think the script's --force_cpu option does not work as intended, and the script's documentation needs to be updated with more details. @ovowei @qiyuxinlin
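The workaround works because an empty CUDA_VISIBLE_DEVICES, set before the process initializes CUDA, hides every GPU, so torch.cuda.is_available() returns False and everything genuinely runs on CPU. A small sketch of the enumeration logic the workaround relies on (the visible_gpu_count helper is hypothetical, for illustration only):

```python
import os

# Setting CUDA_VISIBLE_DEVICES to an empty string BEFORE any CUDA
# library initializes hides every GPU from the process, which is what
# makes the "force CPU" workaround actually stick.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

def visible_gpu_count(env=os.environ):
    """Count GPUs the CUDA runtime would enumerate (hypothetical helper)."""
    ids = env.get("CUDA_VISIBLE_DEVICES")
    if ids is None:
        return None  # unrestricted: runtime sees all physical GPUs
    return len([i for i in ids.split(",") if i.strip() != ""])

print(visible_gpu_count())  # 0 -> torch.cuda.is_available() will be False
```

Note the variable must be set in the environment before the first CUDA call (e.g. when launching the script), not after torch has already initialized the driver.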
@ovowei Verified that this PR does not fix this issue: whether I set the --max_gpu_memory parameter or --force_cpu, it still reports CUDA out of memory:
root@hao-Super-Server:/work/ktransformers/ktransformers/kt-kernel# python scripts/convert_gpu_weights.py --model_id /media/data/models/GLM-4.6/ --output_dir /models/ZhipuAI/GLM-4.6-GPTQ4 --trust_remote_code --force_cpu --quant_type W4A16
Forced CPU-only mode
Starting quantization process
Model: /media/data/models/GLM-4.6/
Output: /models/ZhipuAI/GLM-4.6-GPTQ4
Quantization: W4A16
Calibration samples: 512
Max sequence length: 2048
Checking model configuration for dense layers...
Found dense layers configuration: first_k_dense_replace = 3
Adding first 3 layers to ignore list...
Dense layer pattern added: re:model\.layers\.[0-2]\.mlp\..*$
This will ignore MLP components in layers 0-2
Building CPU-only device map...
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|██████████| 93/93 [00:06<00:00, 14.58it/s]
Some weights of the model checkpoint at /media/data/models/GLM-4.6/ were not used when initializing Glm4MoeForCausalLM: ['model.layers.92.eh_proj.weight', 'model.layers.92.enorm.weight', 'model.layers.92.hnorm.weight', 'model.layers.92.input_layernorm.weight', 'model.layers.92.mlp.experts.0.down_proj.weight', 'model.layers.92.mlp.experts.0.gate_proj.weight', 'model.layers.92.mlp.experts.0.up_proj.weight', ..., 'model.layers.92.mlp.gate.e_score_correction_bias', 'model.layers.92.mlp.gate.weight', 'model.layers.92.mlp.shared_experts.down_proj.weight', 'model.layers.92.mlp.shared_experts.gate_proj.weight', 'model.layers.92.mlp.shared_experts.up_proj.weight', 'model.layers.92.post_attention_layernorm.weight', 'model.layers.92.self_attn.k_norm.weight', 'model.layers.92.self_attn.k_proj.bias', 'model.layers.92.self_attn.k_proj.weight', 'model.layers.92.self_attn.o_proj.weight', 'model.layers.92.self_attn.q_norm.weight', 'model.layers.92.self_attn.q_proj.bias', 'model.layers.92.self_attn.q_proj.weight', 'model.layers.92.self_attn.v_proj.bias', 'model.layers.92.self_attn.v_proj.weight', 'model.layers.92.shared_head.norm.weight']
(list truncated here for readability: the full warning enumerates every tensor of layer 92, i.e. the gate/up/down projections of all 160 experts 0-159 plus the attention, gate, shared-expert, and norm weights)
- This IS expected if you are initializing Glm4MoeForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Glm4MoeForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
The module name (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
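The unused weights listed above all live under `model.layers.92.*` (note the `eh_proj`/`enorm`/`hnorm`/`shared_head` names), which suggests they belong to GLM-4.6's extra multi-token-prediction layer that `Glm4MoeForCausalLM` does not instantiate, so this particular warning is likely benign. A minimal sanity check (the weight names below are an illustrative subset, not read from the real checkpoint):

```python
# Check that every "unused weight" reported by transformers shares the
# model.layers.92 prefix, i.e. they all come from the single extra
# (MTP) layer rather than being scattered across the model.
unused = [
    "model.layers.92.eh_proj.weight",
    "model.layers.92.enorm.weight",
    "model.layers.92.mlp.experts.0.down_proj.weight",
    "model.layers.92.self_attn.q_proj.weight",
    "model.layers.92.shared_head.norm.weight",
]

# "model.layers.92.eh_proj.weight".split(".")[2] -> "92"
layers = {name.split(".")[2] for name in unused if name.startswith("model.layers.")}
assert layers == {"92"}, f"unexpected layers: {sorted(layers)}"
print("all unused weights come from layer(s):", sorted(layers))
```

If names from other layers ever showed up in this set, the warning would point at a real architecture mismatch rather than the expected dropped MTP layer.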
Loading model...
Loading checkpoint shards: 100%|██████████| 93/93 [01:01<00:00, 1.52it/s]
[second occurrence of the "Some weights of the model checkpoint at /media/data/models/GLM-4.6/ were not used when initializing Glm4MoeForCausalLM" warning, with the same layer-92 weight list and notes as above]
Loading dataset: HuggingFaceH4/ultrachat_200k
Dataset prepared with 512 samples
Setting up W4A16 quantization recipe...
Ignoring the following patterns from quantization:
lm_head
re:.*\.mlp\.gate$
re:.*\.self_attn\..*$
re:.*\.shared_expert\..*$
re:.*\.shared_experts\..*$
re:.*\.mlp\.shared_expert_gate$
re:.*\.linear_attn\..*$
re:model\.layers\.[0-2]\.mlp\..*$
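The entries above are regex patterns matched against fully qualified module names (llm-compressor's `re:` syntax); everything that matches is left unquantized. A quick stdlib sketch of how a subset of these patterns partitions the model, using illustrative module names rather than ones read from the real GLM-4.6 graph:

```python
import re

# Subset of the ignore patterns from the recipe above ("lm_head" is a
# literal module name; the rest are anchored regexes).
patterns = [
    r"lm_head",
    r".*\.mlp\.gate$",
    r".*\.self_attn\..*$",
    r".*\.shared_experts\..*$",
    r"model\.layers\.[0-2]\.mlp\..*$",
]

def ignored(name: str) -> bool:
    """True if the module name matches any ignore pattern in full."""
    return any(re.fullmatch(p, name) for p in patterns)

assert ignored("lm_head")
assert ignored("model.layers.5.self_attn.q_proj")          # all attention skipped
assert ignored("model.layers.5.mlp.gate")                  # router gates skipped
assert ignored("model.layers.1.mlp.experts.0.up_proj")     # layers 0-2 MLPs skipped
assert not ignored("model.layers.5.mlp.experts.0.up_proj") # routed experts get W4A16
print("pattern check passed")
```

So with this recipe only the routed-expert projections from layer 3 onward are actually quantized, which is why the OOM later in the log is surprising for a single layer's worth of work.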
Starting one-shot quantization...
2025-11-25T10:29:31.549728+0800 | reset | INFO - Compression lifecycle reset
2025-11-25T10:29:31.644540+0800 | _create_default_logger | INFO - Logging all LLM Compressor modifier-level logs to sparse_logs/25-11-2025_10.29.31.log
2025-11-25T10:29:31.644825+0800 | from_modifiers | INFO - Creating recipe from modifiers
2025-11-25T10:29:39.480906+0800 | initialize | INFO - Compression lifecycle initialized for 1 modifiers
2025-11-25T10:29:39.481037+0800 | IndependentPipeline | INFO - Inferred `SequentialPipeline` for `GPTQModifier`
Preparing cache: 100%|██████████| 512/512 [00:01<00:00, 270.65it/s]
(1/93): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 68.23it/s]
(1/93): Propagating: 100%|██████████| 512/512 [00:09<00:00, 53.89it/s]
(2/93): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 68.92it/s]
(2/93): Propagating: 100%|██████████| 512/512 [00:08<00:00, 59.74it/s]
(3/93): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 68.34it/s]
(3/93): Propagating: 100%|██████████| 512/512 [00:07<00:00, 64.90it/s]
(4/93): Calibrating: 0%| | 0/512 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/pipelines/sequential/helpers.py", line 73, in forward
outputs = forward_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<string>", line 5, in forward
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/transformers/modeling_layers.py", line 94, in __call__
return super().__call__(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 395, in forward
hidden_states = self.mlp(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 345, in forward
hidden_states = self.moe(hidden_states, topk_indices, topk_weights).view(*orig_shape)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 331, in moe
expert_output = expert(expert_input)
^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 223, in forward
down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1879, in _call_impl
return inner()
^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1840, in inner
hook_result = hook(self, args, result)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/modifiers/utils/hooks.py", line 93, in wrapped_hook
return hook(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/modifiers/quantization/gptq/base.py", line 230, in calibrate_module
self._hessians[module] = make_empty_hessian(module, device=init_device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/modifiers/quantization/gptq/gptq_quantize.py", line 30, in make_empty_hessian
return torch.zeros((num_columns, num_columns), device=device, dtype=GPTQ_PRECISION)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 100.00 MiB. GPU 0 has a total capacity of 23.57 GiB of which 15.19 MiB is free. Process 1797596 has 254.00 MiB memory in use. Including non-PyTorch memory, this process has 23.28 GiB memory in use. Of the allocated memory 22.41 GiB is allocated by PyTorch, and 580.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
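The failed allocation matches the GPTQ Hessian size exactly: `make_empty_hessian` allocates one `num_columns x num_columns` fp32 buffer per calibrated Linear, and 100 MiB corresponds to `num_columns = 5120`. A minimal sketch of the arithmetic (the 5120 column count is inferred from the error message, not read from the model config):

```python
# Sketch: per-module GPTQ Hessian size, as allocated by make_empty_hessian.
# GPTQ_PRECISION in llmcompressor is fp32, i.e. 4 bytes per element.
BYTES_PER_FP32 = 4

def hessian_bytes(num_columns: int) -> int:
    """Bytes for a (num_columns x num_columns) fp32 Hessian."""
    return num_columns * num_columns * BYTES_PER_FP32

# The OOM message says "Tried to allocate 100.00 MiB";
# that is exactly a 5120 x 5120 fp32 matrix.
print(hessian_bytes(5120) / 2**20)  # → 100.0 MiB
```

With a MoE layer holding many experts, each contributing several Linear modules, the accumulated Hessians alone can run into gibibytes, so failing partway through layer 4 of 93 on a 24 GiB card is consistent with the Hessians being kept on GPU regardless of `--force_cpu`.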
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/work/ktransformers/ktransformers/kt-kernel/scripts/convert_gpu_weights.py", line 450, in <module>
main()
File "/work/ktransformers/ktransformers/kt-kernel/scripts/convert_gpu_weights.py", line 434, in main
oneshot(
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/entrypoints/oneshot.py", line 330, in oneshot
one_shot()
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/entrypoints/oneshot.py", line 158, in __call__
self.apply_recipe_modifiers(
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/entrypoints/oneshot.py", line 201, in apply_recipe_modifiers
pipeline(
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/pipelines/independent/pipeline.py", line 45, in __call__
pipeline(model, dataloader, dataset_args)
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/pipelines/sequential/pipeline.py", line 104, in __call__
subgraph.forward(model, **inputs)
File "/root/miniconda3/envs/kt/lib/python3.11/site-packages/llmcompressor/pipelines/sequential/helpers.py", line 75, in forward
raise RuntimeError(
RuntimeError: Raised an exception during execution of the following code:
1
2
3
4 def forward(self, wrapped_5, model_layers_2, getitem_1, model_rotary_emb, getitem_3):
5 model_layers_3 = getattr(self.model.layers, "3")(model_layers_2, attention_mask = wrapped_5, position_ids = getitem_3, past_key_values = None, cache_position = getitem_1, position_embeddings = model_rotary_emb); model_layers_2 = wrapped_5 = getitem_3 = getitem_1 = model_rotary_emb = None
6 return {'model_layers_3': model_layers_3}
7
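Not a fix for the underlying issue (the Hessians are still placed on GPU even with `--force_cpu`), but two things worth trying while this is triaged: the allocator hint quoted verbatim in the OOM message above, and, if the installed llmcompressor version supports it, enabling Hessian CPU offload on `GPTQModifier` (e.g. `offload_hessians=True`; parameter availability is an assumption about the installed version, check your `GPTQModifier` signature).

```shell
# Allocator hint taken verbatim from the OOM message above. This only reduces
# fragmentation of reserved-but-unallocated memory; it does not shrink the
# ~100 MiB-per-module Hessian allocations themselves.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
echo "$PYTORCH_CUDA_ALLOC_CONF"
```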