[Bug] Qwen3-Coder-30B-A3B GGUF Model Expert Operator Compatibility Issues - KExpertsMarlin KeyError and KExpertsTorch NoneType Error
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise, it will be closed.
- [x] 5. To help the community, I will use Chinese/English or attach a Chinese/English translation if using another language. Non-Chinese/English content without translation may be closed.
Describe the bug
中文描述 / Chinese Description: 在使用 KTransformers 部署 Unsloth 的 Qwen3-Coder-30B-A3B-Instruct GGUF 模型时遇到两个相关的专家操作符兼容性问题:

- KExpertsMarlin 权重命名不匹配: 配置前 24 层的专家使用 `KExpertsMarlin` 操作符时出现 `KeyError: 'model.layers.0.mlp.experts.ffn_up_exps.weight'` 错误
- KExpertsTorch NoneType 错误: 改用 `KExpertsTorch` 操作符后出现 `TypeError: 'NoneType' object is not subscriptable` 错误

根本原因是 GGUF 格式模型的权重命名格式与专家操作符期望的格式不匹配:

- GGUF 格式使用: `blk.{layer_id}.ffn_*_exps.weight`
- 操作符期望: `model.layers.{layer_id}.mlp.experts.ffn_*_exps.weight`
English Description: Encountered two related expert operator compatibility issues when deploying Unsloth's Qwen3-Coder-30B-A3B-Instruct GGUF model with KTransformers:

- KExpertsMarlin weight naming mismatch: `KeyError: 'model.layers.0.mlp.experts.ffn_up_exps.weight'` when configuring the experts of the first 24 layers to use the `KExpertsMarlin` operator
- KExpertsTorch NoneType error: `TypeError: 'NoneType' object is not subscriptable` when switching to the `KExpertsTorch` operator

The root cause is a mismatch between the GGUF model's weight naming format and the format the expert operators expect:

- GGUF format uses: `blk.{layer_id}.ffn_*_exps.weight`
- Operators expect: `model.layers.{layer_id}.mlp.experts.ffn_*_exps.weight`
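For reference, bridging the two conventions only requires a string rewrite; below is a minimal sketch in Python (the helper name `to_gguf_name` is hypothetical and not part of KTransformers):

```python
import re

def to_gguf_name(op_name: str) -> str:
    """Hypothetical helper: map an operator-style expert weight name
    (model.layers.{i}.mlp.experts.ffn_*_exps.weight) to the GGUF-style
    name (blk.{i}.ffn_*_exps.weight) actually stored in the file."""
    m = re.fullmatch(
        r"model\.layers\.(\d+)\.mlp\.experts\.(ffn_(?:gate|up|down)_exps\.weight)",
        op_name,
    )
    if m is None:
        raise KeyError(op_name)  # not an expert weight of the expected form
    return f"blk.{m.group(1)}.{m.group(2)}"

# The failing key from Error 1 maps cleanly to the name the GGUF file uses:
assert to_gguf_name(
    "model.layers.0.mlp.experts.ffn_up_exps.weight"
) == "blk.0.ffn_up_exps.weight"
```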
Reproduction
模型信息 / Model Information:
- Model: Unsloth Qwen3-Coder-30B-A3B-Instruct (GGUF Q4_K_M quantized)
- Source: Hugging Face model converted to GGUF format
- Architecture: Qwen3MoeForCausalLM (MoE model with 128 experts, 8 experts per token)
复现步骤 / Reproduction Steps:
- 下载模型 / Download Model:

```bash
# Download the Qwen3-Coder-30B-A3B-Instruct GGUF model
# Place it in: /root/ktransformers_models/qwen3-coder-30b/quantized_Q4_K_M/
# Original config in: /root/ktransformers_models/qwen3-coder-30b/original_config/
```
- 配置优化规则 / Configure Optimization Rules:

```yaml
# optimize_config.yaml - First attempt with KExpertsMarlin
- match:
    name: "^model\\.layers\\.(0|[1-9]|1[0-9]|2[0-3])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExpertsV2
    kwargs:
      generate_device: "cuda"
      generate_op: "KExpertsMarlin"  # This causes the KeyError
      prefill_device: "cuda"
      prefill_op: "KExpertsTorch"
```
- 启动命令 / Launch Command:

```bash
#!/bin/bash
source /opt/miniconda3/etc/profile.d/conda.sh
conda activate kt
python3 /opt/kt/ktransformers/server/main.py \
  --model_path "/root/ktransformers_models/qwen3-coder-30b/original_config" \
  --gguf_path "/root/ktransformers_models/qwen3-coder-30b/quantized_Q4_K_M" \
  --architectures Qwen3MoeForCausalLM \
  --optimize_config_path "/mnt/d/code/kT/deployment_Kt/output/optimize_config.yaml" \
  --cpu_infer 18 \
  --max_batch_size 4 \
  --backend_type balance_serve \
  --port 8000 \
  --chunk_size 1024 \
  --cache_lens 16384 \
  --max_new_tokens 4096
```
- 错误1 - KExpertsMarlin / Error 1 - KExpertsMarlin:

```
KeyError: 'model.layers.0.mlp.experts.ffn_up_exps.weight'
```

- 修改配置使用 KExpertsTorch / Modified config to use KExpertsTorch:

```yaml
# Changed generate_op to KExpertsTorch
generate_op: "KExpertsTorch"  # This causes the NoneType error
```

- 错误2 - KExpertsTorch / Error 2 - KExpertsTorch:

```
  File "ktransformers/util/custom_loader.py", line 415, in load_expert_tensor
    data = data[offset: offset + block_size * blocks_per_experts]
TypeError: 'NoneType' object is not subscriptable
```
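The Error 2 traceback is consistent with the loader's tensor lookup returning `None` (because the expert weight name is not found under the expected convention) before the slice at line 415. A minimal sketch of that failure mode, with hypothetical names (`tensors`, `find_tensor_data`) that illustrate the pattern rather than the actual `custom_loader.py` code:

```python
# GGUF-style names are what the file actually contains.
tensors = {"blk.0.ffn_up_exps.weight": b"\x00" * 1024}

def find_tensor_data(name: str):
    # A lookup that returns None instead of raising for unknown names...
    return tensors.get(name)

data = find_tensor_data("model.layers.0.mlp.experts.ffn_up_exps.weight")
offset, block_size, blocks_per_experts = 0, 16, 4
try:
    # ...makes the later slice fail exactly like the reported error.
    data = data[offset: offset + block_size * blocks_per_experts]
except TypeError as e:
    print(e)  # 'NoneType' object is not subscriptable
```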
权重命名分析 / Weight Naming Analysis: 使用 gguf 库检查发现实际权重命名为 / Inspecting the file with the gguf library shows the actual weight names:
```
# Actual GGUF weights:
blk.0.ffn_gate_exps.weight
blk.0.ffn_up_exps.weight
blk.0.ffn_down_exps.weight

# Expected by the operators:
model.layers.0.mlp.experts.ffn_gate_exps.weight
model.layers.0.mlp.experts.ffn_up_exps.weight
model.layers.0.mlp.experts.ffn_down_exps.weight
```
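For anyone reproducing the inspection, a short script along these lines works with the `gguf` Python package (the `.gguf` file name below is a placeholder; use the actual file in the quantized directory):

```python
from gguf import GGUFReader  # pip install gguf

# List the expert tensor names exactly as stored in the GGUF file.
reader = GGUFReader(
    "/root/ktransformers_models/qwen3-coder-30b/quantized_Q4_K_M/model.gguf"
)
for tensor in reader.tensors:
    if "_exps." in tensor.name:
        print(tensor.name, tensor.shape)
```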
Environment
系统环境 / System Environment:
- OS: WSL Ubuntu 24.04.1 LTS
- Python: 3.12.3
- Conda Environment: kt
- CUDA: Available (GPU 0)
KTransformers配置 / KTransformers Configuration:
- Installation: Source installation in `/opt/kt/`
- Version: Latest from main branch
- Backend: balance_serve
- Device Configuration: CUDA GPU + CPU hybrid
模型配置 / Model Configuration:
- Model Path: `/root/ktransformers_models/qwen3-coder-30b/original_config`
- GGUF Path: `/root/ktransformers_models/qwen3-coder-30b/quantized_Q4_K_M`
- Model Type: Qwen3MoeForCausalLM
- Quantization: Q4_K_M
- Total Layers: 24
- Total Experts: 128
- Experts Per Token: 8
硬件配置 / Hardware Configuration:
- CPU: 18 cores allocated for CPU inference
- GPU: CUDA-capable GPU hosting the experts of the first 24 layers
- Memory: Sufficient for model loading