
[Bug] Qwen3-Coder-30B-A3B GGUF Model Expert Operator Compatibility Issues - KExpertsMarlin KeyError and KExpertsTorch NoneType Error

Open BG8CFB opened this issue 4 months ago • 0 comments

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [x] 2. The bug has not been fixed in the latest version.
  • [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • [ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise, it will be closed.
  • [x] 5. To help the community, I will use Chinese/English or attach a Chinese/English translation if using another language. Non-Chinese/English content without translation may be closed.

Describe the bug

@jinmmd @jizhilong @sayap @sammcj @twobob
Encountered two related expert-operator compatibility issues when deploying Unsloth's Qwen3-Coder-30B-A3B-Instruct GGUF model with KTransformers:

  1. KExpertsMarlin weight-naming mismatch: configuring the experts of the first 24 layers to use the KExpertsMarlin operator fails with KeyError: 'model.layers.0.mlp.experts.ffn_up_exps.weight'
  2. KExpertsTorch NoneType error: switching to the KExpertsTorch operator instead fails with TypeError: 'NoneType' object is not subscriptable

The root cause is a mismatch between the GGUF model's weight-naming convention and the format the expert operators expect (see the sketch after this list):

  • GGUF format uses: blk.{layer_id}.ffn_*_exps.weight
  • Operators expect: model.layers.{layer_id}.mlp.experts.ffn_*_exps.weight
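
For illustration, a minimal sketch of the key translation that would reconcile the two conventions; the function name and regex below are hypothetical, not ktransformers API:
# Hypothetical mapping sketch: translate the HF-style key an expert operator
# asks for into the name actually stored in the GGUF file.
import re

def hf_key_to_gguf_key(key: str) -> str:
    m = re.fullmatch(
        r"model\.layers\.(\d+)\.mlp\.experts\.(ffn_(?:gate|up|down)_exps\.weight)",
        key,
    )
    if m is None:
        raise KeyError(key)  # not an expert tensor in the expected form
    return f"blk.{m.group(1)}.{m.group(2)}"

assert hf_key_to_gguf_key(
    "model.layers.0.mlp.experts.ffn_up_exps.weight"
) == "blk.0.ffn_up_exps.weight"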

Reproduction

Model Information:

  • Model: Unsloth Qwen3-Coder-30B-A3B-Instruct (GGUF Q4_K_M quantized)
  • Source: Hugging Face model converted to GGUF format
  • Architecture: Qwen3MoeForCausalLM (MoE model with 128 experts, 8 experts per token)

Reproduction Steps:

  1. Download Model:
# Download Qwen3-Coder-30B-A3B-Instruct GGUF model
# Place in: /root/ktransformers_models/qwen3-coder-30b/quantized_Q4_K_M/
# Original config in: /root/ktransformers_models/qwen3-coder-30b/original_config/
  2. Configure Optimization Rules:
# optimize_config.yaml - First attempt with KExpertsMarlin
- match:
    name: "^model\\.layers\\.(0|[1-9]|1[0-9]|2[0-3])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExpertsV2
    kwargs:
      generate_device: "cuda"
      generate_op: "KExpertsMarlin"  # This causes KeyError
      prefill_device: "cuda"
      prefill_op: "KExpertsTorch"
  3. Launch Command:
#!/bin/bash
source /opt/miniconda3/etc/profile.d/conda.sh
conda activate kt

python3 /opt/kt/ktransformers/server/main.py \
  --model_path "/root/ktransformers_models/qwen3-coder-30b/original_config" \
  --gguf_path "/root/ktransformers_models/qwen3-coder-30b/quantized_Q4_K_M" \
  --architectures Qwen3MoeForCausalLM \
  --optimize_config_path "/mnt/d/code/kT/deployment_Kt/output/optimize_config.yaml" \
  --cpu_infer 18 \
  --max_batch_size 4 \
  --backend_type balance_serve \
  --port 8000 \
  --chunk_size 1024 \
  --cache_lens 16384 \
  --max_new_tokens 4096
  4. Error 1 - KExpertsMarlin: KeyError: 'model.layers.0.mlp.experts.ffn_up_exps.weight'

  5. Modified config to use KExpertsTorch:

# Changed generate_op to KExpertsTorch
generate_op: "KExpertsTorch"  # This causes NoneType error
  6. Error 2 - KExpertsTorch:

TypeError: 'NoneType' object is not subscriptable
  File "ktransformers/util/custom_loader.py", line 415, in load_expert_tensor
    data = data[offset: offset + block_size * blocks_per_experts]
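
The traceback indicates data is already None when the slice runs, i.e. the preceding tensor lookup found nothing under the expected key. A hypothetical guard (illustrative names, not the actual custom_loader.py code) would surface the missing key instead of failing on the slice:
# Hypothetical guard sketch; names are illustrative, not ktransformers code.
from typing import Optional

def slice_expert_block(data: Optional[bytes], key: str, offset: int,
                       block_size: int, blocks_per_experts: int) -> bytes:
    # A lookup miss (typically an untranslated HF-style key) yields None here;
    # raise a KeyError naming the tensor instead of slicing None.
    if data is None:
        raise KeyError(f"tensor {key!r} not found in GGUF file")
    return data[offset: offset + block_size * blocks_per_experts]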

Weight Naming Analysis: inspecting the file with the gguf library shows the actual weight names:

# Actual GGUF weights:
blk.0.ffn_gate_exps.weight
blk.0.ffn_up_exps.weight  
blk.0.ffn_down_exps.weight

# Expected by operators:
model.layers.0.mlp.experts.ffn_gate_exps.weight
model.layers.0.mlp.experts.ffn_up_exps.weight
model.layers.0.mlp.experts.ffn_down_exps.weight
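
For reference, a minimal sketch of this inspection with the gguf pip package; the .gguf file name below is illustrative (the actual file sits in the quantized_Q4_K_M directory):
# Minimal inspection sketch using the `gguf` pip package.
from gguf import GGUFReader

reader = GGUFReader(
    "/root/ktransformers_models/qwen3-coder-30b/quantized_Q4_K_M/model.gguf"  # illustrative file name
)
for tensor in reader.tensors:
    if "_exps." in tensor.name:  # keep only the expert weights
        print(tensor.name)       # e.g. blk.0.ffn_gate_exps.weight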

Environment

System Environment:

  • OS: WSL Ubuntu 24.04.1 LTS
  • Python: 3.12.3
  • Conda Environment: kt
  • CUDA: Available (GPU 0)

KTransformers Configuration:

  • Installation: Source installation in /opt/kt/
  • Version: Latest from main branch
  • Backend: balance_serve
  • Device Configuration: CUDA GPU + CPU hybrid

Model Configuration:

  • Model Path: /root/ktransformers_models/qwen3-coder-30b/original_config
  • GGUF Path: /root/ktransformers_models/qwen3-coder-30b/quantized_Q4_K_M
  • Model Type: Qwen3MoeForCausalLM
  • Quantization: Q4_K_M
  • Total Layers: 24
  • Total Experts: 128
  • Experts Per Token: 8

Hardware Configuration:

  • CPU: 18 cores allocated for CPU inference
  • GPU: CUDA-capable GPU for first 24 experts
  • Memory: Sufficient for model loading

BG8CFB · Aug 07 '25 12:08