[Bug] Qwen3-Coder-30B-A3B GGUF Model Expert Operator Compatibility Issues - KExpertsMarlin KeyError and KExpertsTorch NoneType Error
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise, it will be closed.
- [x] 5. To help the community, I will use Chinese/English or attach a Chinese/English translation if using another language. Non-Chinese/English content without translation may be closed.
Describe the bug
中文描述 / Chinese Description: 在使用 KTransformers 部署 Unsloth 的 Qwen3-Coder-30B-A3B-Instruct GGUF 模型时遇到两个相关的专家操作符兼容性问题:

- KExpertsMarlin 权重命名不匹配: 配置前 24 层的专家使用 `KExpertsMarlin` 操作符时出现 `KeyError: 'model.layers.0.mlp.experts.ffn_up_exps.weight'` 错误
- KExpertsTorch NoneType 错误: 改用 `KExpertsTorch` 操作符后出现 `TypeError: 'NoneType' object is not subscriptable` 错误

根本原因是 GGUF 格式模型的权重命名格式与专家操作符期望的格式不匹配:

- GGUF 格式使用: `blk.{layer_id}.ffn_*_exps.weight`
- 操作符期望: `model.layers.{layer_id}.mlp.experts.ffn_*_exps.weight`
English Description: Encountered two related expert operator compatibility issues when deploying Unsloth's Qwen3-Coder-30B-A3B-Instruct GGUF model with KTransformers:

- KExpertsMarlin weight naming mismatch: `KeyError: 'model.layers.0.mlp.experts.ffn_up_exps.weight'` when configuring the experts of the first 24 layers to use the `KExpertsMarlin` operator
- KExpertsTorch NoneType error: `TypeError: 'NoneType' object is not subscriptable` when switching to the `KExpertsTorch` operator

The root cause is a mismatch between the GGUF model's weight naming format and the format the expert operators expect:

- GGUF format uses: `blk.{layer_id}.ffn_*_exps.weight`
- Operators expect: `model.layers.{layer_id}.mlp.experts.ffn_*_exps.weight`
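For reference, bridging the two conventions only requires a string rewrite; below is a minimal sketch in Python (the helper name `to_gguf_name` is hypothetical and not part of KTransformers):

```python
import re

def to_gguf_name(op_name: str) -> str:
    """Hypothetical helper: map an operator-style expert weight name
    (model.layers.{i}.mlp.experts.ffn_*_exps.weight) to the GGUF-style
    name (blk.{i}.ffn_*_exps.weight) actually stored in the file."""
    m = re.fullmatch(
        r"model\.layers\.(\d+)\.mlp\.experts\.(ffn_(?:gate|up|down)_exps\.weight)",
        op_name,
    )
    if m is None:
        raise KeyError(op_name)  # not an expert weight of the expected form
    return f"blk.{m.group(1)}.{m.group(2)}"

# The failing key from Error 1 maps cleanly to the name the GGUF file uses:
assert to_gguf_name(
    "model.layers.0.mlp.experts.ffn_up_exps.weight"
) == "blk.0.ffn_up_exps.weight"
```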
Reproduction
模型信息 / Model Information:
- Model: Unsloth Qwen3-Coder-30B-A3B-Instruct (GGUF Q4_K_M quantized)
- Source: Hugging Face model converted to GGUF format
- Architecture: Qwen3MoeForCausalLM (MoE model with 128 experts, 8 experts per token)
复现步骤 / Reproduction Steps:
- 下载模型 / Download Model:

```bash
# Download the Qwen3-Coder-30B-A3B-Instruct GGUF model
# Place it in: /root/ktransformers_models/qwen3-coder-30b/quantized_Q4_K_M/
# Original config in: /root/ktransformers_models/qwen3-coder-30b/original_config/
```
- 配置优化规则 / Configure Optimization Rules:

```yaml
# optimize_config.yaml - First attempt with KExpertsMarlin
- match:
    name: "^model\\.layers\\.(0|[1-9]|1[0-9]|2[0-3])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExpertsV2
    kwargs:
      generate_device: "cuda"
      generate_op: "KExpertsMarlin"  # This causes the KeyError
      prefill_device: "cuda"
      prefill_op: "KExpertsTorch"
```
- 启动命令 / Launch Command:

```bash
#!/bin/bash
source /opt/miniconda3/etc/profile.d/conda.sh
conda activate kt
python3 /opt/kt/ktransformers/server/main.py \
  --model_path "/root/ktransformers_models/qwen3-coder-30b/original_config" \
  --gguf_path "/root/ktransformers_models/qwen3-coder-30b/quantized_Q4_K_M" \
  --architectures Qwen3MoeForCausalLM \
  --optimize_config_path "/mnt/d/code/kT/deployment_Kt/output/optimize_config.yaml" \
  --cpu_infer 18 \
  --max_batch_size 4 \
  --backend_type balance_serve \
  --port 8000 \
  --chunk_size 1024 \
  --cache_lens 16384 \
  --max_new_tokens 4096
```
- 错误1 - KExpertsMarlin / Error 1 - KExpertsMarlin:

```
KeyError: 'model.layers.0.mlp.experts.ffn_up_exps.weight'
```

- 修改配置使用 KExpertsTorch / Modified config to use KExpertsTorch:

```yaml
# Changed generate_op to KExpertsTorch
generate_op: "KExpertsTorch"  # This causes the NoneType error
```

- 错误2 - KExpertsTorch / Error 2 - KExpertsTorch:

```
  File "ktransformers/util/custom_loader.py", line 415, in load_expert_tensor
    data = data[offset: offset + block_size * blocks_per_experts]
TypeError: 'NoneType' object is not subscriptable
```
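The Error 2 traceback is consistent with the loader's tensor lookup returning `None` (because the expert weight name is not found under the expected convention) before the slice at line 415. A minimal sketch of that failure mode, with hypothetical names (`tensors`, `find_tensor_data`) that illustrate the pattern rather than the actual `custom_loader.py` code:

```python
# GGUF-style names are what the file actually contains.
tensors = {"blk.0.ffn_up_exps.weight": b"\x00" * 1024}

def find_tensor_data(name: str):
    # A lookup that returns None instead of raising for unknown names...
    return tensors.get(name)

data = find_tensor_data("model.layers.0.mlp.experts.ffn_up_exps.weight")
offset, block_size, blocks_per_experts = 0, 16, 4
try:
    # ...makes the later slice fail exactly like the reported error.
    data = data[offset: offset + block_size * blocks_per_experts]
except TypeError as e:
    print(e)  # 'NoneType' object is not subscriptable
```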
权重命名分析 / Weight Naming Analysis: 使用 gguf 库检查发现实际权重命名为 / Inspecting the file with the gguf library shows the actual weight names:
```
# Actual GGUF weights:
blk.0.ffn_gate_exps.weight
blk.0.ffn_up_exps.weight
blk.0.ffn_down_exps.weight

# Expected by the operators:
model.layers.0.mlp.experts.ffn_gate_exps.weight
model.layers.0.mlp.experts.ffn_up_exps.weight
model.layers.0.mlp.experts.ffn_down_exps.weight
```
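For anyone reproducing the inspection, a short script along these lines works with the `gguf` Python package (the `.gguf` file name below is a placeholder; use the actual file in the quantized directory):

```python
from gguf import GGUFReader  # pip install gguf

# List the expert tensor names exactly as stored in the GGUF file.
reader = GGUFReader(
    "/root/ktransformers_models/qwen3-coder-30b/quantized_Q4_K_M/model.gguf"
)
for tensor in reader.tensors:
    if "_exps." in tensor.name:
        print(tensor.name, tensor.shape)
```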
Environment
系统环境 / System Environment:
- OS: WSL Ubuntu 24.04.1 LTS
- Python: 3.12.3
- Conda Environment: kt
- CUDA: Available (GPU 0)
KTransformers配置 / KTransformers Configuration:
- Installation: Source installation in `/opt/kt/`
- Version: Latest from main branch
- Backend: balance_serve
- Device Configuration: CUDA GPU + CPU hybrid
模型配置 / Model Configuration:
- Model Path: `/root/ktransformers_models/qwen3-coder-30b/original_config`
- GGUF Path: `/root/ktransformers_models/qwen3-coder-30b/quantized_Q4_K_M`
- Model Type: Qwen3MoeForCausalLM
- Quantization: Q4_K_M
- Total Layers: 24
- Total Experts: 128
- Experts Per Token: 8
硬件配置 / Hardware Configuration:
- CPU: 18 cores allocated for CPU inference
- GPU: CUDA-capable GPU hosting the experts of the first 24 layers
- Memory: Sufficient for model loading