[Bug] Cannot run Qwen3-30B-A3B at all
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise, it will be closed.
- [x] 5. To help the community, I will use Chinese/English or attach a Chinese/English translation if using another language. Non-Chinese/English content without translation may be closed.
Describe the bug
Local chat fails, and the server never opens its port.
Reproduction
Installation
I followed the instructions in the docs, and when the non-force-built version didn't work I also compiled my own version with USE_BALANCE_SERVE=1 KTRANSFORMERS_FORCE_BUILD=TRUE bash install.sh.
Models
I've tried 3 models, none of which can start:
- DeepSeek-R1-0528-Qwen3-8B (a Qwen3-based model; I understand why it doesn't start, but it's still worth mentioning)
- unsloth/Qwen3-30B-A3B-GGUF Q4_K_M (someone mentioned that unsloth's quants have issues with ktransformers, so I switched to another quant)
- lmstudio-community/Qwen3-30B-A3B-GGUF Q4_K_M (still doesn't work; same errors as the previous one)

The model metadata is in ./Qwen3-30B-A3B and the GGUF is in ./Qwen3-30B-A3B-GGUF/lmstudio-community/, so I'm fairly sure the layout is correct (see the quick check below).
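The quick check referenced above is a minimal sketch, assuming ktransformers expects HF-style metadata (config.json, tokenizer files) under --model_path and the .gguf files directly under --gguf_path:

```python
# Hypothetical sanity check of the directory layout (paths from the commands below).
from pathlib import Path

model_path = Path("./Qwen3-30B-A3B")                           # HF metadata: config.json, tokenizer files
gguf_path = Path("./Qwen3-30B-A3B-GGUF/lmstudio-community/")   # quantized weights

print("config.json present:", (model_path / "config.json").exists())
print("gguf files:", sorted(p.name for p in gguf_path.glob("*.gguf")))
```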
pip list
Please refer to https://gist.github.com/moohr/d8f0a595c3dcb77961c9c0d800abcb1d
Local-chat
TypeError: KQwen3MoeAttention.forward() got an unexpected keyword argument 'attention_mask'
Full log: https://gist.github.com/moohr/415f7daf67cabe9ed198efe1c64d0c1d
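For context, this kind of TypeError usually means an injected attention module's forward() signature doesn't declare a keyword that the surrounding Transformers code passes in. A minimal, self-contained illustration of the mismatch (not the actual ktransformers/KQwen3MoeAttention code):

```python
import torch

class NoMaskAttention(torch.nn.Module):
    # Stand-in for a replaced attention module whose forward() does not declare attention_mask.
    def forward(self, hidden_states, position_ids=None):
        return hidden_states

class MaskTolerantAttention(torch.nn.Module):
    # Accepting the keyword (or **kwargs) is the usual way such a mismatch gets resolved.
    def forward(self, hidden_states, position_ids=None, attention_mask=None, **kwargs):
        return hidden_states

x = torch.zeros(1, 4, 8)
try:
    NoMaskAttention()(x, attention_mask=torch.ones(1, 4))
except TypeError as e:
    print(e)  # forward() got an unexpected keyword argument 'attention_mask'

MaskTolerantAttention()(x, attention_mask=torch.ones(1, 4))  # runs fine
```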
Server
Launch commands:
python ktransformers/server/main.py \
--architectures Qwen3MoeForCausalLM \
--model_path ./Qwen3-30B-A3B \
--gguf_path ./Qwen3-30B-A3B-GGUF/lmstudio-community/ \
--optimize_config_path ktransformers/optimize/optimize_rules/Qwen3Moe-serve.yaml \
--backend_type balance_serve \
--port 10002 \
--cpu_infer 56 \
--chunk_size 256 \
--max_new_tokens 4096 \
--max_batch_size 4 \
--cache_lens 16384 \
--web True
It seems to be stuck on sched_rpc. Using netstat, I can see a single open port, but it is not 10002, and curling that port just returns 403.
loading model.layers.47.input_layernorm.weight to cuda
loading model.layers.47.post_attention_layernorm.weight to cuda
loading model.norm.weight to cuda
Getting inference context from sched_client.
sched_rpc started with PID: 2327
^CReceived signal 2, shutting down...
Cleaning up...
Process SpawnProcess-1:
Terminating subprocess 2150
/home/mhr/miniconda3/envs/ktransformers/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 13 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Full log: https://gist.github.com/moohr/b1a45a3901a5128859d9b81c2c1085f3
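In case it helps, this is roughly how I probed the ports (a stdlib-only sketch; 10002 is the --port I passed, and the other two are the metrics ports listed in the rpc.log below):

```python
# Minimal stdlib probe to see which of the expected ports are actually listening.
import socket

def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0

for port in (10002, 52125, 54697):
    print(port, "open" if is_listening("127.0.0.1", port) else "closed")
```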
rpc.log
Since rpc.log is requested in a lot of related issues, I'm including it here as well.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
[2025-07-15 10:58:46.702] [info] [scheduler.cpp:31] Number of available GPUs: 1, want 1
[2025-07-15 10:58:46.702] [info] [scheduler.cpp:66] Each GPU Total: 1536MiB, Model Params: 0MiB, KVCache: 1536MiB, Left: 0MiB
[2025-07-15 10:58:46.702] [info] [scheduler.cpp:87] total_kvcache_pages is auto derived as 64
[2025-07-15 10:58:46.703] [info] [scheduler.cpp:933] Using Strategy FCFS
[2025-07-15 10:58:46.703] [info] [scheduler.cpp:459]
Scheduler Settings:
model_name: Qwen3-30B-A3B
quant_type: BF16
model_path: ./Qwen3-30B-A3B
params_count: 0
layer_count: 48
num_k_heads: 4
k_head_dim: 128
bytes_per_params: 2
bytes_per_kv_cache_element: 2
page_size: 256
gpu_device_id: 0
gpu_memory_size: 1.61G
memory_utilization_percentage: 1
max_batch_size: 4
recommended_chunk_prefill_token_count: 127
sched_metrics_port: 52125
kvc2_config_path: /home/mhr/.ktransformers/kvc2
kvc2_root_path: /home/mhr/ktransformers/kvcache/
memory_pool_size_GB: 64
evict_count: 40
kvc2_metrics_port: 54697
load_from_disk: false
save_to_disk: true
strategy_name: FCFS
gpu_device_count: 1
load_model_configs from "/home/mhr/.ktransformers/kvc2/model_configs.json"
Loaded Model Configs
- Qwen3-30B-A3B
Load from "./Qwen3-30B-A3B/config.json"
[2025-07-15 10:58:46.704] [info] [prefix.cpp:1372] Creating KVC2 using these config
[2025-07-15 10:58:46.704] [info] [prefix.cpp:1373] GPU Only: false
[2025-07-15 10:58:46.704] [info] [prefix.cpp:1374] Load: false, Save: true
[2025-07-15 10:58:46.704] [info] [prefix.cpp:1375] Path: /home/mhr/ktransformers/kvcache/
[2025-07-15 10:58:46.704] [info] [prefix.cpp:1376] Config Path: /home/mhr/.ktransformers/kvc2
[2025-07-15 10:58:46.704] [info] [prefix.cpp:1377] Num Token/Page: 256, Memory Pool Size: 64.00G
[2025-07-15 10:58:46.704] [info] [prefix.cpp:1379] Evict Count: 40, Metrics Port: 54697
[2025-07-15 10:58:46.704] [info] [prefix.cpp:1380] Recompute Ratio: 0.20
[2025-07-15 10:58:46.704] [info] [prefix.cpp:1384] GPU Devices: 0
[2025-07-15 10:58:46.705] [info] [prefix.cpp:1385] Layer Count: 48, Total KVCache Pages: 64
[2025-07-15 10:58:46.705] [info] [prefix.cpp:1387] Num Token/Page: 256, Num K Heads: 4
[2025-07-15 10:58:46.705] [info] [prefix.cpp:1388] K Head Dim: 128, Tensor Type: 15
[2025-07-15 10:58:46.705] [info] [prefix.cpp:1390] MemcpyCudaStreams/Device: 4
load_model_configs from "/home/mhr/.ktransformers/kvc2/model_configs.json"
Loaded Model Configs
- Qwen3-30B-A3B
load_quant_configs from "/home/mhr/.ktransformers/kvc2/quant_configs.json"
Loaded Quant Configs
- BF16
- FP16
- FP32
- Q4_0
- Q8_0
[2025-07-15 10:58:46.706] [info] [prefix.cpp:1401] Creating kvc2 metrics exporter on 0.0.0.0:54697
[2025-07-15 10:58:46.706] [info] [prefix.cpp:278] DiskCacheManager root path: /home/mhr/ktransformers/kvcache/
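As an aside, the "total_kvcache_pages is auto derived as 64" line does line up with the other Scheduler Settings if a page holds K and V for every layer; the exact formula lives in scheduler.cpp, so this back-of-the-envelope check is only my assumption:

```python
# Recomputing the auto-derived page count from the Scheduler Settings above
# (assumed formula: one page holds page_size tokens of K and V across all layers).
layer_count = 48
num_k_heads = 4
k_head_dim = 128
page_size = 256                   # tokens per page
bytes_per_kv_cache_element = 2    # BF16

bytes_per_page = layer_count * 2 * num_k_heads * k_head_dim * page_size * bytes_per_kv_cache_element
kvcache_budget = 1536 * 1024 * 1024  # "KVCache: 1536MiB" from the log

print(bytes_per_page // 2**20)           # 24 MiB per page
print(kvcache_budget // bytes_per_page)  # 64 pages, matching the log
```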
Config.yaml
https://gist.github.com/moohr/ac9d74a1960fc5f9f15b58ae1b9ba6b5
Some things I have tried so far:
- #1430 and #1413 mentioned problems with /mnt/data/kvc not being present; I've pointed config.yaml at /home/mhr/ktransformers/kvcache/ instead, as seen above.
- Using the ktransformers backend instead of balance_serve
- Compiling many times
- Fixing the missing quant_configs.json
Thanks!
Environment
Running on a laptop:
- Ubuntu 24.04 (WSL) + Conda
- AMD Ryzen 7 7745HX, 64GB RAM
- RTX 4070 Laptop GPU, 8GB VRAM
- CUDA 12.6
- Commit: 1677e900923a4919b82b2a78603c6325e4b3e187
- Should be more than enough disk space
Edit: typo
CUDA_VISIBLE_DEVICES=6 python ktransformers/server/main.py --model_path /home/models/Qwen3/Qwen3-30B-A3B/ --gguf_path /home/models/Qwen3/Qwen3-30B-A3B-Q4_K_M/ --architectures Qwen3MoeForCausalLM --cpu_infer 62 --port 10003 --backend_type balance_serve
You could try running it without the extra parameter:
python ktransformers/server/main.py
--port 8080
--architectures Qwen3MoeForCausalLM
--model_name Qwen3-235B-A22B-Instruct-2507
--model_path "/mnt/shared/models/Qwen3-235B-A22B-Instruct-2507-GGUF"
--gguf_path "/mnt/shared/models/Qwen3-235B-A22B-Instruct-2507-GGUF/Q8_0"
--optimize_config_path ktransformers/optimize/optimize_rules/Qwen3Moe-serve.yaml
--cpu_infer 32
--temperature 0.7
--top_p 0.8
--top_k 20
--repetition_penalty 1.05
--max_new_tokens 60000
# --cache_lens 247680
--chunk_size 256
--max_batch_size 4
--backend_type balance_serve
After commenting out the extra parameter (--cache_lens), 0.3.2 runs Qwen3-235B-A22B-Instruct-2507-GGUF successfully.
If you really want to set it, you can try cache_lens = max_new_tokens * max_batch_size / 2
instead of the usual cache_lens = max_new_tokens * max_batch_size.
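For concreteness, with the values in the command above that sizing rule works out as follows (just the arithmetic, not a claim about what the scheduler requires):

```python
# Worked example of the suggested cache_lens sizing for the command above.
max_new_tokens = 60000
max_batch_size = 4

usual = max_new_tokens * max_batch_size           # 240000
suggested = max_new_tokens * max_batch_size // 2  # 120000
print(usual, suggested)
```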