[Bug] Cannot run Qwen3-30B-A3B at all
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise, it will be closed.
- [x] 5. To help the community, I will use Chinese/English or attach a Chinese/English translation if using another language. Non-Chinese/English content without translation may be closed.
Describe the bug
Local chat fails, and the server never opens its port.
Reproduction
Installation
I followed the instructions in the docs, and when the non-force-built version didn't work I also compiled my own version with USE_BALANCE_SERVE=1 KTRANSFORMERS_FORCE_BUILD=TRUE bash install.sh.
Models
I've tried 3 models, none of which can start:
- DeepSeek-R1-0528-Qwen3-8B (a Qwen3-based model; I understand why it doesn't start, but it's still worth mentioning)
- unsloth/Qwen3-30B-A3B-GGUF Q4_K_M (someone mentioned that unsloth's quants have issues with ktransformers, so I switched to another quant)
- lmstudio-community/Qwen3-30B-A3B-GGUF Q4_K_M (still doesn't work; same errors as the previous one)

The model metadata is in ./Qwen3-30B-A3B and the GGUF is in ./Qwen3-30B-A3B-GGUF/lmstudio-community/, so I'm fairly sure the layout is correct (see the quick check below).
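The quick check referenced above is a minimal sketch, assuming ktransformers expects HF-style metadata (config.json, tokenizer files) under --model_path and the .gguf files directly under --gguf_path:

```python
# Hypothetical sanity check of the directory layout (paths from the commands below).
from pathlib import Path

model_path = Path("./Qwen3-30B-A3B")                           # HF metadata: config.json, tokenizer files
gguf_path = Path("./Qwen3-30B-A3B-GGUF/lmstudio-community/")   # quantized weights

print("config.json present:", (model_path / "config.json").exists())
print("gguf files:", sorted(p.name for p in gguf_path.glob("*.gguf")))
```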
pip list
Please refer to https://gist.github.com/moohr/d8f0a595c3dcb77961c9c0d800abcb1d
Local-chat
TypeError: KQwen3MoeAttention.forward() got an unexpected keyword argument 'attention_mask'
Full log: https://gist.github.com/moohr/415f7daf67cabe9ed198efe1c64d0c1d
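For context, this kind of TypeError usually means an injected attention module's forward() signature doesn't declare a keyword that the surrounding Transformers code passes in. A minimal, self-contained illustration of the mismatch (not the actual ktransformers/KQwen3MoeAttention code):

```python
import torch

class NoMaskAttention(torch.nn.Module):
    # Stand-in for a replaced attention module whose forward() does not declare attention_mask.
    def forward(self, hidden_states, position_ids=None):
        return hidden_states

class MaskTolerantAttention(torch.nn.Module):
    # Accepting the keyword (or **kwargs) is the usual way such a mismatch gets resolved.
    def forward(self, hidden_states, position_ids=None, attention_mask=None, **kwargs):
        return hidden_states

x = torch.zeros(1, 4, 8)
try:
    NoMaskAttention()(x, attention_mask=torch.ones(1, 4))
except TypeError as e:
    print(e)  # forward() got an unexpected keyword argument 'attention_mask'

MaskTolerantAttention()(x, attention_mask=torch.ones(1, 4))  # runs fine
```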
Server
Launch commands:
python ktransformers/server/main.py \
--architectures Qwen3MoeForCausalLM \
--model_path ./Qwen3-30B-A3B \
--gguf_path ./Qwen3-30B-A3B-GGUF/lmstudio-community/ \
--optimize_config_path ktransformers/optimize/optimize_rules/Qwen3Moe-serve.yaml \
--backend_type balance_serve \
--port 10002 \
--cpu_infer 56 \
--chunk_size 256 \
--max_new_tokens 4096 \
--max_batch_size 4 \
--cache_lens 16384 \
--web True
It seems to be stuck on sched_rpc. Using netstat, I can see a single open port, but it is not 10002, and curling that port just returns 403.
loading model.layers.47.input_layernorm.weight to cuda
loading model.layers.47.post_attention_layernorm.weight to cuda
loading model.norm.weight to cuda
Getting inference context from sched_client.
sched_rpc started with PID: 2327
^CReceived signal 2, shutting down...
Cleaning up...
Process SpawnProcess-1:
Terminating subprocess 2150
/home/mhr/miniconda3/envs/ktransformers/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 13 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Full log: https://gist.github.com/moohr/b1a45a3901a5128859d9b81c2c1085f3
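In case it helps, this is roughly how I probed the ports (a stdlib-only sketch; 10002 is the --port I passed, and the other two are the metrics ports listed in the rpc.log below):

```python
# Minimal stdlib probe to see which of the expected ports are actually listening.
import socket

def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0

for port in (10002, 52125, 54697):
    print(port, "open" if is_listening("127.0.0.1", port) else "closed")
```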
rpc.log
Since rpc.log is requested in a lot of related issues, I'm including it here as well.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
[2025-07-15 10:58:46.702] [info] [scheduler.cpp:31] Number of available GPUs: 1, want 1
[2025-07-15 10:58:46.702] [info] [scheduler.cpp:66] Each GPU Total: 1536MiB, Model Params: 0MiB, KVCache: 1536MiB, Left: 0MiB
[2025-07-15 10:58:46.702] [info] [scheduler.cpp:87] total_kvcache_pages is auto derived as 64
[2025-07-15 10:58:46.703] [info] [scheduler.cpp:933] Using Strategy FCFS
[2025-07-15 10:58:46.703] [info] [scheduler.cpp:459]
Scheduler Settings:
model_name: Qwen3-30B-A3B
quant_type: BF16
model_path: ./Qwen3-30B-A3B
params_count: 0
layer_count: 48
num_k_heads: 4
k_head_dim: 128
bytes_per_params: 2
bytes_per_kv_cache_element: 2
page_size: 256
gpu_device_id: 0
gpu_memory_size: 1.61G
memory_utilization_percentage: 1
max_batch_size: 4
recommended_chunk_prefill_token_count: 127
sched_metrics_port: 52125
kvc2_config_path: /home/mhr/.ktransformers/kvc2
kvc2_root_path: /home/mhr/ktransformers/kvcache/
memory_pool_size_GB: 64
evict_count: 40
kvc2_metrics_port: 54697
load_from_disk: false
save_to_disk: true
strategy_name: FCFS
gpu_device_count: 1
load_model_configs from "/home/mhr/.ktransformers/kvc2/model_configs.json"
Loaded Model Configs
- Qwen3-30B-A3B
Load from "./Qwen3-30B-A3B/config.json"
[2025-07-15 10:58:46.704] [info] [prefix.cpp:1372] Creating KVC2 using these config
[2025-07-15 10:58:46.704] [info] [prefix.cpp:1373] GPU Only: false
[2025-07-15 10:58:46.704] [info] [prefix.cpp:1374] Load: false, Save: true
[2025-07-15 10:58:46.704] [info] [prefix.cpp:1375] Path: /home/mhr/ktransformers/kvcache/
[2025-07-15 10:58:46.704] [info] [prefix.cpp:1376] Config Path: /home/mhr/.ktransformers/kvc2
[2025-07-15 10:58:46.704] [info] [prefix.cpp:1377] Num Token/Page: 256, Memory Pool Size: 64.00G
[2025-07-15 10:58:46.704] [info] [prefix.cpp:1379] Evict Count: 40, Metrics Port: 54697
[2025-07-15 10:58:46.704] [info] [prefix.cpp:1380] Recompute Ratio: 0.20
[2025-07-15 10:58:46.704] [info] [prefix.cpp:1384] GPU Devices: 0
[2025-07-15 10:58:46.705] [info] [prefix.cpp:1385] Layer Count: 48, Total KVCache Pages: 64
[2025-07-15 10:58:46.705] [info] [prefix.cpp:1387] Num Token/Page: 256, Num K Heads: 4
[2025-07-15 10:58:46.705] [info] [prefix.cpp:1388] K Head Dim: 128, Tensor Type: 15
[2025-07-15 10:58:46.705] [info] [prefix.cpp:1390] MemcpyCudaStreams/Device: 4
load_model_configs from "/home/mhr/.ktransformers/kvc2/model_configs.json"
Loaded Model Configs
- Qwen3-30B-A3B
load_quant_configs from "/home/mhr/.ktransformers/kvc2/quant_configs.json"
Loaded Quant Configs
- BF16
- FP16
- FP32
- Q4_0
- Q8_0
[2025-07-15 10:58:46.706] [info] [prefix.cpp:1401] Creating kvc2 metrics exporter on 0.0.0.0:54697
[2025-07-15 10:58:46.706] [info] [prefix.cpp:278] DiskCacheManager root path: /home/mhr/ktransformers/kvcache/
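As an aside, the "total_kvcache_pages is auto derived as 64" line does line up with the other Scheduler Settings if a page holds K and V for every layer; the exact formula lives in scheduler.cpp, so this back-of-the-envelope check is only my assumption:

```python
# Recomputing the auto-derived page count from the Scheduler Settings above
# (assumed formula: one page holds page_size tokens of K and V across all layers).
layer_count = 48
num_k_heads = 4
k_head_dim = 128
page_size = 256                   # tokens per page
bytes_per_kv_cache_element = 2    # BF16

bytes_per_page = layer_count * 2 * num_k_heads * k_head_dim * page_size * bytes_per_kv_cache_element
kvcache_budget = 1536 * 1024 * 1024  # "KVCache: 1536MiB" from the log

print(bytes_per_page // 2**20)           # 24 MiB per page
print(kvcache_budget // bytes_per_page)  # 64 pages, matching the log
```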
Config.yaml
https://gist.github.com/moohr/ac9d74a1960fc5f9f15b58ae1b9ba6b5
Some things I have tried so far:
- #1430 and #1413 mentioned problems with /mnt/data/kvc not being present; I've pointed config.yaml at /home/mhr/ktransformers/kvcache/ instead, as seen above.
- Using the ktransformers backend instead of balance_serve
- Compiling many times
- Fixing the missing quant_configs.json
Thanks!
Environment
Running on a laptop:
- Ubuntu 24.04 (WSL) + Conda
- AMD Ryzen 7 7745HX, 64GB RAM
- RTX 4070 Laptop GPU, 8GB VRAM
- CUDA 12.6
- Commit: 1677e900923a4919b82b2a78603c6325e4b3e187
- Should be more than enough disk space
Edit: typo
CUDA_VISIBLE_DEVICES=6 python ktransformers/server/main.py --model_path /home/models/Qwen3/Qwen3-30B-A3B/ --gguf_path /home/models/Qwen3/Qwen3-30B-A3B-Q4_K_M/ --architectures Qwen3MoeForCausalLM --cpu_infer 62 --port 10003 --backend_type balance_serve
You could try running it without the extra parameter:
python ktransformers/server/main.py
--port 8080
--architectures Qwen3MoeForCausalLM
--model_name Qwen3-235B-A22B-Instruct-2507
--model_path "/mnt/shared/models/Qwen3-235B-A22B-Instruct-2507-GGUF"
--gguf_path "/mnt/shared/models/Qwen3-235B-A22B-Instruct-2507-GGUF/Q8_0"
--optimize_config_path ktransformers/optimize/optimize_rules/Qwen3Moe-serve.yaml
--cpu_infer 32
--temperature 0.7
--top_p 0.8
--top_k 20
--repetition_penalty 1.05
--max_new_tokens 60000
# --cache_lens 247680
--chunk_size 256
--max_batch_size 4
--backend_type balance_serve
After commenting out the extra parameter (--cache_lens), 0.3.2 runs Qwen3-235B-A22B-Instruct-2507-GGUF successfully.
If you really want to set it, you can try cache_lens = max_new_tokens * max_batch_size / 2
instead of the usual cache_lens = max_new_tokens * max_batch_size.
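For concreteness, with the values in the command above that sizing rule works out as follows (just the arithmetic, not a claim about what the scheduler requires):

```python
# Worked example of the suggested cache_lens sizing for the command above.
max_new_tokens = 60000
max_batch_size = 4

usual = max_new_tokens * max_batch_size           # 240000
suggested = max_new_tokens * max_batch_size // 2  # 120000
print(usual, suggested)
```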