[Bug] 0.3.2 hangs at `Getting inference context from sched_client. sched_rpc started with PID: xxx`
Checklist
- [x] 1. I have searched related issues but could not get the expected help.
- [x] 2. The bug has not been fixed in the latest release.
- [x] 3. Please note that if a bug-related issue lacks environment information and a minimal reproducible example, it will be hard to reproduce and locate the problem, reducing the likelihood of receiving feedback.
- [x] 4. If this is a question rather than a bug, please start a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise the issue will be closed.
- [x] 5. To help the community, I will use Chinese/English or attach a Chinese/English translation if using another language. Non-Chinese/English content without a translation may be closed.
Problem description
I compiled and installed successfully following the official instructions; the version is 0.3.2+cu128torch27fancy. I found an issue with a similar problem, #1413, where the suggested fix was to open up permissions on the DiskCacheManager root path. I have already set 777 permissions on the entire path, but it still hangs at the same place. Both models were downloaded from HF:
- GGUF: https://huggingface.co/Qwen/Qwen3-30B-A3B-GGUF/blob/main/Qwen3-30B-A3B-Q4_K_M.gguf
- Original model: https://huggingface.co/Qwen/Qwen3-30B-A3B
I tried the following combinations, all with the same problem:
- GPU: tested both an A6000 and a 5070
- OS: tested both Ubuntu 22.04 and 24.04
I tried to pinpoint where it hangs, and found that `SchedulerClient.send_request` never gets a response:

```python
def send_request(self, method, params=None):
    if params is None:
        params = {}
    request = {
        'method': method,
        'params': params
    }
    print(f'send request {request}')
    self.socket.send(pickle.dumps(request))
    print(f"send done")
    response = self.socket.recv()  # <-------------------- hangs on this line
    print(f"{response=}")
    response = pickle.loads(response)
    if response.get('status') == 'ok':
        return response
    else:
        raise Exception(f"Error from server: {response.get('message')}")
```
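Since the hang is `self.socket.recv()` blocking forever, one way to make the failure visible is to poll the socket with a deadline before receiving. Below is a minimal sketch of a timeout-guarded variant; the function name, the `timeout_ms` parameter, and the `TimeoutError` behavior are my additions, not ktransformers code:

```python
import pickle
import zmq

def send_request_with_timeout(socket, method, params=None, timeout_ms=5000):
    """Hypothetical variant of SchedulerClient.send_request that raises
    instead of hanging when sched_rpc never answers."""
    if params is None:
        params = {}
    socket.send(pickle.dumps({'method': method, 'params': params}))
    # Wait up to timeout_ms for a reply instead of blocking indefinitely.
    poller = zmq.Poller()
    poller.register(socket, zmq.POLLIN)
    if not poller.poll(timeout_ms):
        raise TimeoutError(
            f"sched_rpc did not answer '{method}' within {timeout_ms} ms; "
            "check rpc.log under ~/.ktransformers/logs")
    response = pickle.loads(socket.recv())
    if response.get('status') == 'ok':
        return response
    raise Exception(f"Error from server: {response.get('message')}")
```

With something like this in place, a sched_rpc process that died or never bound its socket turns into a clear `TimeoutError` instead of a silent hang.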
Steps to reproduce
- Command:
python3 ktransformers/server/main.py --architectures Qwen3MoeForCausalLM --model_path /mnt/model/Qwen3-30B-A3B/ --gguf_path /mnt/model/Qwen3-30B-A3B-GGUF/ --optimize_config_path /home/user/vincent/ktransformers/ktransformers/optimize/optimize_rules/Qwen3Moe-serve.yaml --backend_type balance_serve
- Output log:
```
found flashinfer
flash_attn not found, flashinfer unit test needed it. If you are using balance serve, ignore this.
set start method
Connected to server at tcp://localhost:50907
found flashinfer
flash_attn not found, flashinfer unit test needed it. If you are using balance serve, ignore this.
start method already set to spawn
Connected to server at tcp://localhost:50907
falsh attn not found
Injecting model as ktransformers.operators.models . KQwen2MoeModel
Injecting model.embed_tokens as default
Injecting model.layers as default
Injecting model.layers.0 as default
Injecting model.layers.0.self_attn as ktransformers.operators.balance_serve_attention . KQwen3MoeAttention
Injecting model.layers.0.self_attn.q_proj as ktransformers.operators.linear . KTransformersLinear
Injecting model.layers.0.self_attn.k_proj as ktransformers.operators.linear . KTransformersLinear
Injecting model.layers.0.self_attn.v_proj as ktransformers.operators.linear . KTransformersLinear
Injecting model.layers.0.self_attn.o_proj as ktransformers.operators.linear . KTransformersLinear
Injecting model.layers.0.self_attn.q_norm as ktransformers.operators.layernorm . KQwen3MoeRMSNorm

(... middle omitted ...)

loading model.layers.43.self_attn.k_norm.weight to cuda
loading model.layers.43.input_layernorm.weight to cuda
loading model.layers.43.post_attention_layernorm.weight to cuda
loading model.layers.44.self_attn.q_norm.weight to cuda
loading model.layers.44.self_attn.k_norm.weight to cuda
loading model.layers.44.input_layernorm.weight to cuda
loading model.layers.44.post_attention_layernorm.weight to cuda
loading model.layers.45.self_attn.q_norm.weight to cuda
loading model.layers.45.self_attn.k_norm.weight to cuda
loading model.layers.45.input_layernorm.weight to cuda
loading model.layers.45.post_attention_layernorm.weight to cuda
loading model.layers.46.self_attn.q_norm.weight to cuda
loading model.layers.46.self_attn.k_norm.weight to cuda
loading model.layers.46.input_layernorm.weight to cuda
loading model.layers.46.post_attention_layernorm.weight to cuda
loading model.layers.47.self_attn.q_norm.weight to cuda
loading model.layers.47.self_attn.k_norm.weight to cuda
loading model.layers.47.input_layernorm.weight to cuda
loading model.layers.47.post_attention_layernorm.weight to cuda
loading model.norm.weight to cuda
Getting inference context from sched_client. sched_rpc started with PID: 22599
```
- rpc.log:
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
[2025-07-10 10:20:15.078] [info] [scheduler.cpp:31] Number of available GPUs: 1, want 1
[2025-07-10 10:20:15.078] [info] [scheduler.cpp:66] Each GPU Total: 1536MiB, Model Params: 0MiB, KVCache: 1536MiB, Left: 0MiB
[2025-07-10 10:20:15.078] [info] [scheduler.cpp:87] total_kvcache_pages is auto derived as 64
[2025-07-10 10:20:15.078] [info] [scheduler.cpp:933] Using Strategy FCFS
[2025-07-10 10:20:15.078] [info] [scheduler.cpp:459]
Scheduler Settings:
model_name: Qwen3-30B-A3B
quant_type: BF16
model_path: /mnt/model/Qwen3-30B-A3B/
params_count: 0
layer_count: 48
num_k_heads: 4
k_head_dim: 128
bytes_per_params: 2
bytes_per_kv_cache_element: 2
page_size: 256
gpu_device_id: 0
gpu_memory_size: 1.61G
memory_utilization_percentage: 1
max_batch_size: 4
recommended_chunk_prefill_token_count: 127
sched_metrics_port: 54855
kvc2_config_path: /home/user/.ktransformers/kvc2
kvc2_root_path: /mnt/data/kvc
memory_pool_size_GB: 500
evict_count: 40
kvc2_metrics_port: 44815
load_from_disk: false
save_to_disk: true
strategy_name: FCFS
gpu_device_count: 1
load_model_configs from "/home/user/.ktransformers/kvc2/model_configs.json"
Loaded Model Configs
- Qwen3-30B-A3B
Load from "/mnt/model/Qwen3-30B-A3B/config.json"
[2025-07-10 10:20:15.078] [info] [prefix.cpp:1372] Creating KVC2 using these config
[2025-07-10 10:20:15.078] [info] [prefix.cpp:1373] GPU Only: false
[2025-07-10 10:20:15.078] [info] [prefix.cpp:1374] Load: false, Save: true
[2025-07-10 10:20:15.078] [info] [prefix.cpp:1375] Path: /mnt/data/kvc
[2025-07-10 10:20:15.078] [info] [prefix.cpp:1376] Config Path: /home/user/.ktransformers/kvc2
[2025-07-10 10:20:15.078] [info] [prefix.cpp:1377] Num Token/Page: 256, Memory Pool Size: 500.00G
[2025-07-10 10:20:15.078] [info] [prefix.cpp:1379] Evict Count: 40, Metrics Port: 44815
[2025-07-10 10:20:15.078] [info] [prefix.cpp:1380] Recompute Ratio: 0.20
[2025-07-10 10:20:15.078] [info] [prefix.cpp:1384] GPU Devices: 0
[2025-07-10 10:20:15.078] [info] [prefix.cpp:1385] Layer Count: 48, Total KVCache Pages: 64
[2025-07-10 10:20:15.078] [info] [prefix.cpp:1387] Num Token/Page: 256, Num K Heads: 4
[2025-07-10 10:20:15.078] [info] [prefix.cpp:1388] K Head Dim: 128, Tensor Type: 15
[2025-07-10 10:20:15.078] [info] [prefix.cpp:1390] MemcpyCudaStreams/Device: 4
load_model_configs from "/home/user/.ktransformers/kvc2/model_configs.json"
Loaded Model Configs
- Qwen3-30B-A3B
load_quant_configs no file at "/home/user/.ktransformers/kvc2/quant_configs.json"
[2025-07-10 10:20:15.078] [info] [prefix.cpp:1401] Creating kvc2 metrics exporter on 0.0.0.0:44815
[2025-07-10 10:20:15.078] [info] [prefix.cpp:278] DiskCacheManager root path: /mnt/data/kvc
Environment
- OS: Ubuntu 22.04.4 LTS + Python 3.11
- DRAM: DDR4 32GB x4
- GPU: NVIDIA RTX A6000
- CPU: 11th Gen Intel(R) Core(TM) i5-11400
- Driver: NVIDIA 570
- CUDA: 12.8
- Python package:
Package Version
accelerate 1.8.1 annotated-types 0.7.0 anyio 4.9.0 blessed 1.21.0 blobfile 3.0.0 build 1.2.2.post1 certifi 2025.7.9 charset-normalizer 3.4.2 click 8.2.1 colorlog 6.9.0 cpufeature 0.2.1 distro 1.9.0 fastapi 0.116.0 filelock 3.13.1 fire 0.7.0 flashinfer-python 0.2.3 fsspec 2024.6.1 greenlet 3.2.3 h11 0.16.0 hf-xet 1.1.5 httpcore 1.0.9 httpx 0.28.1 huggingface-hub 0.33.2 idna 3.10 Jinja2 3.1.4 jiter 0.10.0 jsonpatch 1.33 jsonpointer 3.0.0 ktransformers 0.3.2+cu128torch27fancy langchain 0.3.26 langchain-core 0.3.68 langchain-text-splitters 0.3.8 langsmith 0.4.4 lxml 6.0.0 MarkupSafe 2.1.5 mpmath 1.3.0 networkx 3.3 ninja 1.11.1.4 numpy 2.1.2 nvidia-cublas-cu12 12.8.3.14 nvidia-cuda-cupti-cu12 12.8.57 nvidia-cuda-nvrtc-cu12 12.8.61 nvidia-cuda-runtime-cu12 12.8.57 nvidia-cudnn-cu12 9.7.1.26 nvidia-cufft-cu12 11.3.3.41 nvidia-cufile-cu12 1.13.0.11 nvidia-curand-cu12 10.3.9.55 nvidia-cusolver-cu12 11.7.2.55 nvidia-cusparse-cu12 12.5.7.53 nvidia-cusparselt-cu12 0.6.3 nvidia-nccl-cu12 2.26.2 nvidia-nvjitlink-cu12 12.8.61 nvidia-nvtx-cu12 12.8.55 openai 1.93.2 orjson 3.10.18 packaging 24.2 pillow 11.0.0 pip 25.1 protobuf 6.31.1 psutil 7.0.0 pycryptodomex 3.23.0 pydantic 2.11.7 pydantic_core 2.33.2 pyproject_hooks 1.2.0 PyYAML 6.0.2 pyzmq 27.0.0 regex 2024.11.6 requests 2.32.4 requests-toolbelt 1.0.0 safetensors 0.5.3 sentencepiece 0.2.0 setuptools 78.1.1 sniffio 1.3.1 SQLAlchemy 2.0.41 starlette 0.46.2 sympy 1.13.3 tenacity 9.1.2 termcolor 3.1.0 tiktoken 0.9.0 tokenizers 0.21.2 torch 2.7.1+cu128 torchaudio 2.7.1+cu128 torchvision 0.22.1+cu128 tqdm 4.67.1 transformers 4.51.3 triton 3.3.1 typing_extensions 4.12.2 typing-inspection 0.4.1 urllib3 2.5.0 uvicorn 0.35.0 wcwidth 0.2.13 wheel 0.45.1 zmq 0.0.0 zstandard 0.23.0
Additional note: downgrading to v0.3.1 works fine.
Run `vim ~/.ktransformers/config.yaml` and check the kvc2 section:
- whether the user running kt has read/write permission on the path set by `disk_path`;
- whether `cpu_memory_size_GB` is smaller than the free space remaining under the path `disk_path` points to (e.g. with 300G of disk remaining, this parameter should be set below 300).
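The two checks above can be scripted. A hedged sketch (the helper name and its output strings are mine; pass your own `disk_path` and `cpu_memory_size_GB` values):

```shell
# Check a kvc2 disk_path as suggested above: the directory must exist,
# be writable by the current user, and have more free space than the
# cpu_memory_size_GB value from ~/.ktransformers/config.yaml.
check_kvc_path() {
    path="$1"    # disk_path from config.yaml
    need_gb="$2" # cpu_memory_size_GB from config.yaml
    [ -d "$path" ] || { echo "missing"; return 1; }
    [ -w "$path" ] || { echo "not writable"; return 1; }
    avail_gb=$(df -BG --output=avail "$path" | tail -n 1 | tr -dc '0-9')
    [ "$avail_gb" -gt "$need_gb" ] || { echo "only ${avail_gb}G free, need >${need_gb}G"; return 1; }
    echo "ok"
}

# Example: check_kvc_path /mnt/data/kvc 500
```

If everything is in order it prints `ok`; otherwise it names the first failing condition.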
The answer above is correct. One addition: the /mnt/data/kvc path (introduced in 0.3.2 for the prefill cache) does not necessarily exist on every server; if it is missing, the process hangs.
First I checked permissions and confirmed the kvc2 folder is wide open:

```
user@pc1:/mnt/nvme0$ ls -al
total 28
drwxrwxrwx 4 root root  4096 Jul  9 14:10 .
drwxrwxrwx 4 root root  4096 Jul 10 17:43 ..
drwxrwxrwx 2 user user  4096 Jul  9 14:10 kvc2
drwxrwxrwx 2 root root 16384 Jul  8 19:56 lost+found
```
Then I checked capacity; about 1.8TB is still available:

```
user@pc1:/mnt/nvme0$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1    1.9T   32K  1.8T   1% /mnt/nvme0
```
Checked /home/user/.ktransformers/config.yaml:

```yaml
kvc2:
  gpu_only: false
  utilization_percentage: 1.0
  cpu_memory_size_GB: 500
  disk_path: /mnt/nvme0/kvc2
```
Still stuck at `sched_rpc started with PID: xxxxx`.
The /mnt path may be special; try creating a directory such as /home/user/kvc and pointing disk_path there.
I hit the same problem. On 0.3.2 the single API and local chat modes both run fine, but the balance backend does not start; it also hangs at `Getting inference context from sched_client. sched_rpc started with PID: xxx`. I have confirmed the launch command is correct, the folder is /home/user/data/kvc with permissions adjusted via chown and chmod, and the optimize rules are correct. I also tried modifying the Dockerfile myself; inside Docker it hangs at the same place. One more observation: after deleting the ~/.ktransformers folder, recompiling ktransformers, and starting again, the .ktransformers folder is not recreated automatically.
Please post rpc.log; there is a logs directory under ~/.ktransformers/.
After reinstalling the OS I deployed via a Dockerfile instead of conda. For v0.3.2 I solved the problem by entering the container, changing disk_path in ~/.ktransformers/config.yaml to /models/kvc, and committing the image. Where in the source is this disk_path default set? I would like to update my Dockerfile accordingly and post it.
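I don't know the exact source location offhand, but a recursive grep over the checkout will surface every place `disk_path` is referenced (the helper name is mine; pass the path of your clone):

```shell
# Print every line in a source tree that mentions disk_path, with file
# names and line numbers, to locate where the kvc2 default is defined
# before baking an override into a Dockerfile.
find_disk_path_refs() {
    grep -rn 'disk_path' "$1" 2>/dev/null
}

# Example: find_disk_path_refs ~/ktransformers
```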
Also stuck at `sched_rpc started with PID`.
[2025-07-18 12:23:36.585] [info] [scheduler.cpp:31] Number of available GPUs: 1, want 1
[2025-07-18 12:23:36.586] [info] [scheduler.cpp:66] Each GPU Total: 4392MiB, Model Params: 0MiB, KVCache: 4392MiB, Left: 0MiB
[2025-07-18 12:23:36.586] [info] [scheduler.cpp:87] total_kvcache_pages is auto derived as 256
[2025-07-18 12:23:36.586] [info] [scheduler.cpp:933] Using Strategy FCFS
[2025-07-18 12:23:36.586] [info] [scheduler.cpp:459]
Scheduler Settings:
model_name: DeepSeek-TNG-R1T2-Chimera
quant_type: BF16
model_path: tngtech/DeepSeek-TNG-R1T2-Chimera
params_count: 0
layer_count: 61
num_k_heads: 1
k_head_dim: 576
bytes_per_params: 2
bytes_per_kv_cache_element: 2
page_size: 256
gpu_device_id: 0
gpu_memory_size: 4.61G
memory_utilization_percentage: 1
max_batch_size: 4
recommended_chunk_prefill_token_count: 127
sched_metrics_port: 44715
kvc2_config_path: /root/.ktransformers/kvc2
kvc2_root_path: /var/cache/kvc
memory_pool_size_GB: 500
evict_count: 40
kvc2_metrics_port: 36237
load_from_disk: false
save_to_disk: true
strategy_name: FCFS
gpu_device_count: 1
load_model_configs no file at "/root/.ktransformers/kvc2/model_configs.json"
Load from "tngtech/DeepSeek-TNG-R1T2-Chimera/config.json"
The corresponding /var/cache/kvc directory has already been created.
@xufengnian I changed the file path and confirmed there is enough capacity and the permissions are in place, but it still hangs here. rpc.log is below; how should I proceed to debug or rule out causes?
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
[2025-07-26 23:59:08.040] [info] [scheduler.cpp:31] Number of available GPUs: 1, want 1
[2025-07-26 23:59:08.040] [info] [scheduler.cpp:66] Each GPU Total: 3072MiB, Model Params: 0MiB, KVCache: 3072MiB, Left: 0MiB
[2025-07-26 23:59:08.040] [info] [scheduler.cpp:87] total_kvcache_pages is auto derived as 128
[2025-07-26 23:59:08.040] [info] [scheduler.cpp:933] Using Strategy FCFS
[2025-07-26 23:59:08.040] [info] [scheduler.cpp:459]
Scheduler Settings:
model_name: Qwen3-30B-A3B
quant_type: BF16
model_path: /home/mjrt/ai-models/Qwen3-30B-A3B
params_count: 0
layer_count: 48
num_k_heads: 4
k_head_dim: 128
bytes_per_params: 2
bytes_per_kv_cache_element: 2
page_size: 256
gpu_device_id: 0
gpu_memory_size: 3.22G
memory_utilization_percentage: 1
max_batch_size: 2
recommended_chunk_prefill_token_count: 128
sched_metrics_port: 45881
kvc2_config_path: /home/mjrt/.ktransformers/kvc2
kvc2_root_path: /home/mjrt/ktransformer-data
memory_pool_size_GB: 500
evict_count: 40
kvc2_metrics_port: 46137
load_from_disk: false
save_to_disk: true
strategy_name: FCFS
gpu_device_count: 1
load_model_configs from "/home/mjrt/.ktransformers/kvc2/model_configs.json"
Loaded Model Configs
- Qwen3-30B-A3B
Load from "/home/mjrt/ai-models/Qwen3-30B-A3B/config.json"
[2025-07-26 23:59:08.045] [info] [prefix.cpp:1372] Creating KVC2 using these config
[2025-07-26 23:59:08.045] [info] [prefix.cpp:1373] GPU Only: false
[2025-07-26 23:59:08.045] [info] [prefix.cpp:1374] Load: false, Save: true
[2025-07-26 23:59:08.045] [info] [prefix.cpp:1375] Path: /home/mjrt/ktransformer-data
[2025-07-26 23:59:08.045] [info] [prefix.cpp:1376] Config Path: /home/mjrt/.ktransformers/kvc2
[2025-07-26 23:59:08.045] [info] [prefix.cpp:1377] Num Token/Page: 256, Memory Pool Size: 500.00G
[2025-07-26 23:59:08.045] [info] [prefix.cpp:1379] Evict Count: 40, Metrics Port: 46137
[2025-07-26 23:59:08.045] [info] [prefix.cpp:1380] Recompute Ratio: 0.20
[2025-07-26 23:59:08.045] [info] [prefix.cpp:1384] GPU Devices: 0
[2025-07-26 23:59:08.045] [info] [prefix.cpp:1385] Layer Count: 48, Total KVCache Pages: 128
[2025-07-26 23:59:08.045] [info] [prefix.cpp:1387] Num Token/Page: 256, Num K Heads: 4
[2025-07-26 23:59:08.045] [info] [prefix.cpp:1388] K Head Dim: 128, Tensor Type: 15
[2025-07-26 23:59:08.045] [info] [prefix.cpp:1390] MemcpyCudaStreams/Device: 4
load_model_configs from "/home/mjrt/.ktransformers/kvc2/model_configs.json"
Loaded Model Configs
- Qwen3-30B-A3B
load_quant_configs no file at "/home/mjrt/.ktransformers/kvc2/quant_configs.json"
[2025-07-26 23:59:08.046] [info] [prefix.cpp:1401] Creating kvc2 metrics exporter on 0.0.0.0:46137
[2025-07-26 23:59:08.053] [info] [prefix.cpp:278] DiskCacheManager root path: /home/mjrt/ktransformer-data
Hello, I am also stuck at the PID step and have tried changing the kvc directory, among other things. The rpc log follows:
/opt/miniconda3/envs/kt/lib/python3.11/site-packages/transformers/utils/hub.py:105: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
set start method
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
[2025-07-27 17:54:06.506] [info] [scheduler.cpp:31] Number of available GPUs: 1, want 1
[2025-07-27 17:54:06.507] [info] [scheduler.cpp:66] Each GPU Total: 96MiB, Model Params: 0MiB, KVCache: 96MiB, Left: 0MiB
[2025-07-27 17:54:06.507] [info] [scheduler.cpp:87] total_kvcache_pages is auto derived as 4
[2025-07-27 17:54:06.507] [info] [scheduler.cpp:933] Using Strategy FCFS
[2025-07-27 17:54:06.507] [info] [scheduler.cpp:459]
Scheduler Settings:
model_name: original_config
quant_type: BF16
model_path: /root/ktransformers_models/qwen3-30b-a3b/original_config
params_count: 0
layer_count: 48
num_k_heads: 4
k_head_dim: 128
bytes_per_params: 2
bytes_per_kv_cache_element: 2
page_size: 256
gpu_device_id: 0
gpu_memory_size: 100.66M
memory_utilization_percentage: 1
max_batch_size: 1
recommended_chunk_prefill_token_count: 8
sched_metrics_port: 42186
kvc2_config_path: /root/.ktransformers/kvc2
kvc2_root_path: /tmp/kvc_cache
memory_pool_size_GB: 500
evict_count: 40
kvc2_metrics_port: 43898
load_from_disk: false
save_to_disk: true
strategy_name: FCFS
gpu_device_count: 1
load_model_configs from "/root/.ktransformers/kvc2/model_configs.json"
Loaded Model Configs
- original_config
Load from "/root/ktransformers_models/qwen3-30b-a3b/original_config/config.json"
[2025-07-27 17:54:06.509] [info] [prefix.cpp:1372] Creating KVC2 using these config
[2025-07-27 17:54:06.509] [info] [prefix.cpp:1373] GPU Only: false
[2025-07-27 17:54:06.509] [info] [prefix.cpp:1374] Load: false, Save: true
[2025-07-27 17:54:06.509] [info] [prefix.cpp:1375] Path: /tmp/kvc_cache
[2025-07-27 17:54:06.509] [info] [prefix.cpp:1376] Config Path: /root/.ktransformers/kvc2
[2025-07-27 17:54:06.509] [info] [prefix.cpp:1377] Num Token/Page: 256, Memory Pool Size: 500.00G
[2025-07-27 17:54:06.509] [info] [prefix.cpp:1379] Evict Count: 40, Metrics Port: 43898
[2025-07-27 17:54:06.509] [info] [prefix.cpp:1380] Recompute Ratio: 0.20
[2025-07-27 17:54:06.509] [info] [prefix.cpp:1384] GPU Devices: 0
[2025-07-27 17:54:06.509] [info] [prefix.cpp:1385] Layer Count: 48, Total KVCache Pages: 4
[2025-07-27 17:54:06.509] [info] [prefix.cpp:1387] Num Token/Page: 256, Num K Heads: 4
[2025-07-27 17:54:06.509] [info] [prefix.cpp:1388] K Head Dim: 128, Tensor Type: 15
[2025-07-27 17:54:06.509] [info] [prefix.cpp:1390] MemcpyCudaStreams/Device: 4
load_model_configs from "/root/.ktransformers/kvc2/model_configs.json"
Loaded Model Configs
- original_config
load_quant_configs no file at "/root/.ktransformers/kvc2/quant_configs.json"
[2025-07-27 17:54:06.510] [info] [prefix.cpp:1401] Creating kvc2 metrics exporter on 0.0.0.0:43898
[2025-07-27 17:54:06.511] [info] [prefix.cpp:278] DiskCacheManager root path: /tmp/kvc_cache
Fatal Python error: Segmentation fault
Current thread 0x0000761c45adb600 (most recent call first):
File "/opt/kt/ktransformers/server/balance_serve/sched_rpc.py", line 27 in __init__
File "/opt/kt/ktransformers/server/balance_serve/sched_rpc.py", line 145 in start_server
File "/opt/kt/ktransformers/server/balance_serve/sched_rpc.py", line 218 in <module>
Extension modules: zmq.backend.cython._zmq, numpy._core._multiarray_umath, numpy.linalg._umath_linalg, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, yaml._yaml, zstandard.backend_c, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, markupsafe._speedups, PIL._imaging, psutil._psutil_linux, psutil._psutil_posix (total: 23)
Additional note: downgrading to v0.3.1 works fine for now.
Don't try to compile the Release build of balance_serve: https://github.com/kvcache-ai/ktransformers/issues/1464
0.3.2 worked for me