Checklist
- [x] 1. I have searched related issues but was not able to get the expected help
- [ ] 2. The bug has not been fixed in the latest version
- [ ] 3. Please note that if the bug-related issue you submit lacks corresponding environment info and a minimal reproducible example, it will be hard for us to reproduce and locate the problem, lowering the likelihood of receiving feedback
- [ ] 4. If what you raise is not a bug but a question, please start a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise the issue will be closed
- [ ] 5. To facilitate community communication, I will use Chinese/English or attach a Chinese/English translation (if using another language). Non-Chinese/English content without a translation may be closed
Problem description
When deploying DeepSeek-V3.1-Q4_K_M with the optimize_rules file DeepSeek-V3-Chat-multi-gpu-4.yaml, I enabled the following injection rule:
```yaml
# === MLP Experts Replacement ===
# replace with marlin expert. Open and modify layer-num as needed.
# Each layer of marlin experts takes about 6GB of GPU memory.
# !!!Do remember 'close' cuda graph if you are using marlin expert.!!!
# !!!KExpertsTorch is untested, we don't have enough VRAM.!!!
# GPU 0: layers 3–4
- match:
    name: "^model\.layers\.([3-4])\.mlp\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      generate_device: "cuda:0"
      generate_op: "KExpertsMarlin"
  recursive: False
```
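For reference, a minimal sketch of what this `match.name` regex selects (plain Python `re`; the example module paths below are illustrative DeepSeek-V3 module names, not taken from the log):

```python
import re

# Standalone check of the rule's "name" pattern above (not ktransformers code).
pattern = re.compile(r"^model\.layers\.([3-4])\.mlp\.experts$")

candidates = [
    "model.layers.3.mlp.experts",  # matched -> injected as KTransformersExperts on cuda:0
    "model.layers.4.mlp.experts",  # matched
    "model.layers.5.mlp.experts",  # not matched (layer outside [3-4])
    "model.layers.3.mlp",          # not matched (parent module, handled by other rules)
]
for name in candidates:
    print(f"{name:32} -> {bool(pattern.match(name))}")
```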
Startup then fails with the following error:
```
Injecting model.layers.3.mlp as ktransformers.operators.experts . KDeepseekV3MoE
Injecting model.layers.3.mlp.experts as ktransformers.operators.experts . KTransformersExperts
Traceback (most recent call last):
  File "/opt/conda/bin/ktransformers", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/ktransformers/server/main.py", line 109, in main
    create_interface(config=cfg, default_args=cfg)
  File "/opt/conda/lib/python3.11/site-packages/ktransformers/server/utils/create_interface.py", line 30, in create_interface
    GlobalInterface.interface = BackendInterface(default_args)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/ktransformers.py", line 63, in __init__
    optimize_and_load_gguf(self.model, optimize_config_path, gguf_path, config)
  File "/opt/conda/lib/python3.11/site-packages/ktransformers/optimize/optimize.py", line 128, in optimize_and_load_gguf
    inject(module, optimize_config, model_config, weights_loader)
  File "/opt/conda/lib/python3.11/site-packages/ktransformers/optimize/optimize.py", line 42, in inject
    inject(child, child_optimization_dict, model_config, gguf_loader, child_prefix)
  File "/opt/conda/lib/python3.11/site-packages/ktransformers/optimize/optimize.py", line 42, in inject
    inject(child, child_optimization_dict, model_config, gguf_loader, child_prefix)
  File "/opt/conda/lib/python3.11/site-packages/ktransformers/optimize/optimize.py", line 42, in inject
    inject(child, child_optimization_dict, model_config, gguf_loader, child_prefix)
  [Previous line repeated 1 more time]
  File "/opt/conda/lib/python3.11/site-packages/ktransformers/optimize/optimize.py", line 33, in inject
    inject_module=module_cls(key = inject_module_meta["key"], gguf_loader = gguf_loader, config = model_config, orig_module=child, **inject_module_meta["kwargs"])
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/ktransformers/operators/experts.py", line 668, in __init__
    self.generate_experts = EXPERTS_MAP[generate_op](key, gguf_loader, config, len(orig_module), device=generate_device, **kwargs)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/ktransformers/operators/experts.py", line 427, in __init__
    self.up_projs = [KLinearMarlin(key+ "." + "ffn_up_exps", gguf_loader, config, device=device) for i in range(self.expert_num)]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/ktransformers/operators/experts.py", line 427, in <listcomp>
    self.up_projs = [KLinearMarlin(key+ "." + "ffn_up_exps", gguf_loader, config, device=device) for i in range(self.expert_num)]
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/ktransformers/operators/linear.py", line 597, in __init__
    super().__init__(key, gguf_loader, config, orig_module, device, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/ktransformers/operators/linear.py", line 69, in __init__
    shape = self.gguf_loader.tensor_info[key + ".weight"]["shape"]
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
KeyError: 'model.layers.3.mlp.experts.ffn_up_exps.weight'
```
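The KeyError shows that the GGUF loader has no entry for `model.layers.3.mlp.experts.ffn_up_exps.weight` in its `tensor_info` map, so KLinearMarlin cannot locate the expert up-projection weights for that layer. As a purely diagnostic sketch (assuming the `gguf` package from llama.cpp's gguf-py is installed; the shard path is only an example), one way to see which expert tensors the Q4_K_M shards actually contain, and under what names, is:

```python
from gguf import GGUFReader  # gguf-py from llama.cpp; assumed installed

# Example shard path; point it at the files under your --gguf_path directory.
reader = GGUFReader("./deepseek-v3-gguf/DeepSeek-V3.1-Q4_K_M-00001-of-00009.gguf")

# List every tensor whose name mentions the expert up-projection,
# to compare against the key ktransformers fails to find.
for t in reader.tensors:
    if "ffn_up_exps" in t.name:
        print(t.name, list(t.shape), t.tensor_type.name)
```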
After disabling the following injection rule:
```yaml
# GPU 0: layers 3–4
- match:
    name: "^model\.layers\.([3-4])\.mlp\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      generate_device: "cuda:0"
      generate_op: "KExpertsMarlin"
  recursive: False
```
the server starts up normally.
Reproduction steps
```bash
ktransformers --model_path ./deepseek-v3-config/ --gguf_path ./deepseek-v3-gguf/ --optimize_config_path ./DeepSeek-V3-multigpu.yaml --port 10002 --log_level debug --no-use_cuda_graph
```
Model files
```
DeepSeek-V3.1-Q4_K_M-00001-of-00009.gguf
DeepSeek-V3.1-Q4_K_M-00002-of-00009.gguf
DeepSeek-V3.1-Q4_K_M-00003-of-00009.gguf
DeepSeek-V3.1-Q4_K_M-00004-of-00009.gguf
DeepSeek-V3.1-Q4_K_M-00005-of-00009.gguf
DeepSeek-V3.1-Q4_K_M-00006-of-00009.gguf
DeepSeek-V3.1-Q4_K_M-00007-of-00009.gguf
DeepSeek-V3.1-Q4_K_M-00008-of-00009.gguf
DeepSeek-V3.1-Q4_K_M-00009-of-00009.gguf
```
Environment
Docker image: docker-hub.dahuatech.com/approachingai/ktransformers:v0.3.2-FANCY
Model: DeepSeek-V3.1-Q4_K_M
GPU: NVIDIA L40
Historical issues from older versions of the inference part are no longer being maintained; the inference-side design is now concentrated on sglang + kt-kernel. You can refer to the FAQ for usage: #1608
llamafile's qxk weight support still has a few bugs at the moment, so you'll have to wait a bit.

> Historical issues from older versions of the inference part are no longer being maintained; the inference-side design is now concentrated on sglang + kt-kernel. You can refer to the FAQ for usage: #1608 llamafile's qxk weight support still has a few bugs at the moment, so you'll have to wait a bit.
@KMSorSMS Thanks for the reply. Will kt-kernel inference still support YAML injection-rule configuration going forward? Will the KLinearTorch, KLinearMarlin, KExpertsTorch, and KExpertsMarlin operators continue to be inherited and used, or will the kt-kernel backend only support CPU inference, with GPU inference handled entirely by sglang?
> Will kt-kernel inference still support YAML injection-rule configuration going forward? Will the KLinearTorch, KLinearMarlin, KExpertsTorch, and KExpertsMarlin operators continue to be inherited and used, or will the kt-kernel backend only support CPU inference, with GPU inference handled entirely by sglang?
Going forward it should be that the kt-kernel backend only supports CPU inference and GPU inference is handled entirely by sglang; there are no plans to continue supporting operator injection in the future.