[Bug] ktransformers 0.24 runs out of memory loading DeepSeek Q4_K_M
Checklist
- [ ] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise, it will be closed.
- [ ] 5. To help the community, I will use Chinese/English or attach a Chinese/English translation if using another language. Non-Chinese/English content without translation may be closed.
Describe the bug
The machine has 3× RTX 4090 GPUs, 512 GB of DDR5 RAM, and runs Ubuntu 22.04.
ktransformers 0.24 installed successfully.
Before running I set:

```shell
export CUDA_VISIBLE_DEVICES=0
export USE_NUMA=1
export TORCH_CUDA_ARCH_LIST="8.9"
export MAX_JOBS=70
```

Then I launched local chat:

```shell
python -m ktransformers.local_chat \
  --model_path /opt/DeepSeek-V3-0324 \
  --gguf_path /opt/DeepSeek-V3-0324-GGUF/Q4_K_M \
  --optimize_config_path /opt/ktransformer/ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-serve.yaml \
  --cpu_infer 70
```

At around layer 40 of loading, the process runs out of memory.
Reproduction
```shell
export CUDA_VISIBLE_DEVICES=0
export USE_NUMA=1
export TORCH_CUDA_ARCH_LIST="8.9"
export MAX_JOBS=70

python -m ktransformers.local_chat \
  --model_path /opt/DeepSeek-V3-0324 \
  --gguf_path /opt/DeepSeek-V3-0324-GGUF/Q4_K_M \
  --optimize_config_path /opt/ktransformer/ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-serve.yaml \
  --cpu_infer 70
```
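Not from the original report, but a hedged diagnostic sketch for the out-of-memory symptom discussed below: with `USE_NUMA=1`, ktransformers reportedly keeps one copy of the weights per NUMA node, and a DeepSeek-V3 Q4_K_M GGUF is roughly 400 GB, so two in-RAM copies cannot fit in 512 GB. The commands below only read standard Linux interfaces; `numastat` comes from the `numactl` package and may not be installed.

```shell
# Total RAM available on the box (two ~400 GB weight copies will not fit in 512 GB).
grep MemTotal /proc/meminfo

# Per-NUMA-node memory breakdown, if numastat (numactl package) is available.
command -v numastat >/dev/null && numastat -m | head -n 20 || true
```

Running `numastat -m` in a second terminal while the layers load should show whether resident memory grows on both nodes at once (the duplicated-weights case) or only on one.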
Environment
3× RTX 4090 GPUs, 512 GB DDR5 RAM, Ubuntu 22.04.
You have NUMA enabled; at runtime the model occupies 2× its size in memory. See this video https://www.bilibili.com/video/BV1kV8AzKEjJ/ : someone patched KT so that with NUMA enabled it keeps only one copy of the weights. Also, Q4 models are supported by FastLLM, which likewise keeps a single copy with NUMA enabled; if you don't want to keep wrestling with KT, FastLLM is worth trying.
That doesn't add up. I previously ran this on Windows 10 with 2× RTX 3090 and 512 GB of DDR4, on v0.23 at the time; memory wasn't doubled then, it was just too slow (2.5 tokens/s), which is why I moved to this new hardware and to Ubuntu.
I'm now trying to download that patch to see whether it gets things working.