
[Bug] ktransformers 0.24 runs out of memory loading DeepSeek Q4_K_M

Open lxz-liu-666 opened this issue 3 months ago • 3 comments

Checklist

  • [ ] 1. I have searched related issues but cannot get the expected help.
  • [ ] 2. The bug has not been fixed in the latest version.
  • [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • [ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise, it will be closed.
  • [ ] 5. To help the community, I will use Chinese/English or attach a Chinese/English translation if using another language. Non-Chinese/English content without translation may be closed.

Describe the bug

Hardware: 3× RTX 4090; 512 GB of DDR5 RAM; OS: Ubuntu 22.04. I have successfully installed ktransformers 0.24. Before running, I set:

export CUDA_VISIBLE_DEVICES=0
export USE_NUMA=1
export TORCH_CUDA_ARCH_LIST="8.9"
export MAX_JOBS=70

then launched ktransformers with:

python -m ktransformers.local_chat \
  --model_path /opt/DeepSeek-V3-0324 \
  --gguf_path /opt/DeepSeek-V3-0324-GGUF/Q4_K_M \
  --optimize_config_path /opt/ktransformer/ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-serve.yaml \
  --cpu_infer 70

When loading reaches around layer 40, it runs out of memory.
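To pin down how much headroom remains as loading approaches layer 40, free memory can be polled in a second terminal while the model loads. A minimal standard-library sketch (Linux only; the poll count and interval are arbitrary choices, not part of the original report):

```python
import time

def available_gib():
    """Read MemAvailable from /proc/meminfo and convert kB to GiB (Linux only)."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / (1024 ** 2)
    raise RuntimeError("MemAvailable not found in /proc/meminfo")

# Sample a few times while the model is loading; increase the range/interval
# to cover the whole load in practice.
for _ in range(3):
    print(f"available: {available_gib():.1f} GiB")
    time.sleep(1)
```

If the available figure drops toward zero well before all layers are loaded, that confirms an OOM rather than some other failure.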

Reproduction

export CUDA_VISIBLE_DEVICES=0
export USE_NUMA=1
export TORCH_CUDA_ARCH_LIST="8.9"
export MAX_JOBS=70
python -m ktransformers.local_chat \
  --model_path /opt/DeepSeek-V3-0324 \
  --gguf_path /opt/DeepSeek-V3-0324-GGUF/Q4_K_M \
  --optimize_config_path /opt/ktransformer/ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-serve.yaml \
  --cpu_infer 70

Environment

3× RTX 4090; 512 GB of DDR5 RAM; Ubuntu 22.04.

lxz-liu-666 · Sep 18 '25 17:09

You have NUMA enabled; at runtime the model then occupies 2× its size in memory. See this video: https://www.bilibili.com/video/BV1kV8AzKEjJ/ — someone patched KT so that with NUMA enabled it keeps only one copy of the model in memory. Also, Q4 is supported by FastLLM, which keeps only one copy even with NUMA enabled; if you don't want to keep fighting KT, I suggest trying FastLLM.
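A rough back-of-the-envelope check shows why this doubling would blow past 512 GB. Both figures below are assumptions for illustration: the ~400 GB size is an approximation for the DeepSeek-V3 Q4_K_M GGUF, and the 2× factor assumes one full weight copy per NUMA node on a dual-socket box, as the comment above describes:

```python
# Estimate: with USE_NUMA=1 (unpatched KT), CPU-resident weights are
# replicated once per NUMA node.
weights_gb = 400    # approx. DeepSeek-V3 Q4_K_M GGUF size (assumption)
numa_nodes = 2      # typical dual-socket DDR5 server (assumption)
installed_gb = 512

needed_gb = weights_gb * numa_nodes
print(f"needed ~{needed_gb} GB vs {installed_gb} GB installed")
print("OOM expected" if needed_gb > installed_gb else "should fit")
```

Under these assumptions roughly 800 GB would be needed, which is consistent with the load failing partway through (around layer 40) rather than at the start.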

wqshmzh · Sep 20 '25 12:09

> You have NUMA enabled; at runtime the model then occupies 2× its size in memory. See this video: https://www.bilibili.com/video/BV1kV8AzKEjJ/ — someone patched KT so that with NUMA enabled it keeps only one copy of the model in memory. Also, Q4 is supported by FastLLM, which keeps only one copy even with NUMA enabled; if you don't want to keep fighting KT, I suggest trying FastLLM.

That doesn't match my experience. I previously tried this on Windows 10 with 2× RTX 3090 and 512 GB of DDR4, running v0.23; with NUMA enabled it did not use two copies of the model in memory, it was just too slow (2.5 tokens/s). That's why I moved to the new hardware and switched to Ubuntu.

lxz-liu-666 · Sep 29 '25 15:09

> You have NUMA enabled; at runtime the model then occupies 2× its size in memory. See this video: https://www.bilibili.com/video/BV1kV8AzKEjJ/ — someone patched KT so that with NUMA enabled it keeps only one copy of the model in memory. Also, Q4 is supported by FastLLM, which keeps only one copy even with NUMA enabled; if you don't want to keep fighting KT, I suggest trying FastLLM.

I'm now trying to download that patch to see whether I can get it working.

lxz-liu-666 · Sep 29 '25 16:09