[Bug] ktransformers 0.24 runs out of memory loading DeepSeek Q4_K_M
Checklist
- [ ] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise, it will be closed.
- [ ] 5. To help the community, I will use Chinese/English or attach a Chinese/English translation if using another language. Non-Chinese/English content without translation may be closed.
Describe the bug
The machine has 3× RTX 4090 GPUs, 512 GB of DDR5 RAM, and runs Ubuntu 22.04.
ktransformers 0.24 installed successfully.
Before running I set:

```shell
export CUDA_VISIBLE_DEVICES=0
export USE_NUMA=1
export TORCH_CUDA_ARCH_LIST="8.9"
export MAX_JOBS=70
```

Then I launched local chat:

```shell
python -m ktransformers.local_chat \
  --model_path /opt/DeepSeek-V3-0324 \
  --gguf_path /opt/DeepSeek-V3-0324-GGUF/Q4_K_M \
  --optimize_config_path /opt/ktransformer/ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-serve.yaml \
  --cpu_infer 70
```

At around layer 40 of loading, the process runs out of memory.
Reproduction
```shell
export CUDA_VISIBLE_DEVICES=0
export USE_NUMA=1
export TORCH_CUDA_ARCH_LIST="8.9"
export MAX_JOBS=70

python -m ktransformers.local_chat \
  --model_path /opt/DeepSeek-V3-0324 \
  --gguf_path /opt/DeepSeek-V3-0324-GGUF/Q4_K_M \
  --optimize_config_path /opt/ktransformer/ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-serve.yaml \
  --cpu_infer 70
```
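Not from the original report, but a hedged diagnostic sketch for the out-of-memory symptom discussed below: with `USE_NUMA=1`, ktransformers reportedly keeps one copy of the weights per NUMA node, and a DeepSeek-V3 Q4_K_M GGUF is roughly 400 GB, so two in-RAM copies cannot fit in 512 GB. The commands below only read standard Linux interfaces; `numastat` comes from the `numactl` package and may not be installed.

```shell
# Total RAM available on the box (two ~400 GB weight copies will not fit in 512 GB).
grep MemTotal /proc/meminfo

# Per-NUMA-node memory breakdown, if numastat (numactl package) is available.
command -v numastat >/dev/null && numastat -m | head -n 20 || true
```

Running `numastat -m` in a second terminal while the layers load should show whether resident memory grows on both nodes at once (the duplicated-weights case) or only on one.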
Environment
3× RTX 4090 GPUs, 512 GB DDR5 RAM, Ubuntu 22.04.
You have NUMA enabled; at runtime the model occupies 2× its size in memory. See this video https://www.bilibili.com/video/BV1kV8AzKEjJ/ : someone patched KT so that with NUMA enabled it keeps only one copy of the weights. Also, Q4 models are supported by FastLLM, which likewise keeps a single copy with NUMA enabled; if you don't want to keep wrestling with KT, FastLLM is worth trying.
That doesn't add up. I previously ran this on Windows 10 with 2× RTX 3090 and 512 GB of DDR4, on v0.23 at the time; memory wasn't doubled then, it was just too slow (2.5 tokens/s), which is why I moved to this new hardware and to Ubuntu.
I'm now trying to download that patch to see whether it gets things working.