
[Feature] EPYC performance optimization

Open · yeungtuzi opened this issue 8 months ago • 11 comments

Checklist

  • [x] 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise, it will be closed.
  • [x] 2. To help the community, I will use Chinese/English or attach a Chinese/English translation if using another language. Non-English/Chinese content without translation may be closed.

Motivation

In my tests, with comparable GPUs, a dual-socket Xeon 6330 with 1 TB RAM, a single-socket Threadripper 7975WX with 512 GB RAM, and a dual-socket EPYC 9654 with 1.5 TB RAM all produced similar generation speeds (10-12 tokens/s). In theory, however, these three systems are far apart in both compute performance and memory bandwidth, and the dual EPYC 9654 in particular should perform much better than the figures above.

A possible explanation is that EPYC's chiplet design effectively divides even a single CPU into four memory-access quadrants, so cross-quadrant accesses behave somewhat like NUMA. Even though the aggregate bandwidth is very high (24 channels of DDR5-5200 in total across the two sockets), could the real bottleneck be cross-quadrant and cross-socket memory access?

If the system is partitioned along these lines, a dual-socket machine would have to be treated as 8 NUMA nodes, and then no single NUMA node has enough memory to hold a complete copy of the weights. Is there a solution for this?
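
As a quick way to see how the OS actually carves the machine up, here is a minimal Python sketch (assuming a Linux sysfs layout under /sys/devices/system/node/) that lists each NUMA node with its local RAM and CPU list; on a dual-socket EPYC set to NPS4 in the BIOS it should report 8 nodes, versus 2 with NPS1.

```python
# Minimal sketch (assumes Linux sysfs): list the NUMA nodes the kernel
# exposes, with local RAM and CPU list for each. On a dual-socket EPYC
# set to NPS4 in the BIOS this should print 8 nodes; with NPS1 it prints 2.
import glob
import re

def numa_nodes():
    nodes = []
    for path in glob.glob("/sys/devices/system/node/node[0-9]*"):
        node_id = int(re.search(r"node(\d+)$", path).group(1))
        with open(f"{path}/meminfo") as f:
            mem_kb = int(re.search(r"MemTotal:\s+(\d+) kB", f.read()).group(1))
        with open(f"{path}/cpulist") as f:
            cpus = f.read().strip()
        nodes.append((node_id, mem_kb / 2**20, cpus))
    return sorted(nodes)

if __name__ == "__main__":
    for node_id, mem_gib, cpus in numa_nodes():
        print(f"node{node_id}: {mem_gib:.0f} GiB local RAM, CPUs {cpus}")
```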

One option I can think of is using a Q1S-quantized model so that every NUMA node can hold a full copy, but that is obviously not a reasonable solution. The other is the long-requested "tensor-parallel-like approach (CPU TP)", where each NUMA node handles only part of the weights and computation, just like GPU TP. Are there any ideas for tackling this?
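
To make the CPU TP idea concrete, below is a conceptual numpy sketch (illustrative dimensions and an 8-way split, not ktransformers code): column-sharding one weight matrix means each NUMA node stores and multiplies only 1/N of the weights, and the partial outputs are concatenated afterwards.

```python
# Conceptual sketch of "CPU TP": column-shard one weight matrix across NUMA
# nodes so each node stores and multiplies only 1/N of it, then concatenate
# the partial outputs. Dimensions are hypothetical; numpy only shows the math.
import numpy as np

num_nodes = 8                      # e.g. dual EPYC 9654 with NPS4 -> 8 nodes
d_in, d_out = 4096, 14336          # hypothetical FFN dimensions
rng = np.random.default_rng(0)

W = rng.standard_normal((d_in, d_out), dtype=np.float32)
x = rng.standard_normal((1, d_in), dtype=np.float32)

# Each "rank" holds only its column slice of W, i.e. 1/num_nodes of the bytes.
shards = np.array_split(W, num_nodes, axis=1)
partials = [x @ w for w in shards]           # each would run on its own node
y_tp = np.concatenate(partials, axis=1)

assert np.allclose(y_tp, x @ W, rtol=1e-4)
print(f"per-node weight MiB: {shards[0].nbytes / 2**20:.1f} "
      f"vs full: {W.nbytes / 2**20:.1f}")
```

A real implementation would additionally pin each shard's memory and worker threads to its NUMA node (e.g. with numactl or libnuma), and row-sharded layers would sum their partial results (all-reduce) instead of concatenating them.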

In fact, Intel processors are moving in the same direction: the Xeon 6980P is a typical example, with 3 compute tiles, each providing a 4-channel local memory controller, and accesses to that memory from the other tiles also incur a performance penalty.

Related resources

vLLM CPU backend

yeungtuzi · May 12 '25 14:05

https://docs.vllm.ai/en/v0.8.3/getting_started/installation/cpu.html

Supported features

vLLM CPU backend supports the following vLLM features:

Tensor Parallel

Model Quantization (INT8 W8A8, AWQ, GPTQ)

Chunked-prefill

Prefix-caching

FP8-E5M2 KV cache

On a CPU-based setup with NUMA enabled, memory access performance may be heavily impacted by the topology. For NUMA architectures, Tensor Parallel is an option for better performance.

Tensor Parallel is supported for serving and offline inference. In general, each NUMA node is treated as one GPU card. Below is an example command to enable Tensor Parallel = 2 for serving:

VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" vllm serve meta-llama/Llama-2-7b-chat-hf -tp=2 --distributed-executor-backend mp

For each thread id list in VLLM_CPU_OMP_THREADS_BIND, users should guarantee that the threads in the list belong to the same NUMA node.

Meanwhile, users should also take care of the memory capacity of each NUMA node. The memory usage of each TP rank is the sum of the weight shard size and VLLM_CPU_KVCACHE_SPACE; if it exceeds the capacity of a single NUMA node, the TP worker will be killed due to out-of-memory.
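
Before launching, it can help to sanity-check those two constraints numerically. The sketch below is a rough illustration under stated assumptions (the helper functions and numbers are hypothetical, not vLLM APIs): per-rank memory is estimated as the weight shard plus VLLM_CPU_KVCACHE_SPACE, and the VLLM_CPU_OMP_THREADS_BIND string is built with one contiguous core range per NUMA node.

```python
# Rough pre-flight check before launching the vLLM CPU backend with TP across
# NUMA nodes (helper names and numbers are hypothetical, not vLLM APIs):
# per-rank memory ~= total_weight_gib / tp + VLLM_CPU_KVCACHE_SPACE, and it
# must fit in one NUMA node's local RAM.
def fits_in_node(total_weight_gib, tp, kvcache_gib, node_ram_gib, headroom_gib=8):
    per_rank_gib = total_weight_gib / tp + kvcache_gib
    return per_rank_gib + headroom_gib <= node_ram_gib, per_rank_gib

def omp_threads_bind(cores_per_node, num_nodes):
    # Builds "0-31|32-63|..." with one contiguous core range per NUMA node,
    # assuming cores are numbered consecutively per node (verify with lscpu -e).
    return "|".join(
        f"{n * cores_per_node}-{(n + 1) * cores_per_node - 1}" for n in range(num_nodes)
    )

# Example: Llama-2-7b in fp16 (~14 GiB), TP=2, 40 GiB KV cache, 192 GiB per node.
ok, per_rank = fits_in_node(total_weight_gib=14, tp=2, kvcache_gib=40, node_ram_gib=192)
print(f"per-rank need ~= {per_rank:.0f} GiB, fits in one node: {ok}")
print("VLLM_CPU_OMP_THREADS_BIND =", omp_threads_bind(cores_per_node=32, num_nodes=2))
```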

yeungtuzi · May 12 '25 14:05

(Quoting the vLLM CPU backend documentation excerpt from the previous comment: https://docs.vllm.ai/en/v0.8.3/getting_started/installation/cpu.html)

Yes, you’re absolutely right — we do prefer the TP (Tensor Parallel) method. As modern CPUs and accelerators increasingly feature multiple NUMA nodes (or similar architectures), we’re actively exploring how to integrate Tensor Parallelism in our MoE (Mixture of Experts) implementations and even within MLA (for a future pure-CPU version).

If you’re interested in contributing, a great place to start is the MoE component. We’d greatly welcome your involvement—your expertise could make a valuable impact on our community.

KMSorSMS · May 13 '25 03:05

I'm honored by the warm reply. I haven't programmed in a long time, so directly contributing code would probably be difficult. I'll see what role I can play, for example bringing in some developers, or providing development/test environments and helping with testing.

yeungtuzi · May 14 '25 05:05

I'm honored by the warm reply. I haven't programmed in a long time, so directly contributing code would probably be difficult. I'll see what role I can play, for example bringing in some developers, or providing development/test environments and helping with testing.

No problem. We are already doing the corresponding development work ourselves; if more development and test environments become available later, that would of course be great.

KMSorSMS · May 14 '25 06:05

Sure. I currently have 9654 and 9565 environments on hand; feel free to reach out whenever you need them.

yeungtuzi · May 14 '25 07:05

Sure. I currently have 9654 and 9565 environments on hand; feel free to reach out whenever you need them.

Great, thank you very much 🙏

KMSorSMS · May 14 '25 07:05

Hi guys, has there been any progress towards multi-NUMA tensor parallel? I see sglang recently implemented this feature ( https://lmsys.org/blog/2025-07-14-intel-xeon-optimization/ ) but they do not have the partial offload solution that makes ktransformers so attractive.

aikitoria · Jul 29 '25 19:07

Hi guys, has there been any progress towards multi-NUMA tensor parallel? I see sglang recently implemented this feature ( https://lmsys.org/blog/2025-07-14-intel-xeon-optimization/ ) but they do not have the partial offload solution that makes ktransformers so attractive.

Actually, we have already done the TP part for multi-NUMA, but we decided to release it together with some other features in the future. Thanks for your interest and the information. :relaxed:

KMSorSMS · Jul 30 '25 12:07

Awesome, looking forward to it being released!

aikitoria · Jul 30 '25 12:07

When you say in future, do you already have an estimate when this update will come?

aikitoria · Jul 30 '25 13:07

When you say in future, do you already have an estimate when this update will come?

Well, not yet. Since it depends on other team members' work, my guess is that it will come within the next 2 months.

KMSorSMS · Jul 31 '25 02:07