[Bug] Why does dual-CPU give no speedup?
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise, it will be closed.
- [x] 5. To help the community, I will use Chinese/English or attach a Chinese/English translation if using another language. Non-Chinese/English content without translation may be closed.
Describe the bug
Decode speed (tokens/s) on a dual-CPU server is the same as on a single CPU.
Reproduction
I bought a new dual-CPU server with AMX support and installed with libnuma-dev and USE_NUMA=1, but the speed did not go up; it is always ≈ 13 tokens/s.
While it is running, I see RAM usage go from 400G in the single-CPU setup to 800G in the dual-CPU setup, and the cores on both CPUs are working. But the decode speed is the same as with one CPU and 512G of RAM.
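To confirm that the OS really exposes both sockets as separate NUMA nodes, something like the following could be run on the server. This is only a minimal sketch, assuming a Linux host where each node appears under `/sys/devices/system/node/nodeN/meminfo`; the function name `numa_mem_total_gib` is made up for illustration and is not part of KTransformers.

```python
import re
from pathlib import Path

# Assumption: Linux exposes one directory per NUMA node under this sysfs path.
SYSFS_NODE_DIR = Path("/sys/devices/system/node")


def numa_mem_total_gib(node_dir=SYSFS_NODE_DIR):
    """Return {node_id: MemTotal in GiB} for every NUMA node under node_dir."""
    totals = {}
    for meminfo in sorted(node_dir.glob("node[0-9]*/meminfo")):
        # Directory names look like "node0", "node1", ...
        node_id = int(meminfo.parent.name[len("node"):])
        # Per-node meminfo lines look like "Node 0 MemTotal:  528000000 kB".
        match = re.search(r"MemTotal:\s+(\d+)\s+kB", meminfo.read_text())
        if match:
            totals[node_id] = int(match.group(1)) / (1024 * 1024)
    return totals


if __name__ == "__main__":
    for node_id, gib in sorted(numa_mem_total_gib().items()):
        print(f"NUMA node {node_id}: {gib:.1f} GiB")
```

On a dual-socket machine with NUMA enabled, two entries would be expected, each with roughly half of the installed RAM.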
Environment
CPU: 6416H * 2
RAM: 64G DDR5 * 16 = 1T
GPU: RTX 3090
KTransformers tag: v0.3.2
Is it possible that the bottleneck is now the GPU? The 3090 is quite close to the V100 in terms of FLOPS. Just a guess.
However, isn't the computation in KTransformers done entirely by the CPU? The GPU is only used for its video memory. When I was running decode, the CPU was at 100% while the GPU was only at 30%, and its fan wasn't even spinning.