
AMD EPYC 9654 is not optimized for max speed

netspym opened this issue 1 year ago • 11 comments

I have an AMD EPYC 9654, which has 96 cores / 192 threads. When running llama.cpp's main with Yi-34B-Chat Q4, the peak inference speed tops out at around 60 threads. Setting more threads in the command only slows things down.

It looks like CPUs with more than 64 cores are not fully utilized; there could be more speed left to unlock.
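For reference, this is roughly how I vary the thread count (the model file name below is a placeholder; -t sets the number of threads used for inference):

```
# Placeholder GGUF file name; -t sets the inference thread count
./main -m yi-34b-chat.Q4_K_M.gguf -p "Hello" -n 128 -t 60    # near the sweet spot
./main -m yi-34b-chat.Q4_K_M.gguf -p "Hello" -n 128 -t 192   # oversubscribed, slower in practice
```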

Warm regards, Yuming

netspym • Apr 02 '24 06:04

You also need to take RAM bandwidth limitations into account.

kaetemi • Apr 02 '24 06:04

It's the same on my 8-core system: 6 threads give better speed than using all 8. I think this is normal.

supportend • Apr 02 '24 07:04

96 cores / 192 threads ... the peak inference speed tops out at around 60 threads

This sounds normal. The CPU may be over-saturated; see the token generation performance tips in the README.
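One way to confirm this empirically is a thread sweep with the llama-bench tool that ships with llama.cpp (a sketch only; the model path is a placeholder):

```
# Placeholder model path; -t takes a comma-separated list of thread counts,
# -p/-n set the prompt-processing and generation token counts for the benchmark
./llama-bench -m yi-34b-chat.Q4_K_M.gguf -t 16,32,48,64,96,192 -p 512 -n 128
```

The table it prints makes it easy to see where prompt-processing and generation throughput peak.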

Jeximo • Apr 02 '24 12:04

My system's memory bandwidth is 460 GB/s, with 12 sticks of RAM installed. It only runs LLMs as fast as an M2 Ultra, the same speed as an Nvidia P40 24GB (which goes for only about $150 USD on the market)...

netspym • Apr 04 '24 04:04

I have an AMD EPYC 9654, which has 96 cores / 192 threads. When running llama.cpp's main with Yi-34B-Chat Q4, the peak inference speed tops out at around 60 threads. Setting more threads in the command only slows things down.

Ah, a fellow Epyc user. Could you share llama.cpp statistics from an example run of the model? I will compare them with mine. I also have 12 sticks of RAM installed, but my CPU has only 32 cores (Epyc 9374F).

fairydreaming • Apr 05 '24 16:04

@fairydreaming @netspym I was wondering if you could get better performance on the same host by dividing the load into groups of 6 CPU cores each and running them with mpirun.

Has anyone tried this setup before and gotten better performance? Any statistics to share?

ouvaa • Apr 14 '24 19:04

I don't think the performance is THAT bad. My platform is an Epyc 9374F on an Asus K14PA-U12 motherboard with 12 x Samsung 32GB 2Rx8 4800MHz DDR5 RDIMM modules (M321R4GA3BB6-CQK). The system has 460.8 GB/s of theoretical memory bandwidth, and with the settings below I get a memory bandwidth utilization (MBU) of about 60% of that theoretical value, which is not terrible.
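For context, that 460.8 GB/s figure is simply the textbook peak for 12 channels of DDR5-4800, at 8 bytes per transfer per channel:

```
# 12 channels * 4800 MT/s * 8 bytes per transfer = 460800 MB/s = 460.8 GB/s
echo $((12 * 4800 * 8))
```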

I changed the following settings in BIOS:

  • set NUMA Nodes per Socket to NPS4
  • enabled ACPI SRAT L3 Cache as NUMA Domain (this increased the number of NUMA nodes to 8)

On Linux I did the following (consolidated into a shell sketch after this list):

  • disabled numa balancing with echo 0 > /proc/sys/kernel/numa_balancing
  • added --numa distribute to llama.cpp options
  • set the number of threads in llama.cpp to 32 with -t 32 (adding more threads hurts the performance)
  • before loading large models I free the cache memory with: echo 3 > /proc/sys/vm/drop_caches
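Put together, the Linux side looks roughly like this (the echo commands need root; the model file is the one benchmarked below):

```
# Disable automatic NUMA page balancing (run as root)
echo 0 > /proc/sys/kernel/numa_balancing

# Drop the page cache before loading a large model, so pages cached from a
# previous run don't keep the model pinned to the wrong NUMA nodes
echo 3 > /proc/sys/vm/drop_caches

# Run llama.cpp with NUMA-aware placement and 32 threads
./main -m llama-2-70b-chat.Q8_0.gguf --numa distribute -t 32 -p "Hello" -n 128
```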

With this I get a prompt eval speed of 12.70 tokens per second and an eval (generation) speed of 3.93 tokens per second on the llama-2-70b-chat.Q8_0.gguf model. Log attached. I would be grateful for any advice on how to further increase the performance.
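The ~60% MBU figure follows from the usual rule of thumb that each generated token streams essentially the whole model from RAM; assuming the Q8_0 70B file is roughly 73 GB:

```
# eval t/s * model size (GB) / theoretical bandwidth (GB/s) ≈ MBU
awk 'BEGIN { print 3.93 * 73 / 460.8 }'   # ≈ 0.62, i.e. about 60%
```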

main.log

fairydreaming • Apr 15 '24 07:04

@fairydreaming have you tried running multiple mpirun instances on the same host, each using 6-8 CPU cores? I think that should be much faster, right? The "optimum" seems to be around 6 threads, so I was wondering if you have tried mpirun. That would mean splitting into 5 segments for testing; it should be about 3x faster, so you should get around 12 tok/s on eval time.

P.S.: after seeing what you have, I'm kind of disappointed with the performance you get. Your system is very powerful, but it's limited by the bandwidth, so I was wondering if it has something to do with the "12 channel" limit or whatever. 60% of the theoretical value is terrible; I think 80% would be acceptable.

I was disappointed because I was about to buy something similar, but it's pricey for such low performance.

I was thinking about getting this board, and I wonder if it's limited by the "12 channel" thing: https://www.gigabyte.com/Enterprise/Server-Motherboard/MZ33-AR0-rev-1x

ouvaa • Apr 17 '24 08:04

@fairydreaming have you tried running multiple mpirun instances on the same host, each using 6-8 CPU cores? I think that should be much faster, right? The "optimum" seems to be around 6 threads, so I was wondering if you have tried mpirun. That would mean splitting into 5 segments for testing; it should be about 3x faster, so you should get around 12 tok/s on eval time.

@ouvaa, I don't quite understand what you mean. Do you suggest running multiple instances of llama.cpp, each on a separate set of CPU cores? I ran such a test, but used numactl instead of mpirun. I ran 8 instances of llama.cpp with the 13B llama-2-chat Q8 model in parallel; each instance ran on a separate NUMA node, so it used a single CCD with 4 physical cores (no SMT). Note that in this configuration there are 8 separate copies of the model in memory.

The result is that each instance ran with a prompt eval speed of about 9 t/s, while generation was around 3.17 t/s. If my calculations are correct, this corresponds to an MBU value of 100 * (3.11 + 3.19 + 3.14 + 3.15 + 3.18 + 3.21 + 3.15 + 3.21) / (460.3 / 13) = 71.57%. This is nice, but I'm not sure what the point would be; if you want to run multiple separate workloads like this, you could simply use 8 separate Ryzen 8000 machines, each with 2-channel RAM. That would be even faster and cheaper.
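The pinning looked roughly like this (a sketch; the model file name, prompt, and output paths are placeholders, and -t 4 matches the 4 physical cores per NUMA node):

```
# One llama.cpp instance per NUMA node (8 nodes with L3-as-NUMA enabled),
# CPU and memory both bound to that node, 4 threads each
for node in 0 1 2 3 4 5 6 7; do
  numactl --cpunodebind=$node --membind=$node \
    ./main -m llama-2-13b-chat.Q8_0.gguf -t 4 -n 256 -p "Hello" > node$node.log 2>&1 &
done
wait
```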

P.S.: after seeing what you have, I'm kind of disappointed with the performance you get. Your system is very powerful, but it's limited by the bandwidth, so I was wondering if it has something to do with the "12 channel" limit or whatever. 60% of the theoretical value is terrible; I think 80% would be acceptable.

I simply use what's currently available. But if you know of better hardware for running large LLMs in this price range, I'm all ears.

fairydreaming • Apr 17 '24 10:04

@fairydreaming basically I thought running within the same host should be faster than going over a 1 Gbps link between multiple hosts.

71.57% is more efficient than the 60% you mentioned, but it's still super slow at 3.17 t/s. That's really, really slow.

Yes, it's cheaper to run a large LLM, but at an excruciating speed; it makes me wonder if it's even worth running. (I'd expect something like 12.5 t/s to be worth the price for a large LLM.)

Like everyone else, I'm trying to squeeze out the best bang for the buck. Honestly, I'd want something that can do 25.8 t/s before I'd consider buying the equipment.

Anyhow, now with the 281GB Mixtral, I'm thinking about how to really get it to work.

Do mention it from time to time if you find optimization strategies for your setup.

Thanks for sharing.

ouvaa • Apr 18 '24 09:04

@fairydreaming basically I thought running within the same host should be faster than going over a 1 Gbps link between multiple hosts.

71.57% is more efficient than the 60% you mentioned, but it's still super slow at 3.17 t/s. That's really, really slow.

Yeah, but there were 8 independent instances running at the same time, so overall it was 8 * 3.17 = 25.36 t/s. But if you run one large model instead of multiple independent models, the efficiency goes down to 60%; I guess there is some overhead caused by communication between CCDs in the CPU.

Anyhow, now with the 281GB Mixtral, I'm thinking about how to really get it to work.

I get over 6 t/s on Mixtral 8x22B (Q8_0) on my Epyc 9374F. Too bad that @netspym didn't share his speed results; maybe they are better on the EPYC 9654 (it has 12 CCDs, so there is more bandwidth between the cores and the memory controller compared to my CPU's 8 CCDs). But the price...

fairydreaming • Apr 18 '24 10:04

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] • Jun 02 '24 01:06