
[WIP] Improve performance on x86

TheSteveHan opened this issue 1 year ago · 5 comments

Could someone please take over this pull request?

Unfortunately, I'm quite behind on a few other obligations, so I won't be able to continue exploring here. Feel free to take this as inspiration and make a production-ready version!


I did some initial exploration of various ways to squeeze more performance out of the main loop on my Ubuntu desktop with an i7-7700K CPU.

The code was compiled with gcc-10 and invoked with `./main -m ./models/7B/ggml-model-q4_0.bin -s 1679164839 -n 1280`

Since inference is usually memory bound, I specifically looked for ways to improve memory access. It seems like a combination of prefetching + CPU pinning + loop unrolling can improve performance by up to ~25%.
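Below is a minimal C sketch (not the actual PR diff) of the three techniques combined, assuming a hypothetical q4_0-style layout of one float scale per block plus 32 4-bit weights packed into 16 bytes; all names and the block geometry are illustrative:

```c
#define _GNU_SOURCE
#include <sched.h>    // sched_setaffinity, cpu_set_t (Linux-only)
#include <stddef.h>

// Pin the calling thread to one core so its working set stays in that
// core's private caches.
static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);
}

// Dot product of packed 4-bit weights w (16 bytes per block, one scale per
// block in s) against float activations y, with a fixed-distance prefetch
// and a manual 4x unroll. nb (number of blocks) must be a multiple of 4.
static float dot_q4_sketch(const unsigned char *w, const float *s,
                           const float *y, size_t nb) {
    float sum = 0.0f;
    for (size_t i = 0; i < nb; i += 4) {
        // Prefetch 32 blocks ahead into all cache levels (locality hint 3);
        // 32 matches the hard-coded distance discussed below -- tune per CPU.
        __builtin_prefetch(w + (i + 32) * 16, /*rw=*/0, /*locality=*/3);
        for (size_t j = 0; j < 4; j++) {          // manual 4x unroll
            const unsigned char *b = w + (i + j) * 16;
            const float         *v = y + (i + j) * 32;
            float bsum = 0.0f;
            for (int k = 0; k < 16; k++) {        // 32 4-bit weights per block
                bsum += ((b[k] & 0x0F) - 8) * v[2*k]
                      + ((b[k] >>   4) - 8) * v[2*k + 1];
            }
            sum += s[i + j] * bsum;
        }
    }
    return sum;
}
```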

[Screenshot: benchmark output, Mar 18 2023]

The changes here are only tested on my machine, and I suspect the code won't even compile on other platforms.

TheSteveHan avatar Mar 19 '23 14:03 TheSteveHan

This worked with gcc 11.3 but gave only a slight improvement. I've been looking at that function as well since it's clearly the hotspot. I was playing around with the AVX2 instructions but it seems to be pretty much memory-bound.

I tried using `_mm256_maddubs_epi16` as described here, but didn't see a consistent improvement.
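For context, that intrinsic multiplies unsigned bytes by signed bytes and sums adjacent pairs into 16-bit lanes; a common follow-up is a madd against ones to widen to 32-bit accumulators. A minimal sketch (operand roles matter: the first argument is the unsigned one):

```c
#include <immintrin.h>

// Pairwise u8*s8 products summed into i16 lanes, then widened to i32.
// 'u' holds unsigned values (e.g. 4-bit nibbles, 0..15), 's' signed bytes.
static inline __m256i mul_sum_u8_s8(__m256i u, __m256i s) {
    __m256i p16 = _mm256_maddubs_epi16(u, s);             // u8*s8 -> paired i16
    return _mm256_madd_epi16(p16, _mm256_set1_epi16(1));  // i16 pairs -> i32
}
```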

sw avatar Mar 19 '23 20:03 sw

> This worked with gcc 11.3 but gave only a slight improvement. I've been looking at that function as well since it's clearly the hotspot. I was playing around with the AVX2 instructions but it seems to be pretty much memory-bound.
>
> I tried using `_mm256_maddubs_epi16` as described here, but didn't see a consistent improvement.

The hard-coded 32 in the prefetch distance is quite arbitrary; I wonder if a different number would work better on your machine.

As you reach the limits of your machine, other things you have running start to add more variation to the measured performance, so you may have to run it multiple times and compare the best results. The original code runs very consistently at ~425±5 ms/token for me, whereas the modified version varies between 340 and 380 ms/token across runs.
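One way to experiment with that, sketched under the same assumptions as above (the macro name is made up), is to turn the distance into a build-time knob rather than a hard-coded 32:

```c
// Hypothetical tuning knob: build with e.g. -DPREFETCH_DIST_BLOCKS=16.
#ifndef PREFETCH_DIST_BLOCKS
#define PREFETCH_DIST_BLOCKS 32   // blocks ahead; tune per CPU and cache size
#endif

// In the hot loop (16-byte blocks as in the sketch above):
// __builtin_prefetch(w + (i + PREFETCH_DIST_BLOCKS) * 16, 0, 3);
```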

TheSteveHan avatar Mar 19 '23 23:03 TheSteveHan

I was independently trying to do something similar on the Q4_1 code here. I managed to squeeze out somewhere around 5% more performance by rearranging the SIMD math and avoiding a double load on the constant offsets, but saw no improvements from prefetching anything on my setup (a Skylake mobile Xeon, GCC 11).
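As a rough illustration of that kind of rearrangement (not the actual Q4_1 code, whose layout differs): broadcast each block's scale and offset into registers once and reuse them for both halves of the block, instead of reloading the constants:

```c
#include <immintrin.h>

// Illustrative only: per-block constants d (scale) and m (offset) are
// loaded and broadcast once, then reused for both half-blocks.
static inline __m256 q4_1_acc_sketch(const float *d, const float *m,
                                     __m256 qlo, __m256 qhi, __m256 acc) {
    const __m256 vd = _mm256_broadcast_ss(d);   // scale, loaded once
    const __m256 vm = _mm256_broadcast_ss(m);   // offset, loaded once
    // x = d*q + m for each half, reusing vd/vm (FMA, Haswell and later)
    acc = _mm256_add_ps(acc, _mm256_fmadd_ps(vd, qlo, vm));
    return _mm256_add_ps(acc, _mm256_fmadd_ps(vd, qhi, vm));
}
```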

blackhole89 avatar Mar 21 '23 22:03 blackhole89

> but saw no improvements from prefetching anything on my setup (a Skylake mobile Xeon, GCC 11).

I took a quick look at the wiki for the Skylake mobile Xeon; it looks like the L3 cache size there (8 MB) is less than the L1 cache (13 MB) listed for this i7-7700K desktop chip. The prefetch distance in this PR might be way too far ahead for your chip? It's also prefetching into L1 here; you might have better luck prefetching into L3 given the smaller cache size.
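For reference, the target cache level is selectable via the prefetch hint; a small sketch with the SSE intrinsic (`dist` being the tunable byte distance):

```c
#include <stddef.h>
#include <xmmintrin.h>

// _MM_HINT_T0 pulls the line into all cache levels (toward L1);
// _MM_HINT_T2 keeps it at more distant levels (roughly L3), which may
// suit chips with smaller caches.
static inline void prefetch_toward_l1(const void *p, size_t dist) {
    _mm_prefetch((const char *)p + dist, _MM_HINT_T0);
}
static inline void prefetch_toward_l3(const void *p, size_t dist) {
    _mm_prefetch((const char *)p + dist, _MM_HINT_T2);
}
```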

TheSteveHan avatar Mar 21 '23 23:03 TheSteveHan

Personally, I noticed an improvement on a 10700KF on Windows 10.

From 270 ms to 241 ms per token on 13B Alpaca, although the only part I took from this commit was the main-loop modification, since there's no `#include <sched.h>` on Windows, and I assume the thread code would need adjustment to work on Windows as well.

Performance gains could probably be bigger on 30B and 65B models, as well as if I got the thread-affinity stuff going on Windows.
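For what it's worth, the Windows counterpart to the PR's `sched_setaffinity` call would be an affinity mask on the thread handle; a minimal sketch:

```c
#ifdef _WIN32
#include <windows.h>

// Pin the current thread to one logical CPU -- Windows equivalent of the
// Linux sched_setaffinity() pinning used in the PR.
static void pin_to_cpu(int cpu) {
    SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)1 << cpu);
}
#endif
```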

x02Sylvie avatar Mar 25 '23 17:03 x02Sylvie

Please reopen when and if this is ready to merge

ggerganov avatar Apr 13 '23 12:04 ggerganov