
Optimize locking behavior

Open janekb04 opened this issue 1 year ago • 6 comments

Port changes from whisper.cpp PR #659

janekb04 avatar Apr 06 '23 14:04 janekb04

I haven't had time to look into the details yet and test this on my Mac. In the meantime, it would be nice to get some reports from other people on how the token generation time is affected by these changes

ggerganov avatar Apr 13 '23 13:04 ggerganov

I ran a perf test on a Windows 10, AVX, 10c20t box: Figure_1

howard0su avatar Apr 16 '23 15:04 howard0su

@howard0su Thanks for doing the benchmark. The results seem interesting to me.

It looks like there is a general improvement when using up to 10 threads and a degradation beyond that. I think I know why this is the case. There are two possible explanations:

  1. You don't have HyperThreading enabled (but you probably do)
  2. This code isn't suitable for HyperThreading.

To determine whether a piece of code can be sped up by HyperThreading, let's look at what it actually does. HyperThreading was introduced by Intel to optimize the usage of the CPU's execution engines. Most people think that the CPU just executes instructions one by one. Some people know that it uses pipelining and out-of-order execution. But fewer still know that a CPU core isn't a single "calculator". Inside the core there are separate "execution units" or "ports" that are highly specialized and dedicated to specific tasks (see EU in this diagram). For instance, one piece of hardware handles simple integer arithmetic, another handles division, another handles floats, and so on. Let's also keep in mind that Intel and Windows are designed to optimize for the "general case".

So let's imagine taking a random program. What does it do? How does it use the CPU? Well, this program can be written in any language, including those interpreted at runtime. Usually, programs spend most of their time not on actual computation, but on fetching data from memory. They also tend to use each of the execution units roughly equally, intermixing integer and floating-point operations. So what did Intel come up with? To take advantage of this unused CPU capacity, they allowed the OS to run two threads on the same core at once. Since it is improbable that either thread fully utilizes the core, running them in parallel increases throughput - though not latency.

Now, in the case of llama.cpp, the story is different. Each thread is constantly doing heavy floating-point calculations. While a given thread is running, it keeps the floating-point and SIMD execution units busy at 100%. But that's OK. This is what HyperThreading is for: the OS will run a second thread in parallel that will use the other execution units... except it won't. The other thread will most likely also be a llama.cpp thread that uses the CPU's floating-point engines at 100%. So what happens in reality is that these two threads contend for the same execution units, which naturally causes various stalls and waits (at the microarchitectural level). Additionally, there will be many context switches (as there are more threads than cores), whereas if there were as many threads as cores, each thread could run undisturbed for longer. A context switch is detrimental to performance because it effectively invalidates the L1 and L2 caches, the register files, etc.: the new thread will work on its own matrix/tensor and will most likely evict the old one.

Why is the performance worse for 15 threads than for 20? My guess is that 15 is harder for the scheduler. 20 threads is exactly 2x the core count, so the scheduler is probably optimized for this case and somehow manages to maintain roughly the performance of 10 threads. Meanwhile, 15 is an odd in-between number, so the scheduler probably schedules the threads in an essentially random order, arbitrarily interrupting computations and shuffling threads around.

janekb04 avatar Apr 17 '23 08:04 janekb04

@ggerganov Should this be added to the improve threading implementation project?

janekb04 avatar Apr 17 '23 08:04 janekb04

I have a 10c20t E5 CPU. So, 10 is definitely a magic number.

howard0su avatar Apr 17 '23 13:04 howard0su

@howard0su But 10 is the number of cores and half the number of threads. A 10c20t E5 CPU has 10 cores and 20 threads...

janekb04 avatar Apr 17 '23 14:04 janekb04