llama.cpp
Use a thread pool to schedule the work
With the current code I cannot fully utilize all CPU cores. Based on PR #710, this change:
- Removes the finalizer
- Uses a technique similar to PR #850
- Optimizes the thread pool itself to avoid malloc/free on the critical path
Additional ideas to improve this PR:
- Use a lock-free queue to replace the mutex, and use the condition variable only to wake threads when new work arrives; we may not even need this thread pool implementation at all (see the sketch after this list).
- Test more with BLAS to see if it helps, and tune the settings for when BLAS is used.
- Divide the work across n threads more intelligently; n should not always be the core count, but should depend on the size of the parameter tensors.
- Topologically sort the graph. The current challenge is that the graph is dynamic, but it is largely static except for some parameters, so we need to evaluate how quickly a topo sort can be done (a minimal sketch also follows below).
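As a rough illustration of the lock-free idea (hypothetical names, not code from this PR): the condition variable is kept only for parking and waking idle workers, while claiming a task on the hot path is a single atomic fetch_add with no lock; the preallocated task array also addresses the malloc/free point from the first list above. A minimal C11 sketch, with init/teardown omitted:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_TASKS 256

typedef struct { void (*fn)(void *); void *arg; } task_t;

typedef struct {
    task_t          tasks[MAX_TASKS]; // preallocated: no malloc on the hot path
    int             n_tasks;          // size of the current batch
    atomic_int      next;             // next unclaimed task index
    atomic_int      n_done;           // tasks finished in the current batch
    atomic_int      generation;      // bumped once per published batch
    atomic_bool     stop;
    pthread_mutex_t mtx;              // guards sleeping/waking only
    pthread_cond_t  cv;
} pool_t;

static void *worker(void *p_) {
    pool_t *pool = p_;
    int seen = 0;
    for (;;) {
        // cold path: park until a new batch (or shutdown) is signalled
        pthread_mutex_lock(&pool->mtx);
        while (!atomic_load(&pool->stop) && atomic_load(&pool->generation) == seen) {
            pthread_cond_wait(&pool->cv, &pool->mtx);
        }
        pthread_mutex_unlock(&pool->mtx);
        if (atomic_load(&pool->stop)) return NULL;
        seen = atomic_load(&pool->generation);

        // hot path: claim tasks lock-free until the batch is drained
        for (;;) {
            int i = atomic_fetch_add(&pool->next, 1);
            if (i >= pool->n_tasks) break;
            pool->tasks[i].fn(pool->tasks[i].arg);
            atomic_fetch_add(&pool->n_done, 1);
        }
    }
}

// Publisher (main thread): fill pool->tasks[0..n_tasks-1], then release the batch.
// A production version must also guard against stragglers from the previous
// batch before reusing the arrays.
static void pool_run(pool_t *pool, int n_tasks) {
    pool->n_tasks = n_tasks;
    atomic_store(&pool->n_done, 0);
    atomic_store(&pool->next, 0);
    pthread_mutex_lock(&pool->mtx);
    atomic_fetch_add(&pool->generation, 1); // publish after tasks are in place
    pthread_cond_broadcast(&pool->cv);
    pthread_mutex_unlock(&pool->mtx);
    while (atomic_load(&pool->n_done) < n_tasks) { /* spin, or help with tasks */ }
}
```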
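And a hypothetical sketch of the topo-sort idea: Kahn's algorithm over the node graph is O(nodes + edges), so even re-sorting a few-thousand-node graph on every eval should be cheap; caching the order and re-checking only the dynamic parts would be cheaper still. The edge-list input format here is illustrative, not ggml's actual graph representation:

```c
#include <stdlib.h>
#include <string.h>

// edges[e][0] -> edges[e][1]; fills order[0..n-1] with a valid topological
// order; returns -1 if the graph has a cycle.
static int topo_sort(int n, int m, const int (*edges)[2], int *order) {
    int *indeg = calloc(n, sizeof(int));
    int *off   = calloc(n + 1, sizeof(int));          // CSR offsets
    int *adj   = malloc((size_t) m * sizeof(int));    // CSR adjacency
    int *cur   = malloc((size_t) n * sizeof(int));
    for (int e = 0; e < m; e++) { indeg[edges[e][1]]++; off[edges[e][0] + 1]++; }
    for (int v = 0; v < n; v++) off[v + 1] += off[v]; // prefix sum
    memcpy(cur, off, (size_t) n * sizeof(int));
    for (int e = 0; e < m; e++) adj[cur[edges[e][0]]++] = edges[e][1];
    int head = 0, tail = 0;                           // `order` doubles as the queue
    for (int v = 0; v < n; v++) if (indeg[v] == 0) order[tail++] = v;
    while (head < tail) {
        int v = order[head++];
        for (int i = off[v]; i < off[v + 1]; i++)
            if (--indeg[adj[i]] == 0) order[tail++] = adj[i];
    }
    free(indeg); free(off); free(adj); free(cur);
    return tail == n ? 0 : -1;
}
```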
@ggerganov for comments.
@howard0su: I think you may have added some commits accidentally? There's already a PR for 997c749, I believe: #809
It helps with my local debugging. I will revert it from this PR once the PR is in better shape.
In the meantime, please help review that PR and merge it if it looks good.
Calling for some testing: the current change shows both performance and energy improvements. However, my devbox is a 10-core/20-thread E5 without AVX2, which is not a very typical config, so I need some help validating the performance.
This is a test result on my machine. Eval time still shows some regression. I believe this is because during eval (as opposed to prompt eval) the token batch is 1, so each thread's share of the work is smaller and the scheduling overhead has a bigger relative impact on performance.
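As a purely illustrative model (none of these numbers are measured): suppose dispatching a graph node to the pool costs a fixed ~5 µs per thread, while the per-thread compute per node is ~500 µs for a long prompt batch but only ~1 µs for a single token. During prompt eval the dispatch overhead is about 1% and invisible; during single-token eval it is several times the useful work, which would explain the regression showing up only there.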
I'll look into the threading work in more detail after I finish with the quantization improvement efforts and some other pending stuff.
But at the moment, I can immediately say that these extra thpool files cannot stay like this. At some point in the past, I had implemented a thread pool inside ggml.c and it was very compact. If we really do need a thread pool, it should be implemented in a minimalistic way inside ggml.c.
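For reference, a minimalistic in-ggml.c pool along those lines could be little more than a pair of atomics that persistent workers spin on. This is only a hedged sketch of that shape (the `compute` callback and all names are hypothetical), not the original compact implementation:

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

typedef struct spin_pool {
    atomic_int seq;       // bumped by the main thread to publish a new node
    atomic_int n_active;  // workers still computing the current node
    atomic_int stop;
    int        n_threads;
    void     (*compute)(int ith, int nth, void *params); // hypothetical op kernel
    void      *params;    // current node's parameters
} spin_pool;

typedef struct { spin_pool *pool; int ith; } worker_arg;

static void *spin_worker(void *arg_) {
    worker_arg *arg = arg_;
    spin_pool  *p   = arg->pool;
    int seen = 0;
    while (!atomic_load(&p->stop)) {
        if (atomic_load(&p->seq) == seen) { sched_yield(); continue; } // busy-wait
        seen = atomic_load(&p->seq);
        p->compute(arg->ith, p->n_threads, p->params); // this thread's slice
        atomic_fetch_sub(&p->n_active, 1);             // report completion
    }
    return NULL;
}

// Main thread, per graph node (thread 0 participates in the work):
//   p->params = node;
//   atomic_store(&p->n_active, p->n_threads - 1);
//   atomic_fetch_add(&p->seq, 1);                        // release the workers
//   p->compute(0, p->n_threads, p->params);
//   while (atomic_load(&p->n_active) > 0) sched_yield(); // spin-join
```

The trade-off versus condition variables is burned CPU while idle in exchange for near-zero dispatch latency per node, which matters most in the single-token eval case discussed above.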
I ran the benchmark script on an M1 Mac, going up to 10 threads; orange is threadpool, blue is master at the commit where the branch was forked (eeaa7b0492fc79baab8bb1fe195d6c87159f2bd3).
token time:
I can't explain why we don't see the Windows improvements here, except at 10 threads, where the thread pool is better.
| threads | threadpool | master  |
|--------:|-----------:|--------:|
| 4       | 85.79      | 82.356  |
| 6       | 60.352     | 57.628  |
| 8       | 50.604     | 47.834  |
| 10      | 60.008     | 140.958 |
It may be related to the pthread functions behaving differently there. Do you mind checking whether switching to a lock-free queue helps?
When using up all threads, my testing also shows the thread pool is significantly better, but the overall time is still not lower than with max-2 threads.