llama.cpp
Use a thread pool to schedule the work
With the current code I cannot fully utilize all CPU cores. Based on PR #710, this change:
- Removes the finalizer
- Uses a technique similar to PR #850
- Optimizes the thread pool itself to avoid malloc/free on the critical path
Additional ideas to improve this PR:
- Use a lock-free queue to replace the mutex, and use the condition variable only to wake threads when new work arrives; we may not even need this thread pool implementation at all (see the sketch after this list).
- Test more with BLAS to see if it helps, and tune the settings for when BLAS is used.
- Divide the work across n threads more intelligently; n should not always be the core count, but should depend on the size of the parameter tensors.
- Topologically sort the graph. The current challenge is that the graph is dynamic, but it is largely static except for some parameters, so we need to evaluate how quickly a topo sort can be done (a minimal sketch also follows below).
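As a rough illustration of the lock-free idea (hypothetical names, not code from this PR): the condition variable is kept only for parking and waking idle workers, while claiming a task on the hot path is a single atomic fetch_add with no lock; the preallocated task array also addresses the malloc/free point from the first list above. A minimal C11 sketch, with init/teardown omitted:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_TASKS 256

typedef struct { void (*fn)(void *); void *arg; } task_t;

typedef struct {
    task_t          tasks[MAX_TASKS]; // preallocated: no malloc on the hot path
    int             n_tasks;          // size of the current batch
    atomic_int      next;             // next unclaimed task index
    atomic_int      n_done;           // tasks finished in the current batch
    atomic_int      generation;      // bumped once per published batch
    atomic_bool     stop;
    pthread_mutex_t mtx;              // guards sleeping/waking only
    pthread_cond_t  cv;
} pool_t;

static void *worker(void *p_) {
    pool_t *pool = p_;
    int seen = 0;
    for (;;) {
        // cold path: park until a new batch (or shutdown) is signalled
        pthread_mutex_lock(&pool->mtx);
        while (!atomic_load(&pool->stop) && atomic_load(&pool->generation) == seen) {
            pthread_cond_wait(&pool->cv, &pool->mtx);
        }
        pthread_mutex_unlock(&pool->mtx);
        if (atomic_load(&pool->stop)) return NULL;
        seen = atomic_load(&pool->generation);

        // hot path: claim tasks lock-free until the batch is drained
        for (;;) {
            int i = atomic_fetch_add(&pool->next, 1);
            if (i >= pool->n_tasks) break;
            pool->tasks[i].fn(pool->tasks[i].arg);
            atomic_fetch_add(&pool->n_done, 1);
        }
    }
}

// Publisher (main thread): fill pool->tasks[0..n_tasks-1], then release the batch.
// A production version must also guard against stragglers from the previous
// batch before reusing the arrays.
static void pool_run(pool_t *pool, int n_tasks) {
    pool->n_tasks = n_tasks;
    atomic_store(&pool->n_done, 0);
    atomic_store(&pool->next, 0);
    pthread_mutex_lock(&pool->mtx);
    atomic_fetch_add(&pool->generation, 1); // publish after tasks are in place
    pthread_cond_broadcast(&pool->cv);
    pthread_mutex_unlock(&pool->mtx);
    while (atomic_load(&pool->n_done) < n_tasks) { /* spin, or help with tasks */ }
}
```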
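And a hypothetical sketch of the topo-sort idea: Kahn's algorithm over the node graph is O(nodes + edges), so even re-sorting a few-thousand-node graph on every eval should be cheap; caching the order and re-checking only the dynamic parts would be cheaper still. The edge-list input format here is illustrative, not ggml's actual graph representation:

```c
#include <stdlib.h>
#include <string.h>

// edges[e][0] -> edges[e][1]; fills order[0..n-1] with a valid topological
// order; returns -1 if the graph has a cycle.
static int topo_sort(int n, int m, const int (*edges)[2], int *order) {
    int *indeg = calloc(n, sizeof(int));
    int *off   = calloc(n + 1, sizeof(int));          // CSR offsets
    int *adj   = malloc((size_t) m * sizeof(int));    // CSR adjacency
    int *cur   = malloc((size_t) n * sizeof(int));
    for (int e = 0; e < m; e++) { indeg[edges[e][1]]++; off[edges[e][0] + 1]++; }
    for (int v = 0; v < n; v++) off[v + 1] += off[v]; // prefix sum
    memcpy(cur, off, (size_t) n * sizeof(int));
    for (int e = 0; e < m; e++) adj[cur[edges[e][0]]++] = edges[e][1];
    int head = 0, tail = 0;                           // `order` doubles as the queue
    for (int v = 0; v < n; v++) if (indeg[v] == 0) order[tail++] = v;
    while (head < tail) {
        int v = order[head++];
        for (int i = off[v]; i < off[v + 1]; i++)
            if (--indeg[adj[i]] == 0) order[tail++] = adj[i];
    }
    free(indeg); free(off); free(adj); free(cur);
    return tail == n ? 0 : -1;
}
```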
@ggerganov for comments.
@howard0su: I think you may have added some commits accidentally? There's already a PR for 997c749, I believe: #809
It helps with my local debugging. I will revert it from this PR once the PR is in better shape.
In the meantime, please help review that PR and merge it if it looks good.
Calling for some testing: the current change shows both performance and energy improvements. However, my devbox is a 10-core/20-thread E5 without AVX2, which is not a very typical config, so I need some help validating the performance.
This is a test result on my machine. Eval time still shows some regression. I believe this is because during eval (as opposed to prompt eval) the token batch is 1, so each thread's share of the work is smaller and the scheduling overhead has a bigger relative impact on performance.
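As a purely illustrative model (none of these numbers are measured): suppose dispatching a graph node to the pool costs a fixed ~5 µs per thread, while the per-thread compute per node is ~500 µs for a long prompt batch but only ~1 µs for a single token. During prompt eval the dispatch overhead is about 1% and invisible; during single-token eval it is several times the useful work, which would explain the regression showing up only there.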
I'll look into the threading work in more detail after I finish with the quantization improvement efforts and some other pending stuff.
But at the moment, I can immediately say that these extra thpool files cannot stay like this. At some point in the past, I had implemented a thread pool inside ggml.c and it was very compact. If we really do need a thread pool, it should be implemented in a minimalistic way inside ggml.c.
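For reference, a minimalistic in-ggml.c pool along those lines could be little more than a pair of atomics that persistent workers spin on. This is only a hedged sketch of that shape (the `compute` callback and all names are hypothetical), not the original compact implementation:

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

typedef struct spin_pool {
    atomic_int seq;       // bumped by the main thread to publish a new node
    atomic_int n_active;  // workers still computing the current node
    atomic_int stop;
    int        n_threads;
    void     (*compute)(int ith, int nth, void *params); // hypothetical op kernel
    void      *params;    // current node's parameters
} spin_pool;

typedef struct { spin_pool *pool; int ith; } worker_arg;

static void *spin_worker(void *arg_) {
    worker_arg *arg = arg_;
    spin_pool  *p   = arg->pool;
    int seen = 0;
    while (!atomic_load(&p->stop)) {
        if (atomic_load(&p->seq) == seen) { sched_yield(); continue; } // busy-wait
        seen = atomic_load(&p->seq);
        p->compute(arg->ith, p->n_threads, p->params); // this thread's slice
        atomic_fetch_sub(&p->n_active, 1);             // report completion
    }
    return NULL;
}

// Main thread, per graph node (thread 0 participates in the work):
//   p->params = node;
//   atomic_store(&p->n_active, p->n_threads - 1);
//   atomic_fetch_add(&p->seq, 1);                        // release the workers
//   p->compute(0, p->n_threads, p->params);
//   while (atomic_load(&p->n_active) > 0) sched_yield(); // spin-join
```

The trade-off versus condition variables is burned CPU while idle in exchange for near-zero dispatch latency per node, which matters most in the single-token eval case discussed above.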
I ran the benchmark script on an M1 Mac, going up to 10 threads; orange is threadpool, blue is master at the commit where the branch was forked (eeaa7b0492fc79baab8bb1fe195d6c87159f2bd3).
token time:
I can't explain why we don't see the Windows improvements here, except at 10 threads, where the thread pool is better.
| threads | threadpool | master  |
|--------:|-----------:|--------:|
| 4       | 85.79      | 82.356  |
| 6       | 60.352     | 57.628  |
| 8       | 50.604     | 47.834  |
| 10      | 60.008     | 140.958 |
It may be related to the pthread functions behaving differently there. Do you mind checking whether switching to a lock-free queue helps?
When using up all threads, my testing also shows the thread pool is significantly better, but the overall time is still not lower than with max-2 threads.