Use thread pool
Inspired by this discussion, I tried to understand the work scheduling. The patch uses a plain pthreads emulation for Windows and a thread pool implementation from the same site, with what I hope are corrections.
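For illustration, here is a minimal sketch of the kind of condition-variable-based pool described above; the names, the fixed-size queue, and the missing error handling are illustrative and not taken from the actual patch:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_QUEUE_SIZE 64

typedef struct {
    void (*fn)(void *);   /* work item callback */
    void  *arg;
} task_t;

typedef struct {
    pthread_mutex_t lock;      /* protects everything below */
    pthread_cond_t  has_work;  /* signaled when a task is queued or on shutdown */
    task_t          queue[POOL_QUEUE_SIZE];
    size_t          head, tail, count;
    bool            stop;
} pool_t;

/* worker loop: sleep on the condition variable until a task (or shutdown)
 * arrives, then run the task outside the lock */
void *pool_worker(void *p) {
    pool_t *pool = (pool_t *) p;
    for (;;) {
        pthread_mutex_lock(&pool->lock);
        while (pool->count == 0 && !pool->stop) {
            pthread_cond_wait(&pool->has_work, &pool->lock);
        }
        if (pool->stop && pool->count == 0) {
            pthread_mutex_unlock(&pool->lock);
            return NULL;
        }
        task_t t = pool->queue[pool->head];
        pool->head = (pool->head + 1) % POOL_QUEUE_SIZE;
        pool->count--;
        pthread_mutex_unlock(&pool->lock);
        t.fn(t.arg);
    }
}

/* enqueue one task and wake a sleeping worker (overflow check omitted) */
void pool_submit(pool_t *pool, void (*fn)(void *), void *arg) {
    pthread_mutex_lock(&pool->lock);
    pool->queue[pool->tail] = (task_t){ fn, arg };
    pool->tail = (pool->tail + 1) % POOL_QUEUE_SIZE;
    pool->count++;
    pthread_cond_signal(&pool->has_work);
    pthread_mutex_unlock(&pool->lock);
}
```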
The patch is only tested on Windows. ~~When testing I noticed that test_opt seems to need quite a bit more run time than earlier. Is this a regression or due to a bug fix?~~ (This was due to different test sizes.)
Here are the results of a quick check with
./bin/release/starcoder --threads 12 -m models/gpt_bigcode-santacoder/ggml-q4_0.bin -p "def test(" --temp 0
Before:
main: mem per token = 330184 bytes
main: load time = 1162.09 ms
main: sample time = 11.67 ms
main: predict time = 18299.45 ms / 90.59 ms per token
main: total time = 19597.86 ms
After:
main: mem per token = 330184 bytes
main: load time = 1168.33 ms
main: sample time = 13.62 ms
main: predict time = 11272.42 ms / 55.80 ms per token
main: total time = 12591.08 ms
Here are the numbers for
./bin/release/starcoder --threads 4 -m models/gpt_bigcode-santacoder/ggml-q4_0.bin -p "def test(" --temp 0
Before:
main: mem per token = 330056 bytes
main: load time = 1189.20 ms
main: sample time = 11.53 ms
main: predict time = 9613.42 ms / 47.59 ms per token
main: total time = 10968.48 ms
After:
main: mem per token = 330056 bytes
main: load time = 1203.09 ms
main: sample time = 12.97 ms
main: predict time = 9725.99 ms / 48.15 ms per token
main: total time = 11100.05 ms
Run on a 6-core Core i7; built with clang-cl and without BLAS support.
Tagging @JohannesGaessler since I think he was also planning on doing something similar.
I'm now going to look into @ggerganov's PR. Back again: I need guidance on how to merge the different approaches.
> If code is taken verbatim from an external source I think we should document this.

Absolutely. Assorted notes:
- I'd love to use some kind of standard `pthreads` thread pool, but this is the 'best' I found
- The presentation on the website is expository
- I had to apply fixes to make the code compile
- I had to apply fixes to make the code run
- The code looks to me a bit like a textbook implementation
- The code in question is licensed as MIT, so there should be no legal issues

Maybe we should try to contact the author and ask what to do?
Thanks for looking into this!
There are a couple more questions I'd like to discuss:
- The pool is currently located in `ggml_state`. Should we manage it in `ggml_context` instead?
- Can we improve error handling?
- Should the condition variable be signaled while the lock is still held or after releasing it? (See the sketch below.)
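To make the last question concrete, here is a hedged sketch of the two signaling orders, assuming a plain mutex/condition-variable pair (the names are illustrative):

```c
#include <pthread.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
int work_ready = 0;

/* Variant A: signal while the lock is still held. The waiter cannot observe
 * an inconsistent predicate, but on implementations without wait morphing it
 * may wake up only to block again on the mutex we still hold. */
void notify_holding_lock(void) {
    pthread_mutex_lock(&lock);
    work_ready = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}

/* Variant B: update the predicate under the lock, release it, then signal.
 * The woken thread can usually take the mutex without contending with us. */
void notify_after_unlock(void) {
    pthread_mutex_lock(&lock);
    work_ready = 1;
    pthread_mutex_unlock(&lock);
    pthread_cond_signal(&cond);
}
```

POSIX allows either order; the waiter has to re-check its predicate in a loop in both cases, so this is purely a contention/scheduling trade-off rather than a correctness question.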
> I had to apply fixes to make the code compile
> I had to apply fixes to make the code run

Please document this as well.
On my Linux system with a Ryzen 3700X, 3200 MHz dual channel memory, and an RTX 3090 this PR has worse performance than master:
| GPU | -ngl | Test | ms/t master | ms/t PR | Speedup |
|---|---|---|---|---|---|
| RTX 3090 | 0 | tg128 | 402.21 | 426.87 | 0.94 |
| RTX 3090 | 20 | tg128 | 224.68 | 238.41 | 0.94 |
| RTX 3090 | 40 | tg128 | 42.62 | 51.55 | 0.83 |
Edit: the numbers are also for starcoder-mmap.
Thanks for testing!
I changed the order of unlocking/signaling and retested with a bigger model
./bin/release/mpt -m models\mpt-7b-storywriter\ggml-model-f16.bin --threads n --temp 0 --prompt "..."
for n = 4 and 6, with both Clang and ICX/MKL. The results for the two compilers are similar, so I'm showing the Clang numbers.
Master, 4 threads:
main: sampled tokens = 200
main: mem per token = 365968 bytes
main: load time = 16946.88 ms
main: sample time = 11.75 ms / 0.06 ms per token
main: eval time = 108975.37 ms / 544.88 ms per token
main: total time = 128000.77 ms
Master, 6 threads:
main: sampled tokens = 200
main: mem per token = 366000 bytes
main: load time = 17137.89 ms
main: sample time = 12.15 ms / 0.06 ms per token
main: eval time = 98924.89 ms / 494.62 ms per token
main: total time = 117962.98 ms
PR, 4 threads:
main: sampled tokens = 200
main: mem per token = 365968 bytes
main: load time = 17143.71 ms
main: sample time = 14.17 ms / 0.07 ms per token
main: eval time = 105632.52 ms / 528.16 ms per token
main: total time = 124864.12 ms
PR, 6 threads:
main: sampled tokens = 200
main: mem per token = 366000 bytes
main: load time = 16956.28 ms
main: sample time = 14.93 ms / 0.07 ms per token
main: eval time = 92799.80 ms / 464.00 ms per token
main: total time = 111578.88 ms
I'm not sure how to proceed.
Have you tested with llama? If I understand correctly, this will cause the threads to wait on a mutex for each node in the graph. In normal conditions, a spin lock should have lower acquire overhead than a mutex (excluding edge cases such as using more threads than physically available). Since that overhead is paid for every node, the overhead will scale with the number of nodes in the graph.
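For context, a rough sketch of the two acquisition paths being compared (illustrative only, not the actual ggml code):

```c
#include <pthread.h>
#include <stdatomic.h>

atomic_flag     node_spin  = ATOMIC_FLAG_INIT;
pthread_mutex_t node_mutex = PTHREAD_MUTEX_INITIALIZER;

/* spin lock: waits entirely in user space; cheap when the hold time is short
 * and each thread has its own core, but burns CPU while waiting and degrades
 * badly when threads are oversubscribed */
void spin_acquire(void) {
    while (atomic_flag_test_and_set_explicit(&node_spin, memory_order_acquire)) {
        /* busy-wait */
    }
}

void spin_release(void) {
    atomic_flag_clear_explicit(&node_spin, memory_order_release);
}

/* mutex: an uncontended lock is cheap, but a contended one may park the thread
 * in the kernel; that sleep/wakeup pair is then paid once per graph node */
void mutex_acquire(void) {
    pthread_mutex_lock(&node_mutex);
}

void mutex_release(void) {
    pthread_mutex_unlock(&node_mutex);
}
```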
For what it's worth, what I was planning to do for https://github.com/ggerganov/llama.cpp/pull/2239 was to keep the spin locks while evaluating a graph, but keep the threads alive waiting on a mutex between calls to ggml_graph_compute, to avoid having to re-launch them for every graph.
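A sketch of that hybrid scheme as I understand it (assumed structure, not the actual code from that PR): workers park on a condition variable between graphs and synchronize through atomics/spinning while a graph is being evaluated.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             graph_generation;  /* bumped by the main thread per graph */
    bool            shutdown;
    atomic_int      next_node;         /* reset by the main thread before each graph */
    int             n_nodes;
} compute_state_t;

void *compute_worker(void *arg) {
    compute_state_t *st = (compute_state_t *) arg;
    int seen = 0;
    for (;;) {
        /* between graphs: park on the condition variable, no CPU is burned */
        pthread_mutex_lock(&st->lock);
        while (st->graph_generation == seen && !st->shutdown) {
            pthread_cond_wait(&st->cond, &st->lock);
        }
        if (st->shutdown) {
            pthread_mutex_unlock(&st->lock);
            return NULL;
        }
        seen = st->graph_generation;
        pthread_mutex_unlock(&st->lock);

        /* within the graph: synchronize through atomics/spinning instead of the
         * mutex (node claiming is simplified here; in ggml each node's work is
         * split across all threads) */
        int i;
        while ((i = atomic_fetch_add(&st->next_node, 1)) < st->n_nodes) {
            /* ... compute node i ... */
        }
    }
}
```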
If the performance is not universally better I think this should be added as a compile option so that both implementations are still available. To get a better grasp regarding which should be the default I think we would need more data.
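A trivial sketch of what such a compile-time switch could look like; the GGML_USE_THREADPOOL flag name is hypothetical, not an existing option:

```c
#include <stdio.h>

#ifdef GGML_USE_THREADPOOL
static const char *ggml_threading_backend(void) { return "condition-variable thread pool (this PR)"; }
#else
static const char *ggml_threading_backend(void) { return "spin-lock workers (master)"; }
#endif

int main(void) {
    /* selected at build time, e.g. with -DGGML_USE_THREADPOOL */
    printf("threading backend: %s\n", ggml_threading_backend());
    return 0;
}
```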
> The pool is currently located in `ggml_state`. Should we manage it in `ggml_context` instead?
@ggerganov is the authority to ask regarding ggml architecture.
To get back to what I said earlier: with the current design the threads need to wait and be notified for each tensor. With a design that queues all work ahead of time you could instead have the threads wait on a barrier (but I'm not sure whether that would be faster).
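A minimal sketch of that barrier-based alternative, assuming POSIX `pthread_barrier` is available (everything here is illustrative, not actual ggml code):

```c
#include <pthread.h>
#include <stdio.h>

#define N_WORKERS 4

pthread_barrier_t barrier;

void *worker(void *arg) {
    int id = (int)(size_t) arg;
    /* ... compute this worker's share of all the work queued ahead of time ... */
    printf("worker %d done\n", id);
    pthread_barrier_wait(&barrier);   /* single synchronization point per graph */
    return NULL;
}

int main(void) {
    pthread_t threads[N_WORKERS];
    pthread_barrier_init(&barrier, NULL, N_WORKERS + 1);  /* workers + main thread */
    for (size_t i = 0; i < N_WORKERS; i++) {
        pthread_create(&threads[i], NULL, worker, (void *) i);
    }
    pthread_barrier_wait(&barrier);   /* main waits until every worker is finished */
    for (size_t i = 0; i < N_WORKERS; i++) {
        pthread_join(threads[i], NULL);
    }
    pthread_barrier_destroy(&barrier);
    return 0;
}
```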
Maybe we should wait for slaren's backend refactor? From my perspective the biggest issue with threading is that you have to manually set the number of threads to 1 to get optimal performance with llama.cpp CUDA, which is an inconvenience for the user and suboptimal for performance if you offload only part of the network.
Keep in mind that the backend refactor is still going to take a while (several weeks), and I wouldn't want it to cause a delay to any possible improvements that could be made in the meanwhile. I understand that it is a bit of an uncomfortable situation because the changes needed to fix that issue for CUDA may be irrelevant after the backends refactor, so ultimately it is up to you.
With the introduction of ggml_threadpool (https://github.com/ggerganov/llama.cpp/pull/8672) this is now obsolete.