Use thread pool
Inspired by this discussion, I tried to understand the work scheduling. The patch uses a plain pthreads emulation for Windows and a thread pool implementation from the same site, with what I hope are corrections.
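For illustration, here is a minimal sketch of the kind of condition-variable-based pool described above; the names, the fixed-size queue, and the missing error handling are illustrative and not taken from the actual patch:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_QUEUE_SIZE 64

typedef struct {
    void (*fn)(void *);   /* work item callback */
    void  *arg;
} task_t;

typedef struct {
    pthread_mutex_t lock;      /* protects everything below */
    pthread_cond_t  has_work;  /* signaled when a task is queued or on shutdown */
    task_t          queue[POOL_QUEUE_SIZE];
    size_t          head, tail, count;
    bool            stop;
} pool_t;

/* worker loop: sleep on the condition variable until a task (or shutdown)
 * arrives, then run the task outside the lock */
void *pool_worker(void *p) {
    pool_t *pool = (pool_t *) p;
    for (;;) {
        pthread_mutex_lock(&pool->lock);
        while (pool->count == 0 && !pool->stop) {
            pthread_cond_wait(&pool->has_work, &pool->lock);
        }
        if (pool->stop && pool->count == 0) {
            pthread_mutex_unlock(&pool->lock);
            return NULL;
        }
        task_t t = pool->queue[pool->head];
        pool->head = (pool->head + 1) % POOL_QUEUE_SIZE;
        pool->count--;
        pthread_mutex_unlock(&pool->lock);
        t.fn(t.arg);
    }
}

/* enqueue one task and wake a sleeping worker (overflow check omitted) */
void pool_submit(pool_t *pool, void (*fn)(void *), void *arg) {
    pthread_mutex_lock(&pool->lock);
    pool->queue[pool->tail] = (task_t){ fn, arg };
    pool->tail = (pool->tail + 1) % POOL_QUEUE_SIZE;
    pool->count++;
    pthread_cond_signal(&pool->has_work);
    pthread_mutex_unlock(&pool->lock);
}
```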
The patch is only tested on Windows. ~~When testing I noticed that test_opt seems to need quite a bit more run time than earlier. Is this a regression or due to a bug fix?~~ (This was due to different test sizes.)
Here are the results of a quick check with
./bin/release/starcoder --threads 12 -m models/gpt_bigcode-santacoder/ggml-q4_0.bin -p "def test(" --temp 0
Before:
main: mem per token = 330184 bytes
main: load time = 1162.09 ms
main: sample time = 11.67 ms
main: predict time = 18299.45 ms / 90.59 ms per token
main: total time = 19597.86 ms
After:
main: mem per token = 330184 bytes
main: load time = 1168.33 ms
main: sample time = 13.62 ms
main: predict time = 11272.42 ms / 55.80 ms per token
main: total time = 12591.08 ms
Here are the numbers for
./bin/release/starcoder --threads 4 -m models/gpt_bigcode-santacoder/ggml-q4_0.bin -p "def test(" --temp 0
Before:
main: mem per token = 330056 bytes
main: load time = 1189.20 ms
main: sample time = 11.53 ms
main: predict time = 9613.42 ms / 47.59 ms per token
main: total time = 10968.48 ms
After:
main: mem per token = 330056 bytes
main: load time = 1203.09 ms
main: sample time = 12.97 ms
main: predict time = 9725.99 ms / 48.15 ms per token
main: total time = 11100.05 ms
Run on a 6-core Core i7; built with clang-cl and without BLAS support.
Tagging @JohannesGaessler since I think he was also planning on doing something similar.
I'm now going to look into @ggerganov's PR. Back again: I need guidance on how to merge the different approaches.
> If code is taken verbatim from an external source I think we should document this.

Absolutely. Assorted notes:
- I'd love to use some kind of standard `pthreads` thread pool, but this is the 'best' I found
- The presentation on the website is expository
- I had to apply fixes to make the code compile
- I had to apply fixes to make the code run
- The code looks to me a bit like a textbook implementation
- The code in question is licensed as MIT, so there should be no legal issues

Maybe we should try to contact the author and ask what to do?
Thanks for looking into this!
There are a couple more questions I'd like to discuss:
- The pool is currently located in `ggml_state`. Should we manage it in `ggml_context` instead?
- Can we improve error handling?
- Should the condition variable be signaled while the lock is still held or after releasing it? (See the sketch below.)
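To make the last question concrete, here is a hedged sketch of the two signaling orders, assuming a plain mutex/condition-variable pair (the names are illustrative):

```c
#include <pthread.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
int work_ready = 0;

/* Variant A: signal while the lock is still held. The waiter cannot observe
 * an inconsistent predicate, but on implementations without wait morphing it
 * may wake up only to block again on the mutex we still hold. */
void notify_holding_lock(void) {
    pthread_mutex_lock(&lock);
    work_ready = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}

/* Variant B: update the predicate under the lock, release it, then signal.
 * The woken thread can usually take the mutex without contending with us. */
void notify_after_unlock(void) {
    pthread_mutex_lock(&lock);
    work_ready = 1;
    pthread_mutex_unlock(&lock);
    pthread_cond_signal(&cond);
}
```

POSIX allows either order; the waiter has to re-check its predicate in a loop in both cases, so this is purely a contention/scheduling trade-off rather than a correctness question.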
> I had to apply fixes to make the code compile
> I had to apply fixes to make the code run

Please document this as well.
On my Linux system with a Ryzen 3700X, 3200 MHz dual channel memory, and an RTX 3090 this PR has worse performance than master:
| GPU | -ngl | Test | ms/t master | ms/t PR | Speedup |
|---|---|---|---|---|---|
| RTX 3090 | 0 | tg128 | 402.21 | 426.87 | 0.94 |
| RTX 3090 | 20 | tg128 | 224.68 | 238.41 | 0.94 |
| RTX 3090 | 40 | tg128 | 42.62 | 51.55 | 0.83 |
Edit: the numbers are also for starcoder-mmap.
Thanks for testing!
I changed the order of unlocking/signaling and retested with a bigger model
./bin/release/mpt -m models\mpt-7b-storywriter\ggml-model-f16.bin --threads n --temp 0 --prompt "..."
for n = 4 and 6, with both Clang and ICX/MKL. The results for the two compilers are similar, so I'm showing the Clang numbers.
Master, 4 threads:
main: sampled tokens = 200
main: mem per token = 365968 bytes
main: load time = 16946.88 ms
main: sample time = 11.75 ms / 0.06 ms per token
main: eval time = 108975.37 ms / 544.88 ms per token
main: total time = 128000.77 ms
Master, 6 threads:
main: sampled tokens = 200
main: mem per token = 366000 bytes
main: load time = 17137.89 ms
main: sample time = 12.15 ms / 0.06 ms per token
main: eval time = 98924.89 ms / 494.62 ms per token
main: total time = 117962.98 ms
PR, 4 threads:
main: sampled tokens = 200
main: mem per token = 365968 bytes
main: load time = 17143.71 ms
main: sample time = 14.17 ms / 0.07 ms per token
main: eval time = 105632.52 ms / 528.16 ms per token
main: total time = 124864.12 ms
PR, 6 threads:
main: sampled tokens = 200
main: mem per token = 366000 bytes
main: load time = 16956.28 ms
main: sample time = 14.93 ms / 0.07 ms per token
main: eval time = 92799.80 ms / 464.00 ms per token
main: total time = 111578.88 ms
I'm not sure how to proceed.
Have you tested with llama? If I understand correctly, this will cause the threads to wait on a mutex for each node in the graph. In normal conditions, a spin lock should have lower acquire overhead than a mutex (excluding edge cases such as using more threads than physically available). Since that overhead is paid for every node, the overhead will scale with the number of nodes in the graph.
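For context, a rough sketch of the two acquisition paths being compared (illustrative only, not the actual ggml code):

```c
#include <pthread.h>
#include <stdatomic.h>

atomic_flag     node_spin  = ATOMIC_FLAG_INIT;
pthread_mutex_t node_mutex = PTHREAD_MUTEX_INITIALIZER;

/* spin lock: waits entirely in user space; cheap when the hold time is short
 * and each thread has its own core, but burns CPU while waiting and degrades
 * badly when threads are oversubscribed */
void spin_acquire(void) {
    while (atomic_flag_test_and_set_explicit(&node_spin, memory_order_acquire)) {
        /* busy-wait */
    }
}

void spin_release(void) {
    atomic_flag_clear_explicit(&node_spin, memory_order_release);
}

/* mutex: an uncontended lock is cheap, but a contended one may park the thread
 * in the kernel; that sleep/wakeup pair is then paid once per graph node */
void mutex_acquire(void) {
    pthread_mutex_lock(&node_mutex);
}

void mutex_release(void) {
    pthread_mutex_unlock(&node_mutex);
}
```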
For what it's worth, what I was planning to do for https://github.com/ggerganov/llama.cpp/pull/2239 was to keep the spin locks while evaluating a graph, but keep the threads alive waiting on a mutex between calls to ggml_graph_compute, to avoid having to re-launch them for every graph.
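A sketch of that hybrid scheme as I understand it (assumed structure, not the actual code from that PR): workers park on a condition variable between graphs and synchronize through atomics/spinning while a graph is being evaluated.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             graph_generation;  /* bumped by the main thread per graph */
    bool            shutdown;
    atomic_int      next_node;         /* reset by the main thread before each graph */
    int             n_nodes;
} compute_state_t;

void *compute_worker(void *arg) {
    compute_state_t *st = (compute_state_t *) arg;
    int seen = 0;
    for (;;) {
        /* between graphs: park on the condition variable, no CPU is burned */
        pthread_mutex_lock(&st->lock);
        while (st->graph_generation == seen && !st->shutdown) {
            pthread_cond_wait(&st->cond, &st->lock);
        }
        if (st->shutdown) {
            pthread_mutex_unlock(&st->lock);
            return NULL;
        }
        seen = st->graph_generation;
        pthread_mutex_unlock(&st->lock);

        /* within the graph: synchronize through atomics/spinning instead of the
         * mutex (node claiming is simplified here; in ggml each node's work is
         * split across all threads) */
        int i;
        while ((i = atomic_fetch_add(&st->next_node, 1)) < st->n_nodes) {
            /* ... compute node i ... */
        }
    }
}
```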
If the performance is not universally better I think this should be added as a compile option so that both implementations are still available. To get a better grasp regarding which should be the default I think we would need more data.
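A trivial sketch of what such a compile-time switch could look like; the GGML_USE_THREADPOOL flag name is hypothetical, not an existing option:

```c
#include <stdio.h>

#ifdef GGML_USE_THREADPOOL
static const char *ggml_threading_backend(void) { return "condition-variable thread pool (this PR)"; }
#else
static const char *ggml_threading_backend(void) { return "spin-lock workers (master)"; }
#endif

int main(void) {
    /* selected at build time, e.g. with -DGGML_USE_THREADPOOL */
    printf("threading backend: %s\n", ggml_threading_backend());
    return 0;
}
```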
> The pool is currently located in `ggml_state`. Should we manage it in `ggml_context` instead?
@ggerganov is the authority to ask regarding ggml architecture.
To get back to what I said earlier: with the current design the threads need to wait and be notified for each tensor. With a design that queues all work ahead of time you could instead have the threads wait on a barrier (but I'm not sure whether that would be faster).
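A minimal sketch of that barrier-based alternative, assuming POSIX `pthread_barrier` is available (everything here is illustrative, not actual ggml code):

```c
#include <pthread.h>
#include <stdio.h>

#define N_WORKERS 4

pthread_barrier_t barrier;

void *worker(void *arg) {
    int id = (int)(size_t) arg;
    /* ... compute this worker's share of all the work queued ahead of time ... */
    printf("worker %d done\n", id);
    pthread_barrier_wait(&barrier);   /* single synchronization point per graph */
    return NULL;
}

int main(void) {
    pthread_t threads[N_WORKERS];
    pthread_barrier_init(&barrier, NULL, N_WORKERS + 1);  /* workers + main thread */
    for (size_t i = 0; i < N_WORKERS; i++) {
        pthread_create(&threads[i], NULL, worker, (void *) i);
    }
    pthread_barrier_wait(&barrier);   /* main waits until every worker is finished */
    for (size_t i = 0; i < N_WORKERS; i++) {
        pthread_join(threads[i], NULL);
    }
    pthread_barrier_destroy(&barrier);
    return 0;
}
```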
Maybe we should wait for slaren's backend refactor? From my perspective the biggest issue with threading is that you have to manually set the number of threads to 1 to get optimal performance with llama.cpp CUDA, which is an inconvenience for the user and suboptimal for performance if you offload only part of the network.
Keep in mind that the backend refactor is still going to take a while (several weeks), and I wouldn't want it to cause a delay to any possible improvements that could be made in the meanwhile. I understand that it is a bit of an uncomfortable situation because the changes needed to fix that issue for CUDA may be irrelevant after the backends refactor, so ultimately it is up to you.
With the introduction of ggml_threadpool (https://github.com/ggerganov/llama.cpp/pull/8672) this is now obsolete.