
threads: changing to a mutex/condvar based thread pool.

bogdad opened this issue • 9 comments

This is an attempt to change the threading in ggml from busy-wait spin locking to a mutex/condvar-based thread pool. I don’t think this should be merged: it adds a dependency on a copied https://github.com/Pithikos/C-Thread-Pool (also hacked to work on Windows), just to see what the effect on performance and energy usage would be. But maybe it will inspire further work in this direction.
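
A minimal sketch of the idea, for context: instead of workers spin-waiting on an atomic flag, per-node work is submitted to a condvar-based pool and the caller blocks until it drains. This uses the actual C-Thread-Pool API (`thpool_init`/`thpool_add_work`/`thpool_wait`/`thpool_destroy`); the `node_work` struct and `compute_node_range` helper are hypothetical stand-ins, not the PR's code.

```c
#include "ggml.h"
#include "thpool.h" // https://github.com/Pithikos/C-Thread-Pool

struct node_work {
    struct ggml_tensor * node; // the graph node to compute
    int ith, nth;              // this worker's slice index / total slices
};

static void compute_node_range(void * arg) {
    struct node_work * w = arg;
    // ... run w->node's op on slice w->ith of w->nth ...
    (void) w;
}

void graph_compute_pooled(struct ggml_cgraph * graph, int n_threads) {
    threadpool pool = thpool_init(n_threads); // idle workers sleep on a condvar
    struct node_work work[n_threads];
    for (int i = 0; i < graph->n_nodes; i++) {
        for (int t = 0; t < n_threads; t++) {
            work[t] = (struct node_work) { graph->nodes[i], t, n_threads };
            thpool_add_work(pool, compute_node_range, &work[t]);
        }
        thpool_wait(pool); // block, rather than spin, until the node is done
    }
    thpool_destroy(pool);
}
```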

The motivation is energy consumption: this PR gets CPU usage down from roughly 700% to 400% on an 8-thread Mac run, while making the per-token eval time slightly worse. Similar CPU savings may apply on other platforms.

I timed a few runs below, and I will also add Activity Monitor screenshots (thread pool vs. master) in the comments.

I can't fully explain the CPU savings though; the main thread also seems to take part in the computation.

bogdad, Apr 02 '23 12:04

Thread pool, Mac 7B, threads 6, n_predict 64

llama_print_timings:        load time =   625.00 ms
llama_print_timings:      sample time =    46.18 ms /    64 runs   (    0.72 ms per run)
llama_print_timings: prompt eval time =  1068.37 ms /    21 tokens (   50.87 ms per token)
llama_print_timings:        eval time =  6690.41 ms /    63 runs   (  106.20 ms per run)
llama_print_timings:       total time =  8006.15 ms

Master, Mac 7B, threads 6, n_predict 64

llama_print_timings:        load time =   570.73 ms
llama_print_timings:      sample time =    30.81 ms /    42 runs   (    0.73 ms per run)
llama_print_timings: prompt eval time =   907.05 ms /    21 tokens (   43.19 ms per token)
llama_print_timings:        eval time =  2208.34 ms /    41 runs   (   53.86 ms per run)
llama_print_timings:       total time =  3343.33 ms

Thread pool, Windows

system_info: n_threads = 8 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 64, n_keep = 0

llama_print_timings:        load time =  1003.66 ms
llama_print_timings:      sample time =    43.09 ms /    64 runs   (    0.67 ms per run)
llama_print_timings: prompt eval time =  1652.55 ms /    21 tokens (   78.69 ms per token)
llama_print_timings:        eval time = 11181.85 ms /    63 runs   (  177.49 ms per run)
llama_print_timings:       total time = 13169.05 ms

Master, Windows

system_info: n_threads = 32 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 64, n_keep = 0


 Below is an instruction that describes a task. Write a response that appropriately completes the request.
Sarah and Eric have been dating for two years, but they don’t want to get married yet. Sarah wants to spend some time with Eric alone before committing to marriage. She asks him what he thinks of the idea of going on a singles cruise together as their next date night.

llama_print_timings:        load time =   880.14 ms
llama_print_timings:      sample time =    53.33 ms /    64 runs   (    0.83 ms per run)
llama_print_timings: prompt eval time =  1045.64 ms /    21 tokens (   49.79 ms per token)
llama_print_timings:        eval time = 10429.93 ms /    63 runs   (  165.55 ms per run)
llama_print_timings:       total time = 11820.35 ms

Thread pool, Mac 65B, threads 8, n_predict 64

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 64, n_keep = 0
...
llama_print_timings:        load time =  5735.31 ms
llama_print_timings:      sample time =    27.86 ms /    38 runs   (    0.73 ms per run)
llama_print_timings: prompt eval time = 11073.08 ms /    21 tokens (  527.29 ms per token)
llama_print_timings:        eval time = 21568.16 ms /    37 runs   (  582.92 ms per run)
llama_print_timings:       total time = 33071.62 ms

Master, Mac 65B, threads 8, n_predict 64

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 64, n_keep = 0

llama_print_timings:        load time =  5882.47 ms
llama_print_timings:      sample time =    48.56 ms /    64 runs   (    0.76 ms per run)
llama_print_timings: prompt eval time = 10561.09 ms /    21 tokens (  502.91 ms per token)
llama_print_timings:        eval time = 30989.55 ms /    63 runs   (  491.90 ms per run)
llama_print_timings:       total time = 42014.82 ms

bogdad, Apr 02 '23 12:04

Activity Monitor, thread pool: [screenshot, 2023-04-02 13:47]

Activity Monitor, master: [screenshot, 2023-04-02 13:45]

Sampling profiler (just for reference), thread pool: [screenshots, 2023-04-02 13:17 and 13:18]

Sampling profiler, master: [screenshots, 2023-04-02 13:12 and 13:14]

bogdad, Apr 02 '23 12:04

Converting to draft since even the author of the PR does not think this should be merged as is:

> I don’t think this should be merged

prusnak, Apr 02 '23 12:04

I was thinking about how to easily show CPU time spent spin-locking vs. being blocked in the thread pool.

This change, https://github.com/bogdad/llama.cpp/pull/7/files, extracts the spinning portions of ggml_graph_compute and ggml_graph_compute_thread on master into separate functions that show up in the sampling profiler; hopefully it doesn't change master's behavior much.
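
Roughly, the extraction looks like this (names illustrative, not the exact ones in the linked diff): the busy-wait loop is pulled into its own `noinline` function so the sampling profiler attributes spin time to a distinct symbol instead of folding it into the parent frame.

```c
#include <stdatomic.h>

// same behavior as the inline spin on master; only the symbol changes,
// so Instruments can show spin time as its own row in the call tree
__attribute__((noinline))
static void ggml_graph_compute_spin_wait(atomic_int * n_ready, int n_threads) {
    while (atomic_load(n_ready) < n_threads) {
        // busy wait
    }
}
```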

Instruments run with the large prompt (dan.txt) and the 65B model: `./build/bin/main -m ./models/65B/ggml-model-q4_0.bin --color -f ./prompts/dan.txt -n 64 -t 8` [screenshot, 2023-04-03 20:36]

Or the 7B model: `./build/bin/main -m ./models/7B/ggml-model-q4_0.bin --color -f ./prompts/dan.txt -n 64 -t 8` [screenshot, 2023-04-03 20:48]

Compared to the thread pool: [screenshot, 2023-04-03 21:00]

I think that shows spinning accounts for about 20% of CPU time with the large model on the large prompt, and about 40% with the small model on the large prompt, unless I am misreading the profiler output.

bogdad, Apr 03 '23 19:04

I think this is the right direction. Spinning does take quite a bit of CPU: in my perf run, it is about 30% on Windows 10.

I suggest you look at spinning for a few cycles before taking the lock while the main thread is doing the preparation (not sure if this optimization is already done in the thread pool); a hybrid wait is sketched below. You may also want to consider setting CPU affinity.
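
A sketch of such a hybrid wait (names hypothetical): spin for a short, bounded number of iterations first, which is cheap when the main thread hands work over quickly, and only then fall back to blocking on the condvar. Affinity would be platform-specific, e.g. `pthread_setaffinity_np` on Linux or `SetThreadAffinityMask` on Windows (macOS only exposes affinity hints).

```c
#include <pthread.h>
#include <stdatomic.h>

#define SPIN_ITERS 4096

// note: the producer must set *has_work (and signal) while holding mtx,
// otherwise a wakeup can be lost between the final check and the wait
static void wait_for_work(pthread_mutex_t * mtx, pthread_cond_t * cond,
                          atomic_bool * has_work) {
    // phase 1: bounded spin - no syscall paid if work arrives almost immediately
    for (int i = 0; i < SPIN_ITERS; i++) {
        if (atomic_load(has_work)) return;
    }
    // phase 2: block on the condvar, yielding the core to the OS
    pthread_mutex_lock(mtx);
    while (!atomic_load(has_work)) {
        pthread_cond_wait(cond, mtx);
    }
    pthread_mutex_unlock(mtx);
}
```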

howard0su, Apr 04 '23 00:04

I removed the code that runs FINALIZE, since every operator's finalize branch is empty.
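
For context, ggml at the time ran each node in phases (the enum is ggml's; the comments paraphrase the dispatch). Dropping the third pass removes one full synchronization round per node:

```c
enum ggml_task_type {
    GGML_TASK_INIT = 0, // per-node setup, run single-threaded
    GGML_TASK_COMPUTE,  // the parallel pass, fanned out across workers
    GGML_TASK_FINALIZE, // a third synchronized pass - empty for every op
};
```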

With this change:

Running with 8 threads...
         8 threads | run 1/4 | current token time 322.61 ms - eval time 38633.45 ms - prompt eval time 2580.88 ms
         8 threads | run 2/4 | current token time 320.44 ms - eval time 39142.37 ms - prompt eval time 2563.55 ms
         8 threads | run 3/4 | current token time 302.37 ms - eval time 39437.33 ms - prompt eval time 2418.94 ms
         8 threads | run 4/4 | current token time 309.38 ms - eval time 39101.15 ms - prompt eval time 2475.01 ms
Running with 12 threads...
         12 threads | run 1/4 | current token time 317.88 ms - eval time 47406.18 ms - prompt eval time 2543.04 ms
         12 threads | run 2/4 | current token time 319.62 ms - eval time 47466.82 ms - prompt eval time 2556.97 ms
         12 threads | run 3/4 | current token time 320.63 ms - eval time 47356.98 ms - prompt eval time 2565.01 ms
         12 threads | run 4/4 | current token time 310.91 ms - eval time 47467.93 ms - prompt eval time 2487.31 ms
Running with 16 threads...
         16 threads | run 1/4 | current token time 251.6 ms - eval time 39880.11 ms - prompt eval time 2012.83 ms
         16 threads | run 2/4 | current token time 252.94 ms - eval time 39864.13 ms - prompt eval time 2023.53 ms
         16 threads | run 3/4 | current token time 254.65 ms - eval time 40030.51 ms - prompt eval time 2037.18 ms
         16 threads | run 4/4 | current token time 247.06 ms - eval time 39718.12 ms - prompt eval time 1976.49 ms
Running with 20 threads...
         20 threads | run 1/4 | current token time 230.93 ms - eval time 40994.86 ms - prompt eval time 1847.46 ms
         20 threads | run 2/4 | current token time 247.3 ms - eval time 39370.09 ms - prompt eval time 1978.36 ms
         20 threads | run 3/4 | current token time 243.99 ms - eval time 40523.65 ms - prompt eval time 1951.9 ms
         20 threads | run 4/4 | current token time 243.74 ms - eval time 38985.39 ms - prompt eval time 1949.94 ms

Master:

Running with 8 threads...
         8 threads | run 1/4 | current token time 387.52 ms - eval time 50231.0 ms - prompt eval time 3100.13 ms
         8 threads | run 2/4 | current token time 364.58 ms - eval time 50275.85 ms - prompt eval time 2916.65 ms
         8 threads | run 3/4 | current token time 389.67 ms - eval time 50712.58 ms - prompt eval time 3117.39 ms
         8 threads | run 4/4 | current token time 383.34 ms - eval time 50331.24 ms - prompt eval time 3066.7 ms
Running with 12 threads...
         12 threads | run 1/4 | current token time 317.38 ms - eval time 41190.92 ms - prompt eval time 2539.05 ms
         12 threads | run 2/4 | current token time 317.49 ms - eval time 41369.19 ms - prompt eval time 2539.9 ms
         12 threads | run 3/4 | current token time 324.44 ms - eval time 41333.93 ms - prompt eval time 2595.49 ms
         12 threads | run 4/4 | current token time 315.98 ms - eval time 40918.64 ms - prompt eval time 2527.88 ms
Running with 16 threads...
         16 threads | run 1/4 | current token time 265.92 ms - eval time 33277.19 ms - prompt eval time 2127.36 ms
         16 threads | run 2/4 | current token time 244.35 ms - eval time 33022.23 ms - prompt eval time 1954.79 ms
         16 threads | run 3/4 | current token time 250.27 ms - eval time 33235.78 ms - prompt eval time 2002.12 ms
         16 threads | run 4/4 | current token time 250.62 ms - eval time 32954.86 ms - prompt eval time 2004.93 ms
Running with 20 threads...
         20 threads | run 1/4 | current token time 431.94 ms - eval time 73713.41 ms - prompt eval time 3455.48 ms
         20 threads | run 2/4 | current token time 353.77 ms - eval time 76674.15 ms - prompt eval time 2830.16 ms
         20 threads | run 3/4 | current token time 398.88 ms - eval time 107782.13 ms - prompt eval time 3191.0 ms
         20 threads | run 4/4 | current token time 455.42 ms - eval time 110819.35 ms - prompt eval time 3643.36 ms

The graph is very strange. (The legend is wrong: "main" is this change, "rope opt" is master.) [Figure_1]

howard0su, Apr 07 '23 15:04

Very cool! Agreed, strange indeed; I would expect master to be faster than the thread pool. Is it that doing nothing (no finalize) with spinlocks is so much faster than doing nothing with a thread pool, so that once we remove it the thread pool becomes faster? (If I understood your change correctly.)

With this thread pool change, the question for me is how to do it properly so as not to bring the thread pool dependency into ggml, and to maybe keep it in the user code instead, i.e. llama.cpp.

One way to do this is to extract some kind of "schedule work" interface that ggml could use, but I have not had a chance to work on this further yet; a possible shape is sketched below.
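
A hypothetical shape for such an interface (entirely illustrative; nothing like this existed in ggml at the time): ggml calls through function pointers, and the embedding application, e.g. llama.cpp, backs them with whatever pool it likes, whether spinning threads, a condvar pool, or something else.

```c
// hypothetical "schedule work" interface, sketched for discussion
typedef void (*ggml_work_fn)(void * arg);

struct ggml_work_scheduler {
    void * ctx;                                               // the pool itself
    void (*submit)(void * ctx, ggml_work_fn fn, void * arg);  // enqueue one task
    void (*wait_all)(void * ctx);                             // barrier: all submitted tasks done
};

// ggml_graph_compute could then take a scheduler instead of spawning threads:
// void ggml_graph_compute(struct ggml_cgraph * g, struct ggml_work_scheduler * s);
```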

bogdad, Apr 07 '23 17:04

Check this branch: https://github.com/howard0su/llama.cpp/tree/tp_schedule

Overall, I believe we can make the thread pool faster, but the current thread pool implementation is suboptimal. We may need to look at a lock-free queue to replace it; the first thing, though, is implementing a better scheduling algorithm. One possible direction is sketched below.
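
One lock-free direction, as a sketch (not what the branch above does): workers claim slices of the current node with a single atomic fetch-add, so the hot path has no mutex or queue at all.

```c
#include <stdatomic.h>

static atomic_int next_chunk; // reset to 0 by the main thread for each node

void worker_loop(int n_chunks /*, per-node state */) {
    for (;;) {
        int c = atomic_fetch_add(&next_chunk, 1); // claim the next slice
        if (c >= n_chunks) break; // node exhausted; wait for the next node
        // ... compute chunk c of the current node ...
    }
}
```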

Whether it belongs in ggml or llama.cpp is debatable. I would prefer a better graph scheduler to replace the current one.

howard0su, Apr 08 '23 14:04

Can we get a short TL;DR? Adding thpool.h / thpool.c is not an option.

ggerganov, Apr 13 '23 13:04

Can one of the admins verify this patch?

alitariq4589, Sep 15 '23 13:09

Oh, I missed that.

The TL;DR: this was just an exploration of how llama.cpp would behave if there were no busy waiting. It was not supposed to be merged, because it pulls in an external thread pool implementation.

Since then, work scheduling in llama.cpp has moved on a lot, so I will close this.

Feel free to use the patch, though it has very likely diverged a lot from main. Reopening is also fine by me if there is value in this.

bogdad, Sep 15 '23 13:09