Threadpool: take 2
ref: original PR #7526
Added an API to support explicit management and fine-grained control of threadpools. The API supports creating different threadpools for various parts of execution, e.g. batch, single-token, etc. Each threadpool can be created, paused, resumed, and released independently of any other threadpool. This mitigates the overhead of starting/stopping threads for each decode call and helps OSes keep track of scheduling history in order to make better scheduling decisions.
Each threadpool supports:
- Setting the number of threads (duh)
- Setting a CPU mask for the threads to be placed on
- Strict/relaxed placement: pinning specific threads to specific cores, or letting the OS decide
- Polling or interrupt-driven wait
- Setting thread priority

Using threadpools explicitly is optional. If llama_decode is called with a llama_context that doesn't have a threadpool attached, a disposable threadpool is created (same as the current behavior). If users choose to use threadpools explicitly, they have to manage them manually. See the example in main.cpp, and the concept sketch below.
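The actual API and a full example live in this PR (see main.cpp); the snippet below is only a minimal, self-contained C++ sketch of the underlying idea rather than the PR's real functions or signatures: a pool whose worker threads persist across decode calls and can be paused and resumed instead of being spawned and joined on every call.

```cpp
// Minimal illustration of a persistent, pausable thread pool.
// NOT the ggml/llama API from this PR -- just the concept: threads are created
// once, reused for every "decode", and paused in between instead of being joined.
#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

class PausableThreadPool {
public:
    explicit PausableThreadPool(int n_threads) {
        for (int i = 0; i < n_threads; ++i) {
            workers.emplace_back([this, i] { worker_loop(i); });
        }
    }
    ~PausableThreadPool() {
        { std::lock_guard<std::mutex> lk(m); stop = true; paused = false; }
        cv.notify_all();
        for (auto & t : workers) t.join();
    }
    // Stop burning CPU between decode calls; the threads stay alive.
    void pause()  { std::lock_guard<std::mutex> lk(m); paused = true; }
    // Wake the workers up again before the next graph compute.
    void resume() { { std::lock_guard<std::mutex> lk(m); paused = false; } cv.notify_all(); }
    // Hand one piece of work to every worker and wait for all of them to finish.
    void run(std::function<void(int)> fn) {
        { std::lock_guard<std::mutex> lk(m); work = std::move(fn); ++generation; pending = (int) workers.size(); }
        cv.notify_all();
        std::unique_lock<std::mutex> lk(m);
        cv_done.wait(lk, [this] { return pending == 0; });
    }
private:
    void worker_loop(int id) {
        int seen = 0;
        for (;;) {
            std::function<void(int)> fn;
            {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [&] { return stop || (!paused && generation != seen); });
                if (stop) return;
                seen = generation;
                fn = work;
            }
            fn(id);
            std::lock_guard<std::mutex> lk(m);
            if (--pending == 0) cv_done.notify_one();
        }
    }
    std::vector<std::thread> workers;
    std::mutex m;
    std::condition_variable cv, cv_done;
    std::function<void(int)> work;
    int generation = 0, pending = 0;
    bool paused = false, stop = false;
};

int main() {
    PausableThreadPool pool(4);       // created once, reused for every "decode"
    for (int step = 0; step < 3; ++step) {
        pool.resume();
        pool.run([step](int id) { std::printf("step %d: worker %d\n", step, id); });
        pool.pause();                 // threads stay alive but idle until the next step
    }
}
```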
With all the bells and whistles enabled, we generally see a minor improvement vs OMP. Without polling, threadpool runs on par with OMP.
- [x] I have read the contributing guidelines
- Self-reported review complexity:
- [ ] Low
- [x] Medium
- [ ] High
Here are some perf figures:
On a Xeon W-2225 machine (CPU backend):
| CPU | Model | Test | t/s master | t/s threadpool | Speedup |
|---|---|---|---|---|---|
| Intel(R) Xeon(R) W-2225 CPU @ 4.10GHz | llama 7B Q4_0 | pp512 | 17.46 | 17.51 | 1.00 |
| Intel(R) Xeon(R) W-2225 CPU @ 4.10GHz | llama 7B Q4_0 | tg128 | 6.98 | 7.06 | 1.01 |
Intel 10th-gen CPU: ./scripts/compare-commits.sh master threadpool -t 1,2,4,6,8,10
| CPU | Model | Threads | Test | t/s master | t/s threadpool-attempt-2 | Speedup |
|---|---|---|---|---|---|---|
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 1 | pp512 | 3.93 | 3.94 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 1 | tg128 | 2.43 | 2.44 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 2 | pp512 | 7.13 | 7.06 | 0.99 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 2 | tg128 | 4.37 | 4.36 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 4 | pp512 | 11.96 | 11.99 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 4 | tg128 | 6.79 | 6.77 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 6 | pp512 | 14.96 | 14.98 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 6 | tg128 | 7.51 | 7.53 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 8 | pp512 | 13.06 | 13.09 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 8 | tg128 | 6.88 | 6.83 | 0.99 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 10 | pp512 | 14.08 | 14.06 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 10 | tg128 | 7.49 | 7.52 | 1.00 |
Mobile NVIDIA 3060: $ LLAMA_CUDA=1 ./scripts/compare-commits.sh master threadpool -nkvo 0,1
| GPU | Model | NKVO | Test | t/s master | t/s threadpool-attempt-2 | Speedup |
|---|---|---|---|---|---|---|
| RTX 3060 Laptop GPU | llama 7B Q4_0 | No | pp512 | 1644.73 | 1642.34 | 1.00 |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | No | tg128 | 65.94 | 65.89 | 1.00 |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | Yes | pp512 | 287.28 | 286.44 | 1.00 |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | Yes | tg128 | 54.56 | 54.32 | 1.00 |
@slaren Threadpool is back! I updated it a bit to align with the latest graph-compute design. The current performance is largely on par with OpenMP. Please let me know if you have any comments/suggestions.
I tried to test this on macOS, but it seems to deadlock.
WARNING: ThreadSanitizer: data race (pid=62377)
Write of size 1 at 0x00010ab02a8e by main thread:
#0 ggml_graph_compute ggml.c:19365 (llama-bench:arm64+0x10003fb54)
#1 ggml_backend_cpu_graph_compute ggml-backend.c:822 (llama-bench:arm64+0x1000a5f1c)
#2 ggml_backend_graph_compute_async ggml-backend.c:282 (llama-bench:arm64+0x10009bac0)
#3 ggml_backend_sched_compute_splits ggml-backend.c:1795 (llama-bench:arm64+0x1000a3190)
#4 ggml_backend_sched_graph_compute_async ggml-backend.c:1979 (llama-bench:arm64+0x1000a2d24)
#5 llama_graph_compute(llama_context&, ggml_cgraph*, int, ggml_compute_threadpool*) llama.cpp:14412 (llama-bench:arm64+0x100292cac)
#6 llama_decode_internal(llama_context&, llama_batch) llama.cpp:14666 (llama-bench:arm64+0x1000fda4c)
#7 llama_decode llama.cpp:18489 (llama-bench:arm64+0x1000fc460)
#8 test_prompt(llama_context*, int, int, int, int) llama-bench.cpp:1319 (llama-bench:arm64+0x10062bd5c)
#9 main llama-bench.cpp:1454 (llama-bench:arm64+0x100627180)
Previous read of size 1 at 0x00010ab02a8e by thread T12 (mutexes: write M0):
#0 ggml_graph_compute_check_for_work ggml.c:19152 (llama-bench:arm64+0x100053a10)
#1 ggml_graph_compute_secondary_thread ggml.c:19189 (llama-bench:arm64+0x1000537dc)
Location is heap block of size 192 at 0x00010ab02a00 allocated by main thread:
#0 posix_memalign <null>:96112260 (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x564c0)
#1 ggml_aligned_malloc ggml.c:241 (llama-bench:arm64+0x10001ac88)
#2 ggml_create_threadpool_impl ggml.c:19214 (llama-bench:arm64+0x10003f14c)
#3 ggml_create_threadpool ggml.c:19292 (llama-bench:arm64+0x10003f0b8)
#4 main llama-bench.cpp:1418 (llama-bench:arm64+0x100626cd0)
Mutex M0 (0x00010ab02a00) created at:
#0 pthread_mutex_init <null>:96112260 (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x31470)
#1 ggml_create_threadpool_impl ggml.c:19238 (llama-bench:arm64+0x10003f404)
#2 ggml_create_threadpool ggml.c:19292 (llama-bench:arm64+0x10003f0b8)
#3 main llama-bench.cpp:1418 (llama-bench:arm64+0x100626cd0)
Thread T12 (tid=36579987, running) created by main thread at:
#0 pthread_create <null>:96112260 (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x3062c)
#1 ggml_create_threadpool_impl ggml.c:19277 (llama-bench:arm64+0x10003f638)
#2 ggml_create_threadpool ggml.c:19292 (llama-bench:arm64+0x10003f0b8)
#3 main llama-bench.cpp:1418 (llama-bench:arm64+0x100626cd0)
SUMMARY: ThreadSanitizer: data race ggml.c:19365 in ggml_graph_compute
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
* frame #0: 0x0000000104baf664 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3018:29
frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x000000012481fc00) at ggml.c:19132:5
frame #2: 0x0000000104ba17ec llama-bench`ggml_graph_compute(cgraph=0x00000001182901b8, cplan=0x000000016b28a730) at ggml.c:19373:5
frame #3: 0x0000000104be8400 llama-bench`ggml_backend_cpu_graph_compute(backend=0x00006000039680b0, cgraph=0x00000001182901b8) at ggml-backend.c:822:12
frame #4: 0x0000000104be23c4 llama-bench`ggml_backend_graph_compute_async(backend=0x00006000039680b0, cgraph=0x00000001182901b8) at ggml-backend.c:282:12
frame #5: 0x0000000104be6834 llama-bench`ggml_backend_sched_compute_splits(sched=0x0000000115000000) at ggml-backend.c:1795:35
frame #6: 0x0000000104be65a0 llama-bench`ggml_backend_sched_graph_compute_async(sched=0x0000000115000000, graph=0x0000000118420020) at ggml-backend.c:1979:12
frame #7: 0x0000000104d09f44 llama-bench`llama_graph_compute(lctx=0x0000000114813e00, gf=0x0000000118420020, n_threads=12, threadpool=0x0000600003b6c3c0) at llama.cpp:14412:5
frame #8: 0x0000000104c2b148 llama-bench`llama_decode_internal(lctx=0x0000000114813e00, batch_all=llama_batch @ 0x000000016b28ac60) at llama.cpp:14666:9
frame #9: 0x0000000104c2a15c llama-bench`llama_decode(ctx=0x0000000114813e00, batch=llama_batch @ 0x000000016b28ad08) at llama.cpp:18489:21
frame #10: 0x0000000104f3ecbc llama-bench`test_prompt(ctx=0x0000000114813e00, n_prompt=32, n_past=0, n_batch=2048, n_threads=12) at llama-bench.cpp:1319:9
frame #11: 0x0000000104f3ae44 llama-bench`main(argc=9, argv=0x000000016b28b940) at llama-bench.cpp:1454:13
frame #12: 0x000000018fbae0e0 dyld`start + 2360
thread #2
frame #0: 0x0000000104baf664 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3018:29
frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x000000012481fe20) at ggml.c:19132:5
frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x000000012481fe20) at ggml.c:19191:37
frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #3
frame #0: 0x0000000104baf690 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3019:54
frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124820040) at ggml.c:19132:5
frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124820040) at ggml.c:19191:37
frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #4
frame #0: 0x0000000104baf664 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3018:29
frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124820260) at ggml.c:19132:5
frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124820260) at ggml.c:19191:37
frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #5
frame #0: 0x0000000104baf690 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3019:54
frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124820480) at ggml.c:19132:5
frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124820480) at ggml.c:19191:37
frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #6
frame #0: 0x0000000104baf690 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3019:54
frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x00000001248206a0) at ggml.c:19132:5
frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001248206a0) at ggml.c:19191:37
frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #7
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x0000000104baecf4 llama-bench`ggml_graph_compute_check_for_work(state=0x00000001248208c0) at ggml.c:19154:17
frame #3: 0x0000000104baeaf8 llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001248208c0) at ggml.c:19189:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #8
frame #0: 0x0000000104baf664 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3018:29
frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124820ae0) at ggml.c:19132:5
frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124820ae0) at ggml.c:19191:37
frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #9
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x0000000104baecf4 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000124820d00) at ggml.c:19154:17
frame #3: 0x0000000104baeaf8 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124820d00) at ggml.c:19189:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #10
frame #0: 0x0000000104baf664 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3018:29
frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124820f20) at ggml.c:19132:5
frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124820f20) at ggml.c:19191:37
frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #11
frame #0: 0x0000000104baf664 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3018:29
frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124821140) at ggml.c:19132:5
frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124821140) at ggml.c:19191:37
frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #12
frame #0: 0x0000000104baf690 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3019:54
frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124821360) at ggml.c:19132:5
frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124821360) at ggml.c:19191:37
frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #13
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x0000000104baecf4 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000124821580) at ggml.c:19154:17
frame #3: 0x0000000104baeaf8 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124821580) at ggml.c:19189:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #14
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x0000000104baecf4 llama-bench`ggml_graph_compute_check_for_work(state=0x00000001248217a0) at ggml.c:19154:17
frame #3: 0x0000000104baeaf8 llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001248217a0) at ggml.c:19189:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #15
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x0000000104baecf4 llama-bench`ggml_graph_compute_check_for_work(state=0x00000001248219c0) at ggml.c:19154:17
frame #3: 0x0000000104baeaf8 llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001248219c0) at ggml.c:19189:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #16
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x0000000104baecf4 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000124821be0) at ggml.c:19154:17
frame #3: 0x0000000104baeaf8 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124821be0) at ggml.c:19189:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #17
frame #0: 0x000000018fef7ea4 libsystem_kernel.dylib`__workq_kernreturn + 8
> I tried to test this on macOS, but it seems to deadlock.
Fixed!
On M2 Max (GGML_NO_METAL=1 GGML_NO_ACCELERATE=1):
| Model | Threads | Test | t/s master | t/s threadpool | Speedup |
|---|---|---|---|---|---|
| llama 7B Q4_0 | 4 | pp512 | 32.97 | 34.87 | 1.06 |
| llama 7B Q4_0 | 4 | tg128 | 18.01 | 18.37 | 1.02 |
| llama 7B Q4_0 | 6 | pp512 | 47.43 | 48.99 | 1.03 |
| llama 7B Q4_0 | 6 | tg128 | 23.10 | 23.32 | 1.01 |
| llama 7B Q4_0 | 8 | pp512 | 49.90 | 55.17 | 1.11 |
| llama 7B Q4_0 | 8 | tg128 | 18.09 | 21.98 | 1.22 |
| llama 7B Q4_0 | 10 | pp512 | 52.50 | 56.69 | 1.08 |
| llama 7B Q4_0 | 10 | tg128 | 14.24 | 8.54 | 0.60 |
| llama 7B Q4_0 | 12 | pp512 | 56.37 | 56.93 | 1.01 |
| llama 7B Q4_0 | 12 | tg128 | 5.02 | 9.44 | 1.88 |
Same thing, but with llama-v3 8B Q4_0_4_4 (for some reason my compiler, AppleClang 15, doesn't support INT8 matmul?)
| Model | Threads | Test | t/s master | t/s threadpool | Speedup |
|---|---|---|---|---|---|
| llama 8B Q4_0_4_4 | 4 | pp512 | 72.44 | 72.83 | 1.01 |
| llama 8B Q4_0_4_4 | 4 | tg128 | 22.29 | 23.50 | 1.05 |
| llama 8B Q4_0_4_4 | 6 | pp512 | 98.71 | 100.21 | 1.02 |
| llama 8B Q4_0_4_4 | 6 | tg128 | 24.63 | 24.44 | 0.99 |
| llama 8B Q4_0_4_4 | 8 | pp512 | 95.86 | 116.17 | 1.21 |
| llama 8B Q4_0_4_4 | 8 | tg128 | 21.19 | 26.28 | 1.24 |
| llama 8B Q4_0_4_4 | 10 | pp512 | 102.37 | 105.18 | 1.03 |
| llama 8B Q4_0_4_4 | 10 | tg128 | 18.63 | 16.98 | 0.91 |
| llama 8B Q4_0_4_4 | 12 | pp512 | 108.08 | 101.18 | 0.94 |
| llama 8B Q4_0_4_4 | 12 | tg128 | 6.22 | 11.39 | 1.83 |
If it crashes, can the error message include "deadpool"?
@slaren lmk if it works for you this time
I tested this again on the M3 Max, but it still seems to deadlock. These are the call stacks of the threads:
(lldb) bt all
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
* frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x000000013681fc00) at ggml.c:19133:5
frame #2: 0x0000000102383190 llama-bench`ggml_graph_compute(cgraph=0x00000001085f81c8, cplan=0x000000016daa2730) at ggml.c:19374:5
frame #3: 0x00000001023c3394 llama-bench`ggml_backend_cpu_graph_compute(backend=0x00006000025ec0b0, cgraph=0x00000001085f81c8) at ggml-backend.c:822:12
frame #4: 0x00000001023bd840 llama-bench`ggml_backend_graph_compute_async(backend=0x00006000025ec0b0, cgraph=0x00000001085f81c8) at ggml-backend.c:282:12
frame #5: 0x00000001023c1864 llama-bench`ggml_backend_sched_compute_splits(sched=0x000000010680c400) at ggml-backend.c:1800:35
frame #6: 0x00000001023c15a0 llama-bench`ggml_backend_sched_graph_compute_async(sched=0x000000010680c400, graph=0x00000001081c0020) at ggml-backend.c:1987:12
frame #7: 0x00000001024e0b58 llama-bench`llama_graph_compute(lctx=0x000000010680fa00, gf=0x00000001081c0020, n_threads=12, threadpool=0x00006000027e43c0) at llama.cpp:14425:5
frame #8: 0x0000000102404938 llama-bench`llama_decode_internal(lctx=0x000000010680fa00, batch_all=llama_batch @ 0x000000016daa2c60) at llama.cpp:14679:9
frame #9: 0x0000000102403a9c llama-bench`llama_decode(ctx=0x000000010680fa00, batch=llama_batch @ 0x000000016daa2d08) at llama.cpp:18499:21
frame #10: 0x0000000102712eac llama-bench`test_prompt(ctx=0x000000010680fa00, n_prompt=32, n_past=0, n_batch=2048, n_threads=12) at llama-bench.cpp:1319:9
frame #11: 0x000000010270f0b8 llama-bench`main(argc=9, argv=0x000000016daa3940) at llama-bench.cpp:1454:13
frame #12: 0x000000018fbae0e0 dyld`start + 2360
thread #2
frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x000000013681fe20) at ggml.c:19133:5
frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x000000013681fe20) at ggml.c:19192:37
frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #3
frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x0000000136820040) at ggml.c:19133:5
frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820040) at ggml.c:19192:37
frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #4
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136820260) at ggml.c:19155:17
frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820260) at ggml.c:19190:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #5
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136820480) at ggml.c:19155:17
frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820480) at ggml.c:19190:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #6
frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x00000001368206a0) at ggml.c:19133:5
frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001368206a0) at ggml.c:19192:37
frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #7
frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x00000001368208c0) at ggml.c:19133:5
frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001368208c0) at ggml.c:19192:37
frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #8
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136820ae0) at ggml.c:19155:17
frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820ae0) at ggml.c:19190:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #9
frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x0000000136820d00) at ggml.c:19133:5
frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820d00) at ggml.c:19192:37
frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #10
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136820f20) at ggml.c:19155:17
frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820f20) at ggml.c:19190:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #11
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136821140) at ggml.c:19155:17
frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136821140) at ggml.c:19190:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #12
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136821360) at ggml.c:19155:17
frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136821360) at ggml.c:19190:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #13
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136821580) at ggml.c:19155:17
frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136821580) at ggml.c:19190:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #14
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x00000001368217a0) at ggml.c:19155:17
frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001368217a0) at ggml.c:19190:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #15
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x00000001368219c0) at ggml.c:19155:17
frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001368219c0) at ggml.c:19190:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #16
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136821be0) at ggml.c:19155:17
frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136821be0) at ggml.c:19190:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #17
frame #0: 0x000000018fef7ea4 libsystem_kernel.dylib`__workq_kernreturn + 8
Built with `LLAMA_DEBUG=1 GGML_NO_METAL=1 make llama-bench && ./llama-bench -m models/llama-2-7b/ggml-model-Q4_0.gguf -n 0 -r 1 -p 32`.
> I tested this again on the M3 Max, but it still seems to deadlock.
Bummer... Thanks for the details! Looks like we've got some trouble in the "ACCELERATE" path. I'll fix it ASAP.
@slaren Turns out there was a bit of a corner case: if you have a graph with only 1 node, ggml_barrier and wait_for_work deadlock on each other. Added a check to handle that specific case.
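To make the failure mode concrete, here is a minimal, self-contained C++ sketch of how this kind of deadlock can arise. It is not the actual ggml code, just the same shape: workers parked in a wait-for-work loop, every thread expected at a barrier, and a code path that could reach the barrier without ever announcing work.

```cpp
// Self-contained sketch of how this kind of deadlock can arise (not the actual
// ggml code). Workers park in a wait-for-work loop and only move on when new
// work is announced; afterwards every thread (main + workers) must reach a
// barrier. If the announce step is skipped for a trivial graph, main waits in
// the barrier for workers that were never woken.
#include <atomic>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

struct Pool {
    std::mutex m;
    std::condition_variable cv;
    int generation = 0;            // bumped when new work is announced
    int n_threads  = 0;            // barrier participants (main + workers)
    std::atomic<int> arrived{0};   // barrier counter
};

// Naive one-shot counting barrier: block until all n_threads have arrived.
static void barrier(Pool & p) {
    p.arrived.fetch_add(1);
    while (p.arrived.load(std::memory_order_acquire) < p.n_threads) {
        std::this_thread::yield();
    }
}

static void worker(Pool & p) {
    int seen = 0;
    {
        std::unique_lock<std::mutex> lk(p.m);
        // wait-for-work: parked until a new generation of work is announced
        p.cv.wait(lk, [&] { return p.generation != seen; });
        seen = p.generation;
    }
    barrier(p);                    // only woken threads ever reach the barrier
}

int main() {
    Pool p;
    const int n_workers = 3;
    const int n_nodes   = 1;       // the problematic case: a graph with a single node
    p.n_threads = n_workers + 1;   // workers + the main thread

    std::vector<std::thread> ts;
    for (int i = 0; i < n_workers; ++i) ts.emplace_back(worker, std::ref(p));

    // The buggy shape of the corner case: for a one-node graph the announce
    // below was effectively skipped, so main entered barrier(p) alone while
    // the workers stayed parked in wait-for-work. The fix is to keep the two
    // sides consistent: always announce the work, or bypass the barrier
    // entirely when a single thread runs the whole graph.
    {
        std::lock_guard<std::mutex> lk(p.m);
        p.generation++;            // announce work so the workers reach the barrier too
    }
    p.cv.notify_all();

    barrier(p);                    // main arrives as the last participant
    for (auto & t : ts) t.join();
    std::printf("computed a %d-node graph without deadlocking\n", n_nodes);
}
```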
Thanks, I was able to run it now. Unfortunately the results are still not very good on my system. Under WSL this threadpool is much slower than OpenMP. A threadpool would be more important on macOS, since OpenMP is not available there, but for me it is also slower on the M3 Max.
M3 Max:
GGML_NO_METAL=1 scripts/compare-commits.sh master threadpool -m models/llama-2-7b/ggml-model-Q4_0.gguf
| CPU | Model | Model Size [GiB] | Test | t/s master | t/s threadpool | Speedup |
|---|---|---|---|---|---|---|
| M3 Max | llama 7B Q4_0 | 3.56 | pp512 | 151.21 | 149.88 | 0.99 |
| M3 Max | llama 7B Q4_0 | 3.56 | tg128 | 30.06 | 26.09 | 0.87 |
13900k + 3090Ti:
OpenMP (GGML_CUDA=1 make llama-bench && ./llama-bench -nkvo 0,1)
| model | size | params | backend | ngl | nkvo | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | pp512 | 5699.53 ± 19.73 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | tg128 | 150.75 ± 1.23 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp512 | 651.63 ± 32.31 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | tg128 | 63.85 ± 3.22 |
Threadpool (GGML_CUDA=1 GGML_NO_OPENMP=1 make llama-bench && ./llama-bench -nkvo 0,1)
| model | size | params | backend | ngl | nkvo | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | pp512 | 5453.33 ± 216.72 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | tg128 | 144.45 ± 0.98 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp512 | 566.43 ± 27.64 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | tg128 | 29.54 ± 0.99 |
build: bebe99c2 (3500)
Ooof... That is quite a bit slower. I'll try to replicate this locally.
@fmz @slaren
I fixed one of the issues that was causing regressions. We were setting the default number of threads in the threadpool using std::thread::hardware_concurrency(). I updated that to use cpu_get_num_math(), so that we exclude E-cores and Hyper-Threading siblings.
This is what was causing regressions with the default command-line args, where the number of threads is not explicitly specified.
We were starting 12 threads on the M2 Max, where only 8 cores are really usable; the same happened on AMD EPYC (using SMT siblings) and Intel 13th/14th Gen (using E-cores).
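For scale, a toy illustration of the difference (the math-core count is hard-coded here as an assumption for the M2 Max example above; the real cpu_get_num_math() in llama.cpp derives it from the CPU topology):

```cpp
// Toy comparison of the two defaults; the math-core count is a hard-coded
// stand-in for cpu_get_num_math() (assumed value for an M2 Max: 8 P-cores).
#include <algorithm>
#include <cstdio>
#include <thread>

int main() {
    // Old default: every logical CPU, including SMT siblings and E-cores.
    const unsigned all_logical = std::thread::hardware_concurrency(); // e.g. 12 on M2 Max

    // New default (conceptually): only the cores worth giving a compute thread.
    const unsigned math_cores = 8;

    const unsigned n_threads = std::min(all_logical, math_cores);
    std::printf("old default: %u threads, new default: %u threads\n", all_logical, n_threads);
}
```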
I'm also working on another fix, specific to llama-bench. Currently (in the threadpool branch) we start a single threadpool with the max number of threads and reuse it for each test. Suppose a test uses 4 threads but we started 12 (on M2 Max or Snapdragon X-Elite): this is suboptimal because the spinning threads interfere with core boosting and such. It's better to start a fresh threadpool for each test.
@fmz @slaren llama-bench has been updated as I described above.
Here are the numbers from M2 Max. I'll share numbers for an AMD EPYC server, Snapdragon X-Elite and Gen-3 a bit later.
CC=clang CXX=clang++ CFLAGS="-march=armv8.7-a" CXXFLAGS="-march=armv8.7-a" GGML_NO_METAL=1 GGML_NO_ACCELERATE=1 \
./scripts/compare-commits.sh master threadpool -m ../gguf/llama-v3.1.q4_0_4_8.gguf -ngl 0 -t 4,6,8
...
+ ./scripts/compare-llama-bench.py -b master -c threadpool
| CPU | Model | Threads | Test | t/s master | t/s threadpool | Speedup |
|:------|:------------------|----------:|:-------|-------------:|-----------------:|----------:|
| | llama 8B Q4_0_4_8 | 4 | pp512 | 64.43 | 64.52 | 1.00 |
| | llama 8B Q4_0_4_8 | 4 | tg128 | 22.53 | 24.36 | 1.08 |
| | llama 8B Q4_0_4_8 | 6 | pp512 | 89.79 | 91.04 | 1.01 |
| | llama 8B Q4_0_4_8 | 6 | tg128 | 24.73 | 26.21 | 1.06 |
| | llama 8B Q4_0_4_8 | 8 | pp512 | 117.14 | 118.67 | 1.01 |
| | llama 8B Q4_0_4_8 | 8 | tg128 | 26.11 | 26.37 | 1.01 |
The performance looks better now; with nkvo it is comparable to OpenMP, which is very good. There is still a performance drop when using the BLAS backend (this includes the default build on macOS, which uses Accelerate). I suspect that this is because the threads are spinning in ggml_graph_compute_check_for_work while the BLAS backend is running. This will also cause the threads to spin while the GPU backend is running when partially offloading, which would be a regression. Rather than requiring the user to disable polling manually, I suggest implementing some kind of backoff and yielding the threads after spinning for a while. The BLAS backend (ggml-blas.cpp) may also benefit from using the threadpool, since it launches several threads to dequantize the weights, and it could also automatically pause the pool during the call to the BLAS library.
Results
GGML_CUDA=1 make llama-bench > /dev/null && ./llama-bench -nkvo 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | nkvo | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | pp512 | 5689.32 ± 13.35 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | tg128 | 154.53 ± 1.04 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp512 | 643.28 ± 31.69 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | tg128 | 64.27 ± 2.21 |
build: 267bf570 (3554)
GGML_CUDA=1 GGML_NO_OPENMP=1 make llama-bench > /dev/null && ./llama-bench -nkvo 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | nkvo | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | pp512 | 5674.51 ± 37.77 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | tg128 | 153.30 ± 0.48 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp512 | 646.42 ± 32.41 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | tg128 | 62.98 ± 2.94 |
build: 267bf570 (3554)
GGML_BLIS=1 make llama-bench > /dev/null && ./llama-bench -p 128 -n 32
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | BLAS | 16 | pp128 | 47.55 ± 0.17 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | BLAS | 16 | tg32 | 20.79 ± 0.10 |
build: 267bf570 (3554)
GGML_BLIS=1 GGML_NO_OPENMP=1 make llama-bench > /dev/null && ./llama-bench -p 128 -n 32
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | BLAS | 16 | pp128 | 33.47 ± 0.48 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | BLAS | 16 | tg32 | 20.58 ± 0.07 |
build: 267bf570 (3554)
| CPU | Model | Threads | Test | t/s master | t/s threadpool | Speedup |
|---|---|---|---|---|---|---|
| M3 Max | llama 7B all F32 | 4 | pp512 | 150.03 | 134.77 | 0.90 |
| M3 Max | llama 7B all F32 | 4 | tg128 | 4.76 | 4.20 | 0.88 |
| M3 Max | llama 7B all F32 | 8 | pp512 | 155.66 | 115.40 | 0.74 |
| M3 Max | llama 7B all F32 | 8 | tg128 | 4.76 | 4.35 | 0.91 |
| M3 Max | llama 7B all F32 | 12 | pp512 | 156.19 | 94.43 | 0.60 |
| M3 Max | llama 7B all F32 | 12 | tg128 | 4.66 | 4.33 | 0.93 |
| M3 Max | llama 7B Q4_0 | 4 | pp512 | 142.43 | 144.89 | 1.02 |
| M3 Max | llama 7B Q4_0 | 4 | tg128 | 21.04 | 20.74 | 0.99 |
| M3 Max | llama 7B Q4_0 | 8 | pp512 | 150.08 | 142.22 | 0.95 |
| M3 Max | llama 7B Q4_0 | 8 | tg128 | 28.22 | 28.14 | 1.00 |
| M3 Max | llama 7B Q4_0 | 12 | pp512 | 150.55 | 120.62 | 0.80 |
| M3 Max | llama 7B Q4_0 | 12 | tg128 | 30.10 | 30.26 | 1.01 |
| M3 Max | stories260K | 4 | pp512 | 52491.62 | 65492.68 | 1.25 |
| M3 Max | stories260K | 4 | tg128 | 8417.80 | 12262.68 | 1.46 |
| M3 Max | stories260K | 8 | pp512 | 59893.07 | 94300.47 | 1.57 |
| M3 Max | stories260K | 8 | tg128 | 3746.70 | 5639.87 | 1.51 |
| M3 Max | stories260K | 12 | pp512 | 53756.90 | 115958.90 | 2.16 |
| M3 Max | stories260K | 12 | tg128 | 2507.28 | 4333.34 | 1.73 |
@slaren
Awesome! Thanks for checking out the latest. We've been doing lots of profiling and tuning. Every time I'm about to send an updated perf report on Snapdragons and the M2, I find yet another thing to improve :) In my testing we're doing really well with the CPU backend (especially on ARM64-based systems); with other backends, as you pointed out, the spinning threads sometimes get in the way and cause regressions. I'll try your suggestions.
btw, we might just flip the default back to non-polling. Technically, polling is mostly useful for llama-bench, to match OpenMP behavior/numbers in that case. When I looked at the original profiles, I saw that the threadpool was doing a lot more context switches than OpenMP during the token-gen test. Polling removes those context switches and we get even better numbers now. It might make sense to make that a bit of a special case (i.e. default to polling for the CPU-backend bench, otherwise default to non-polling), or use some hybrid approach as you suggested.
@slaren @fmz
I managed to further improve the threadpool signaling (reducing the number of wake-ups, etc.) and also introduced a hybrid polling mode, which is now the default.
--poll now sets the polling level, i.e. how aggressively we poll: 0 means no polling, 1 means roughly 128K polling rounds before falling back to a cond wait, 2 means 2x128K rounds, and so on. The default is 50, which seems to work well on the machines I have here (see the report).
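A self-contained sketch of what such a hybrid wait can look like (names and constants are illustrative, not the exact ggml implementation): spin on an atomic for roughly poll_level x 128K rounds, then fall back to a condition-variable wait so idle threads stop burning CPU.

```cpp
// Hybrid polling sketch: spin for a bounded number of rounds, then sleep on a
// condition variable. Poll level 0 = never spin; higher levels spin longer.
// Names and constants are illustrative, not the exact ggml implementation.
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <cstdio>
#include <mutex>
#include <thread>

struct WorkQueue {
    std::atomic<uint64_t> generation{0}; // bumped by the producer when work arrives
    std::mutex m;
    std::condition_variable cv;
};

// Wait until `generation` moves past `seen`: poll first, sleep afterwards.
static uint64_t hybrid_wait(WorkQueue & q, uint64_t seen, int poll_level) {
    const uint64_t spin_budget = (uint64_t) poll_level * 128 * 1024;
    for (uint64_t i = 0; i < spin_budget; ++i) {
        const uint64_t g = q.generation.load(std::memory_order_acquire);
        if (g != seen) return g;             // work showed up while polling
    }
    // Polling budget exhausted: go to sleep until the producer notifies us.
    std::unique_lock<std::mutex> lk(q.m);
    q.cv.wait(lk, [&] { return q.generation.load() != seen; });
    return q.generation.load();
}

int main() {
    WorkQueue q;
    std::thread consumer([&] {
        uint64_t seen = 0;
        for (int i = 0; i < 3; ++i) {
            seen = hybrid_wait(q, seen, /*poll_level=*/1);
            std::printf("got work #%llu\n", (unsigned long long) seen);
        }
    });
    for (int i = 0; i < 3; ++i) {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
        { std::lock_guard<std::mutex> lk(q.m); q.generation.fetch_add(1); }
        q.cv.notify_all();                   // wakes waiters that already went to sleep
    }
    consumer.join();
}
```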
The regression with the Metal backend should be fixed now (see the report below).
The BLIS backend will need some further tuning. Though I wonder how useful it is, given how much slower it is compared to the plain CPU backend with the latest CPU features.
I included the latest llama-bench and a few simple llama-cli results for M2 Max, AMD EPYC 7543, Snapdragon X-Elite and Snapdragon Gen 3 with Llama v3.1 8B and a smaller Llama-based 314M model (generated using https://arxiv.org/pdf/2403.00858)
Results
M2 Max (default build)
make clean; make llama-bench
~/src/llama.cpp-master$ ./llama-bench -m ../gguf/llama-v3.1.q4_0.gguf -p 128 -n 32
| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Metal | 99 | pp128 | 469.51 ± 1.07 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Metal | 99 | tg32 | 58.92 ± 0.22 |
build: 3071c0a5 (3557)
~/src/llama.cpp-threadpool$ ./llama-bench -m ../gguf/llama-v3.1.q4_0.gguf -p 128 -n 32
| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Metal | 99 | pp128 | 470.04 ± 0.25 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Metal | 99 | tg32 | 58.78 ± 0.18 |
build: 323181f2 (3573)
M2 Max (llvm build to enable MATMUL_INT8)
CC=clang CXX=clang++ CFLAGS="-march=armv8.7-a" CXXFLAGS="-march=armv8.7-a" GGML_NO_METAL=1 GGML_NO_ACCELERATE=1 make -j32 llama-bench llama-cli
Note: Q4_0_4_X is broken with ACCELERATE, but BLIS and ACCELERATE are much slower anyway.
~/src/llama.cpp-master$ ./llama-bench -m ../gguf/llama-v3.1.q4_0_4_8.gguf -t 4,6,8
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 4 | pp512 | 63.69 ± 0.37 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 4 | tg128 | 22.51 ± 0.11 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 6 | pp512 | 90.60 ± 1.19 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 6 | tg128 | 24.73 ± 0.06 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 8 | pp512 | 112.72 ± 2.76 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 8 | tg128 | 25.21 ± 0.86 |
build: 3071c0a5 (3557)
~/src/llama.cpp-threadpool$ ./llama-bench -m ../gguf/llama-v3.1.q4_0_4_8.gguf -t 4,6,8
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 4 | pp512 | 65.59 ± 0.70 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 4 | tg128 | 24.80 ± 0.13 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 6 | pp512 | 92.72 ± 1.75 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 6 | tg128 | 26.12 ± 0.12 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 8 | pp512 | 116.72 ± 1.33 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 8 | tg128 | 26.34 ± 0.05 |
build: 323181f2 (3573)
~/src/llama.cpp-master$ ./llama-bench -m ../gguf/llama-v3-314m.q4_0_4_8.gguf -t 4,6,8
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 4 | pp512 | 6216.28 ± 206.77 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 4 | tg128 | 405.77 ± 0.85 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 6 | pp512 | 9223.23 ± 105.64 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 6 | tg128 | 544.30 ± 0.48 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 8 | pp512 | 12073.97 ± 76.67 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 8 | tg128 | 616.44 ± 1.81 |
build: 3071c0a5 (3557)
~/src/llama.cpp-threadpool$ ./llama-bench -m ../gguf/llama-v3-314m.q4_0_4_8.gguf -t 4,6,8
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 4 | pp512 | 6680.06 ± 70.44 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 4 | tg128 | 503.61 ± 1.44 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 6 | pp512 | 9431.40 ± 13.32 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 6 | tg128 | 638.16 ± 4.39 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 8 | pp512 | 12292.80 ± 40.62 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 8 | tg128 | 674.69 ± 12.01 |
build: 323181f2 (3573)
M2 Max (BLIS backend)
~/src/llama.cpp-master$ ./llama-bench -m ../gguf/llama-v3.1.q4_0.gguf -p 64 -n 16
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | BLAS | 8 | pp64 | 48.69 ± 0.08 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | BLAS | 8 | tg16 | 24.68 ± 0.10 |
build: 3071c0a5 (3557)
~/src/llama.cpp-threadpool$ ./llama-bench -m ../gguf/llama-v3.1.q4_0.gguf -p 64 -n 16
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | BLAS | 8 | pp64 | 35.90 ± 0.97 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | BLAS | 8 | tg16 | 24.68 ± 0.32 |
build: 323181f2 (3573)
~/src/llama.cpp-threadpool$ ./llama-bench -m ../gguf/llama-v3.1.q4_0.gguf -p 64 -n 16 --poll 0
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | BLAS | 8 | pp64 | 46.44 ± 2.06 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | BLAS | 8 | tg16 | 24.81 ± 0.03 |
build: 323181f2 (3573)
AMD EPYC (default build)
Note: Q4_K has the best perf on the EPYC.
make -j32 llama-bench
llama.cpp-master$ ./llama-bench -m ../gguf/llama-v3.1-8B.q4_k_s-pure.gguf -t 16,32,64 -p 64 -n 16
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | CPU | 16 | pp64 | 63.11 ± 0.14 |
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | CPU | 16 | tg16 | 18.82 ± 0.88 |
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | CPU | 32 | pp64 | 116.73 ± 3.24 |
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | CPU | 32 | tg16 | 22.86 ± 0.71 |
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | CPU | 64 | pp64 | 141.81 ± 0.55 |
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | CPU | 64 | tg16 | 21.13 ± 0.34 |
build: 3071c0a5 (3557)
GGML_NO_OPENMP=1 make -j32 llama-bench
llama.cpp-threadpool$ ./llama-bench -m ../gguf/llama-v3.1-8B.q4_k_s-pure.gguf -t 16,32,64 -p 64 -n 16
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | CPU | 16 | pp64 | 62.82 ± 0.96 |
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | CPU | 16 | tg16 | 16.09 ± 0.28 |
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | CPU | 32 | pp64 | 110.30 ± 1.07 |
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | CPU | 32 | tg16 | 19.15 ± 0.82 |
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | CPU | 64 | pp64 | 122.25 ± 5.91 |
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | CPU | 64 | tg16 | 19.21 ± 0.47 |
build: 323181f2 (3573)
A real use case does much better:
llama.cpp-master$ ./llama-cli -m ../gguf/llama-v3.1-8B.q4_k_s-pure.gguf -t 32 --seed 42 -p 'what is the most popular cookie in the world? (prease be brief)' -n 40
...
what is the most popular cookie in the world? (prease be brief) It is chocolate chip cookie. It is a classic favorite and one of the most beloved cookies.
What is the most popular cookie in the world?
According to various sources, including Google Trends and online baking
llama_print_timings: load time = 4399.32 ms
llama_print_timings: sample time = 3.48 ms / 40 runs ( 0.09 ms per token, 11490.95 tokens per second)
llama_print_timings: prompt eval time = 211.88 ms / 16 tokens ( 13.24 ms per token, 75.51 tokens per second)
llama_print_timings: eval time = 2267.88 ms / 39 runs ( 58.15 ms per token, 17.20 tokens per second)
llama_print_timings: total time = 2489.99 ms / 55 tokens
llama.cpp-threadpool$ ./llama-cli -m ../gguf/llama-v3.1-8B.q4_k_s-pure.gguf -t 32 --seed 42 -p 'what is the most popular cookie in the world? (prease be brief)' -n 40
...
what is the most popular cookie in the world? (prease be brief) It is chocolate chip cookie. It is a classic favorite and one of the most beloved cookies.
What is the most popular cookie in the world?
According to various sources, including Google Trends and online baking
llama_print_timings: load time = 4250.79 ms
llama_print_timings: sample time = 2.92 ms / 40 runs ( 0.07 ms per token, 13708.02 tokens per second)
llama_print_timings: prompt eval time = 203.73 ms / 16 tokens ( 12.73 ms per token, 78.54 tokens per second)
llama_print_timings: eval time = 2072.78 ms / 39 runs ( 53.15 ms per token, 18.82 tokens per second)
llama_print_timings: total time = 2285.96 ms / 55 tokens
AMD EPYC (BLIS backend)
llama.cpp-master$ ./llama-bench -m ../gguf/llama-v3.1-8B.q4_k_s-pure.gguf -t 32 -p 64 -n 16
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | BLAS | 32 | pp64 | 29.85 ± 0.25 |
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | BLAS | 32 | tg16 | 19.71 ± 0.63 |
build: 3071c0a5 (3557)
llama.cpp-threadpool$ ./llama-bench -m ../gguf/llama-v3.1-8B.q4_k_s-pure.gguf -t 32 -p 64 -n 16
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | BLAS | 32 | pp64 | 21.50 ± 1.53 |
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | BLAS | 32 | tg16 | 17.49 ± 0.17 |
Snapdragon X-Elite (default llvm-windows build)
llama.cpp-master ./build-arm64-windows-llvm-release/bin/llama-bench.exe -m ../gguf/llama-v3.1.q4_0_4_8.gguf -t 4,6,8,10,12
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 4 | pp512 | 70.06 ± 0.06 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 4 | tg128 | 21.20 ± 0.11 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 6 | pp512 | 100.08 ± 1.78 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 6 | tg128 | 21.84 ± 0.15 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 8 | pp512 | 125.61 ± 1.52 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 8 | tg128 | 19.30 ± 3.68 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 10 | pp512 | 144.20 ± 5.02 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 10 | tg128 | 22.60 ± 0.21 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 12 | pp512 | 175.83 ± 5.59 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 12 | tg128 | 9.83 ± 7.34 |
build: 3071c0a5 (3557)
~/src/llama.cpp-threadpool ./build-arm64-windows-llvm-release/bin/llama-bench.exe -m ../gguf/llama-v3.1.q4_0_4_8.gguf -t 4,6,8,10,12
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 4 | pp512 | 70.38 ± 0.22 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 4 | tg128 | 21.71 ± 0.17 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 6 | pp512 | 100.66 ± 1.74 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 6 | tg128 | 23.14 ± 0.20 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 8 | pp512 | 126.16 ± 2.03 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 8 | tg128 | 22.91 ± 0.09 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 10 | pp512 | 151.48 ± 3.02 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 10 | tg128 | 23.61 ± 0.22 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 12 | pp512 | 185.21 ± 2.95 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 12 | tg128 | 21.82 ± 1.44 |
build: 323181f2 (3573)
llama.cpp-master ./build-arm64-windows-llvm-release/bin/llama-bench.exe -m ../gguf/llama-v3-314m.q4_0_4_8.gguf -t 4,6,8,10,12
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 4 | pp512 | 5352.91 ± 17.93 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 4 | tg128 | 345.11 ± 1.41 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 6 | pp512 | 7660.97 ± 77.94 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 6 | tg128 | 405.47 ± 5.90 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 8 | pp512 | 9699.85 ± 62.08 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 8 | tg128 | 438.34 ± 5.46 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 10 | pp512 | 11651.56 ± 158.28 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 10 | tg128 | 436.98 ± 4.93 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 12 | pp512 | 12893.53 ± 943.14 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 12 | tg128 | 408.23 ± 7.20 |
build: 3071c0a5 (3557)
~/src/llama.cpp-threadpool ./build-arm64-windows-llvm-release/bin/llama-bench.exe -m ../gguf/llama-v3-314m.q4_0_4_8.gguf -t 4,6,8,10,12
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 4 | pp512 | 5378.80 ± 8.04 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 4 | tg128 | 360.03 ± 1.04 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 6 | pp512 | 7747.20 ± 62.34 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 6 | tg128 | 530.54 ± 4.81 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 8 | pp512 | 9895.07 ± 49.93 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 8 | tg128 | 626.93 ± 4.11 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 10 | pp512 | 11836.86 ± 111.19 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 10 | tg128 | 667.55 ± 6.24 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 12 | pp512 | 13231.19 ± 524.42 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 12 | tg128 | 653.97 ± 21.23 |
build: 323181f2 (3573)
llama.cpp-master
./build-arm64-windows-llvm-release/bin/llama-cli.exe --no-mmap -m ../gguf/llama-v3.1.q4_0_4_8.gguf -tb 10 -t 6 --ctx-size 2048 --seed 42 -p '<|begin_of_text|> what is the most popular cookie in the world? (please be brief)' -n 64
...
what is the most popular cookie in the world? (please be brief)
The chocolate chip cookie is the most popular cookie in the world. (according to various sources)
Note: This answer is brief because the question asks for a brief response.
Let me know if you'd like me to elaborate!
(If you'd like more information, I'd be happy to provide some
llama_print_timings: load time = 1164.20 ms
llama_print_timings: sample time = 3.58 ms / 64 runs ( 0.06 ms per token, 17892.09 tokens per second)
llama_print_timings: prompt eval time = 90.19 ms / 16 tokens ( 5.64 ms per token, 177.41 tokens per second)
llama_print_timings: eval time = 3038.12 ms / 63 runs ( 48.22 ms per token, 20.74 tokens per second)
llama_print_timings: total time = 3141.14 ms / 79 tokens
llama.cpp-threadpool
./build-arm64-windows-llvm-release/bin/llama-cli.exe --no-mmap -m ../gguf/llama-v3.1.q4_0_4_8.gguf -tb 10 -t 6 --ctx-size 2048 --seed 42 -p '<|begin_of_text|> what is the most popular cookie in the world? (please be brief)' -n 64
...
what is the most popular cookie in the world? (please be brief)
The chocolate chip cookie is the most popular cookie in the world. (according to various sources)
Note: This answer is brief because the question asks for a brief response.
Let me know if you'd like me to elaborate!
(If you'd like more information, I'd be happy to provide some
llama_print_timings: load time = 1168.97 ms
llama_print_timings: sample time = 3.74 ms / 64 runs ( 0.06 ms per token, 17103.15 tokens per second)
llama_print_timings: prompt eval time = 87.27 ms / 16 tokens ( 5.45 ms per token, 183.33 tokens per second)
llama_print_timings: eval time = 2940.82 ms / 63 runs ( 46.68 ms per token, 21.42 tokens per second)
llama_print_timings: total time = 3042.01 ms / 79 tokens
Snapdragon Gen 3 (Galaxy S24 Ultra)
Default Android NDK build using the following CMake preset
{
"name": "arm64-android",
"cacheVariables": {
"ANDROID_ABI": "arm64-v8a",
"ANDROID_PLATFORM": "android-31",
"CMAKE_TOOLCHAIN_FILE": "$env{NDK}/build/cmake/android.toolchain.cmake",
"CMAKE_C_FLAGS": "-march=armv8.7a -fvectorize -ffp-model=fast -fno-finite-math-only -flto -D_GNU_SOURCE",
"CMAKE_CXX_FLAGS": "-march=armv8.7a -fvectorize -ffp-model=fast -fno-finite-math-only -flto -D_GNU_SOURCE",
"CMAKE_C_FLAGS_RELEASE": "-O3 -DNDEBUG",
"CMAKE_CXX_FLAGS_RELEASE": "-O3 -DNDEBUG"
}
}
The threadpool branch is built with -D GGML_OPENMP=OFF.
adb shell "cd /data/local/tmp/lmcp; ./run-bench.sh master llama-v3.1.q4_0_4_8.gguf -t 4,6 -p 32 -n 16"
export 'LD_LIBRARY_PATH=/data/local/tmp/lmcp/master'
taskset fc ./master/llama-bench --mmap 0 --n-gpu-layers 0 -m llama-v3.1.q4_0_4_8.gguf -t 4,6 -p 32 -n 16
| model | size | params | backend | threads | mmap | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 4 | 0 | pp32 | 40.43 ± 1.08 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 4 | 0 | tg16 | 10.29 ± 0.13 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 6 | 0 | pp32 | 48.27 ± 0.24 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 6 | 0 | tg16 | 10.23 ± 0.13 |
adb shell "cd /data/local/tmp/lmcp; ./run-bench.sh threadpool llama-v3.1.q4_0_4_8.gguf -t 4,6 -p 32 -n 16"`
export 'LD_LIBRARY_PATH=/data/local/tmp/lmcp/threadpool'
taskset fc ./threadpool/llama-bench --mmap 0 --n-gpu-layers 0 -m llama-v3.1.q4_0_4_8.gguf -t 4,6 -p 32 -n 16
| model | size | params | backend | threads | mmap | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 4 | 0 | pp32 | 40.53 ± 0.77 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 4 | 0 | tg16 | 10.59 ± 0.16 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 6 | 0 | pp32 | 48.69 ± 0.31 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 6 | 0 | tg16 | 10.40 ± 0.13 |
adb shell "cd /data/local/tmp/lmcp; ./run-bench.sh master llama-v3-314m.q4_0_4_8.gguf -t 4,6 -p 32 -n 16"`
export 'LD_LIBRARY_PATH=/data/local/tmp/lmcp/master'
taskset fc ./master/llama-bench --mmap 0 --n-gpu-layers 0 -m llama-v3-314m.q4_0_4_8.gguf -t 4,6 -p 32 -n 16
| model | size | params | backend | threads | mmap | test | t/s |
|---|---|---|---|---|---|---|---|
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 4 | 0 | pp32 | 4006.52 ± 72.66 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 4 | 0 | tg16 | 303.58 ± 5.96 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 6 | 0 | pp32 | 5065.32 ± 82.08 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 6 | 0 | tg16 | 302.77 ± 20.62 |
adb shell "cd /data/local/tmp/lmcp; ./run-bench.sh threadpool llama-v3-314m.q4_0_4_8.gguf -t 4,6 -p 32 -n 16"`
export 'LD_LIBRARY_PATH=/data/local/tmp/lmcp/threadpool'
taskset fc ./threadpool/llama-bench --mmap 0 --n-gpu-layers 0 -m llama-v3-314m.q4_0_4_8.gguf -t 4,6 -p 32 -n 16
| model | size | params | backend | threads | mmap | test | t/s |
|---|---|---|---|---|---|---|---|
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 4 | 0 | pp32 | 4097.12 ± 22.80 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 4 | 0 | tg16 | 312.81 ± 1.88 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 6 | 0 | pp32 | 5127.89 ± 35.28 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 6 | 0 | tg16 | 337.77 ± 6.57 |
Edit: I totally forgot that GGML_OPENMP is disabled only for CMake builds... so the numbers below are OpenMP-only. (Interesting that there is any change at all...)
@slaren @max-krasnyansky latest CUDA numbers:
Stories260K: $ ./scripts/compare-llama-bench.py -b master -c threadpool
| GPU | Model | NKVO | Test | t/s master | t/s threadpool | Speedup |
|---|---|---|---|---|---|---|
| RTX 3060 Laptop GPU | llama ?B all F32 (guessed) | No | pp512 | 199949.37 | 199425.82 | 1.00 |
| RTX 3060 Laptop GPU | llama ?B all F32 (guessed) | No | tg128 | 2472.27 | 2585.31 | 1.05 |
| RTX 3060 Laptop GPU | llama ?B all F32 (guessed) | Yes | pp512 | 12503.69 | 12627.24 | 1.01 |
| RTX 3060 Laptop GPU | llama ?B all F32 (guessed) | Yes | tg128 | 1632.84 | 1642.70 | 1.01 |
Llama v2 7B:
| GPU | Model | NKVO | Test | t/s master | t/s threadpool | Speedup |
|---|---|---|---|---|---|---|
| RTX 3060 Laptop GPU | llama 7B Q4_0 | No | pp512 | 1654.38 | 1658.14 | 1.00 |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | No | tg128 | 66.71 | 66.82 | 1.00 |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | Yes | pp512 | 288.97 | 295.97 | 1.02 |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | Yes | tg128 | 54.52 | 54.90 | 1.01 |
This is with OpenMP disabled on the threadpool branch:
$ LLAMA_CUDA=1 ./scripts/compare-commits.sh master threadpool -nkvo 0,1 -m models/7B/llama7b.gguf
| GPU | Model | NKVO | Test | t/s master | t/s threadpool | Speedup |
|---|---|---|---|---|---|---|
| RTX 3060 Laptop GPU | llama 7B Q4_0 | No | pp512 | 1657.23 | 1659.78 | 1.00 |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | No | tg128 | 66.77 | 66.24 | 0.99 |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | Yes | pp512 | 302.17 | 301.35 | 1.00 |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | Yes | tg128 | 55.02 | 54.87 | 1.00 |
Can confirm it's slightly worse on stories260K:
| GPU | Model | NKVO | Test | t/s master | t/s threadpool | Speedup |
|---|---|---|---|---|---|---|
| RTX 3060 Laptop GPU | llama ?B all F32 (guessed) | No | pp512 | 199562.93 | 193086.43 | 0.97 |
| RTX 3060 Laptop GPU | llama ?B all F32 (guessed) | No | tg128 | 2517.85 | 2399.51 | 0.95 |
| RTX 3060 Laptop GPU | llama ?B all F32 (guessed) | Yes | pp512 | 12702.00 | 12819.76 | 1.01 |
| RTX 3060 Laptop GPU | llama ?B all F32 (guessed) | Yes | tg128 | 1646.64 | 1628.08 | 0.99 |
@slaren
@fmz and I worked on further improvements (removing special cases, reducing branches, etc) and at this point it seems like it should be good to merge.
I believe the BLAS/BLIS backend might need further work. I took a look at it and realized that ggml-blas.cpp wants a generic threadpool that executes arbitrary functions, while the threadpool we've added so far is designed specifically for graph_compute. It's of course possible to update it and make it more generic, assuming there is interest in updating the BLAS/BLIS backend. From my testing it seems to be generally much slower, so I'm not sure how much we want to invest in it. Perhaps we can just add a check in make/cmake that the BLAS backend requires OpenMP for now?
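For context, this is roughly the shape of the "generic" interface ggml-blas.cpp would want (a standalone sketch with made-up names, not this PR's API): a persistent set of workers that can run an arbitrary chunked function, e.g. dequantizing weight rows before the BLAS call, instead of being hard-wired to graph_compute.

```cpp
#include <condition_variable>
#include <cstdint>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Persistent workers that can execute any chunked task (illustrative sketch).
class generic_pool {
public:
    explicit generic_pool(int n_threads) {
        for (int i = 0; i < n_threads; i++) {
            workers.emplace_back([this, i] { worker_loop(i); });
        }
    }
    ~generic_pool() {
        { std::lock_guard<std::mutex> lock(mtx); stop = true; }
        cv.notify_all();
        for (auto & w : workers) { w.join(); }
    }
    // Run fn(worker_index, n_workers) on every worker and wait for completion.
    void run(const std::function<void(int, int)> & fn) {
        std::unique_lock<std::mutex> lock(mtx);
        task = &fn;
        pending = (int) workers.size();
        generation++;
        cv.notify_all();
        done_cv.wait(lock, [this] { return pending == 0; });
        task = nullptr;
    }
private:
    void worker_loop(int idx) {
        uint64_t seen = 0;
        while (true) {
            const std::function<void(int, int)> * fn = nullptr;
            {
                std::unique_lock<std::mutex> lock(mtx);
                cv.wait(lock, [&] { return stop || generation != seen; });
                if (stop) { return; }
                seen = generation;
                fn   = task;
            }
            (*fn)(idx, (int) workers.size());
            std::lock_guard<std::mutex> lock(mtx);
            if (--pending == 0) { done_cv.notify_one(); }
        }
    }
    std::vector<std::thread>              workers;
    std::mutex                            mtx;
    std::condition_variable               cv, done_cv;
    const std::function<void(int, int)> * task       = nullptr;
    int                                   pending    = 0;
    uint64_t                              generation = 0;
    bool                                  stop       = false;
};
```

ggml-blas.cpp could then call something like pool.run(...) for the dequantization step, reusing the same parked workers instead of spawning std::future tasks per matmul.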
Perf numbers on the Snapdragons and the M2 are a bit better but overall similar to what I shared above; the perf profiles are looking cleaner though (things like total branches, missed branches, etc).
Here is a fresh run from a Ryzen 9 3950X + RTX 3080 on Ubuntu 22.04, testing the nkvo scenario that had regressions before.
GGML_CUDA=1 GGML_NO_OPENMP=1 make -j16 llama-cli llama-bench
llama.cpp-master$ nice -20 ./llama-bench -m ../gguf/llama-v3.1.q4_0.gguf -ngl 0,99 -nkvo 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | nkvo | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 0 | 0 | pp512 | 1090.70 ± 22.64 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 0 | 0 | tg128 | 10.01 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 0 | 1 | pp512 | 527.11 ± 0.31 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 0 | 1 | tg128 | 10.03 ± 0.00 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 99 | 0 | pp512 | 4384.25 ± 13.19 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 99 | 0 | tg128 | 110.16 ± 0.12 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 99 | 1 | pp512 | 798.64 ± 6.35 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 99 | 1 | tg128 | 95.40 ± 0.15 |
build: 06943a69 (3581)
GGML_CUDA=1 GGML_NO_OPENMP=1 make -j16 llama-cli llama-bench
llama.cpp-threadpool$ nice -20 ./llama-bench -m ../gguf/llama-v3.1.q4_0.gguf -ngl 0,99 -nkvo 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | nkvo | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 0 | 0 | pp512 | 1114.06 ± 0.45 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 0 | 0 | tg128 | 9.97 ± 0.02 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 0 | 1 | pp512 | 534.33 ± 0.28 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 0 | 1 | tg128 | 9.99 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 99 | 0 | pp512 | 4369.85 ± 8.77 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 99 | 0 | tg128 | 109.93 ± 0.11 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 99 | 1 | pp512 | 824.77 ± 6.40 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 99 | 1 | tg128 | 96.29 ± 0.22 |
build: 9cd5a61d (3599)
I see that one of the server tests failed in the CI. I just ran the same thing locally and can't reproduce the failure. Will keep an eye on it.
The BLAS backend is still important at least in macOS because Accelerate is significantly faster. OpenMP is also not available in macOS. In my opinion this is the most important use case, because other platforms have access to good implementations of OpenMP.
I think I hit a deadlock when testing with LLAMA_DEBUG=1 GGML_BLIS=1 GGML_NO_OPENMP=1 GGML_NO_LLAMAFILE=1:
./llama-bench -m models/stories260K.gguf -r 10 -t 16
Note: 16 other threads idling in OpenMP (from BLIS) omitted.
Thread 16 (Thread 0x79f5dfbed6c0 (LWP 22390) "llama-bench"):
#0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335
#1 __cpu_relax () at ggml/src/ggml.c:3060
#2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142
#3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabf428) at ggml/src/ggml.c:19206
#4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabf428) at ggml/src/ggml.c:19296
#5 0x000079f5e8a97b5a in start_thread (arg=
Thread 15 (Thread 0x79f5e03ee6c0 (LWP 22389) "llama-bench"):
#0 0x00005799cc04587a in __cpu_relax () at ggml/src/ggml.c:3061
#1 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142
#2 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabf200) at ggml/src/ggml.c:19206
#3 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabf200) at ggml/src/ggml.c:19296
#4 0x000079f5e8a97b5a in start_thread (arg=
Thread 14 (Thread 0x79f5e0bef6c0 (LWP 22388) "llama-bench"):
#0 ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3139
#1 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabefd8) at ggml/src/ggml.c:19206
#2 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabefd8) at ggml/src/ggml.c:19296
#3 0x000079f5e8a97b5a in start_thread (arg=
Thread 13 (Thread 0x79f5e13f06c0 (LWP 22387) "llama-bench"):
#0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335
#1 __cpu_relax () at ggml/src/ggml.c:3060
#2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142
#3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabedb0) at ggml/src/ggml.c:19206
#4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabedb0) at ggml/src/ggml.c:19296
#5 0x000079f5e8a97b5a in start_thread (arg=
Thread 12 (Thread 0x79f5e1bf16c0 (LWP 22386) "llama-bench"):
#0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335
#1 __cpu_relax () at ggml/src/ggml.c:3060
#2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142
#3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabeb88) at ggml/src/ggml.c:19206
#4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabeb88) at ggml/src/ggml.c:19296
#5 0x000079f5e8a97b5a in start_thread (arg=
Thread 11 (Thread 0x79f5e23f26c0 (LWP 22385) "llama-bench"):
#0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335
#1 __cpu_relax () at ggml/src/ggml.c:3060
#2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142
#3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabe960) at ggml/src/ggml.c:19206
#4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabe960) at ggml/src/ggml.c:19296
#5 0x000079f5e8a97b5a in start_thread (arg=
Thread 10 (Thread 0x79f5e2bf36c0 (LWP 22384) "llama-bench"):
#0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335
#1 __cpu_relax () at ggml/src/ggml.c:3060
#2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142
#3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabe738) at ggml/src/ggml.c:19206
#4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabe738) at ggml/src/ggml.c:19296
#5 0x000079f5e8a97b5a in start_thread (arg=
Thread 9 (Thread 0x79f5e33f46c0 (LWP 22383) "llama-bench"):
#0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335
#1 __cpu_relax () at ggml/src/ggml.c:3060
#2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142
#3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabe510) at ggml/src/ggml.c:19206
#4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabe510) at ggml/src/ggml.c:19296
#5 0x000079f5e8a97b5a in start_thread (arg=
Thread 8 (Thread 0x79f5e3bf56c0 (LWP 22382) "llama-bench"):
#0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335
#1 __cpu_relax () at ggml/src/ggml.c:3060
#2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142
#3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabe2e8) at ggml/src/ggml.c:19206
#4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabe2e8) at ggml/src/ggml.c:19296
#5 0x000079f5e8a97b5a in start_thread (arg=
Thread 7 (Thread 0x79f5e43f66c0 (LWP 22381) "llama-bench"):
#0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335
#1 __cpu_relax () at ggml/src/ggml.c:3060
#2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142
#3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabe0c0) at ggml/src/ggml.c:19206
#4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabe0c0) at ggml/src/ggml.c:19296
#5 0x000079f5e8a97b5a in start_thread (arg=
Thread 6 (Thread 0x79f5e4bf76c0 (LWP 22380) "llama-bench"):
#0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335
#1 __cpu_relax () at ggml/src/ggml.c:3060
#2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142
#3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabde98) at ggml/src/ggml.c:19206
#4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabde98) at ggml/src/ggml.c:19296
#5 0x000079f5e8a97b5a in start_thread (arg=
Thread 5 (Thread 0x79f5e53f86c0 (LWP 22379) "llama-bench"):
#0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335
#1 __cpu_relax () at ggml/src/ggml.c:3060
#2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142
#3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabdc70) at ggml/src/ggml.c:19206
#4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabdc70) at ggml/src/ggml.c:19296
#5 0x000079f5e8a97b5a in start_thread (arg=
Thread 4 (Thread 0x79f5e5bf96c0 (LWP 22378) "llama-bench"):
#0 0x00005799cc04587a in __cpu_relax () at ggml/src/ggml.c:3061
#1 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142
#2 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabda48) at ggml/src/ggml.c:19206
#3 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabda48) at ggml/src/ggml.c:19296
#4 0x000079f5e8a97b5a in start_thread (arg=
Thread 3 (Thread 0x79f5e63fa6c0 (LWP 22377) "llama-bench"):
#0 0x00005799cc04587a in __cpu_relax () at ggml/src/ggml.c:3061
#1 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142
#2 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabd820) at ggml/src/ggml.c:19206
#3 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabd820) at ggml/src/ggml.c:19296
#4 0x000079f5e8a97b5a in start_thread (arg=
Thread 2 (Thread 0x79f5e6bfb6c0 (LWP 22376) "llama-bench"):
#0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335
#1 __cpu_relax () at ggml/src/ggml.c:3060
#2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142
#3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabd5f8) at ggml/src/ggml.c:19206
#4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabd5f8) at ggml/src/ggml.c:19296
#5 0x000079f5e8a97b5a in start_thread (arg=
Thread 1 (Thread 0x79f5e9dc4c40 (LWP 22375) "llama-bench"):
#0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335
#1 __cpu_relax () at ggml/src/ggml.c:3060
#2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142
#3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabd3d0) at ggml/src/ggml.c:19206
#4 0x00005799cc0813a2 in ggml_graph_compute (cgraph=0x5799cda66998, cplan=0x7ffc9d6a28a0) at ggml/src/ggml.c:19502
#5 0x00005799cc092a1a in ggml_backend_cpu_graph_compute (backend=0x5799cda872b0, cgraph=0x5799cda66998) at ggml/src/ggml-backend.c:817
#6 0x00005799cc091807 in ggml_backend_graph_compute_async (backend=0x5799cda872b0, cgraph=0x5799cda66998) at ggml/src/ggml-backend.c:282
#7 0x00005799cc0966b4 in ggml_backend_sched_compute_splits (sched=0x5799cda5f370) at ggml/src/ggml-backend.c:1805
#8 0x00005799cc0972c8 in ggml_backend_sched_graph_compute_async (sched=0x5799cda5f370, graph=0x79f5e82df030) at ggml/src/ggml-backend.c:1992
#9 0x00005799cc1295a8 in llama_graph_compute (lctx=..., gf=0x79f5e82df030, n_threads=16, threadpool=0x5799cda65990) at src/llama.cpp:14527
#10 0x00005799cc12a344 in llama_decode_internal (lctx=..., batch_all=...) at src/llama.cpp:14781
#11 0x00005799cc1382f1 in llama_decode (ctx=0x5799cda5fe80, batch=...) at src/llama.cpp:18600
#12 0x00005799cc354d9b in test_prompt (ctx=0x5799cda5fe80, n_prompt=512, n_past=0, n_batch=2048, n_threads=16) at examples/llama-bench/llama-bench.cpp:1349
#13 0x00005799cc3558a6 in main (argc=7, argv=0x7ffc9d6a3798) at examples/llama-bench/llama-bench.cpp:1485
@slaren
The BLAS backend is still important at least in macOS because Accelerate is significantly faster. OpenMP is also not available in macOS. In my opinion this is the most important use case, because other platforms have access to good implementations of OpenMP.
Got it. Makes sense for the BLAS then.
For other platforms there are several other advantages of using a dedicated threadpool vs OpenMP: things like the ability to specify affinity masks, priorities, etc. per llama.cpp/ggml instance. With OpenMP those settings are global to the process, i.e. if an app that links to libllama.so/libggml.so uses OpenMP for other stuff (say it links some other lib that uses OpenMP), then the settings conflict with each other. There are other things too, like being able to reuse threadpools between llama_ctx instances, reduced dependencies, etc.
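To illustrate the kind of per-thread control being referred to, here is a minimal Linux-only sketch (not code from this PR; the function name and the core/priority choices are just examples) of pinning a worker to a specific core and adjusting its priority, which a dedicated threadpool can do per worker while OpenMP settings stay process-wide:

```cpp
#include <pthread.h>
#include <sched.h>
#include <sys/resource.h>
#include <thread>

// Pin the calling thread to a single core and set its nice value.
// With the Linux/NPTL implementation, setpriority(PRIO_PROCESS, 0, ...)
// affects only the calling thread, so each worker can get its own priority.
static void place_worker(int cpu, int nice_value) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask); // "strict" placement
    setpriority(PRIO_PROCESS, 0, nice_value);
}

int main() {
    // Hypothetical example: two workers pinned to cores 0 and 1.
    std::thread t0([] { place_worker(0, 0); /* ... graph compute ... */ });
    std::thread t1([] { place_worker(1, 0); /* ... graph compute ... */ });
    t0.join();
    t1.join();
    return 0;
}
```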
I think I hit a deadlock when testing with
LLAMA_DEBUG=1 GGML_BLIS=1 GGML_NO_OPENMP=1 GGML_NO_LLAMAFILE=1:
./llama-bench -m models/stories260K.gguf -r 10 -t 16
Oh. Odd. I thought I tested that use-case. Will follow up asap.
@slaren
Sorry for not catching this earlier. The timing had to be just right to trigger that race.
I reproduced it with
while true; do ./llama-bench -m ../gguf/stories260K.gguf -r 10 -t 16; done on my Ryzen 9 system.
Fixed it, and let that loop run for a couple of hours to make sure.
Do you think we can merge this more or less as is and then work on extending things to accommodate the BLAS backend as a follow-up? See above for the general benefits vs OpenMP, and it speeds things up on ARM64 CPUs (see my report above). It'd be good to get Windows-on-ARM and Android releases going with the threadpool enabled. We'll definitely follow up on BLAS, and there are further ideas as well (reusing temporary pools, etc).
@slaren
Another quick question: ggml-blas.cpp is C++ and uses C++11 features like std::future when OpenMP is disabled.
Would it be OK to do Thread Pool V3 in C++? We can add some extern "C" APIs to call from ggml.c, but it'd be nice if the core threadpool logic were in C++ (with clean std::atomic, std::thread, ...). This way we could remove the pthread wrappers and such; we'd still need a few OS-specific functions for the CPU affinity and priority stuff, but the core bits would just be clean C++11.
We'd create ggml-thread.cpp and implement all the threading/CPU/NUMA-related stuff there, again with some extern "C" APIs for the rest of GGML. A rough sketch of that split is below.
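Purely illustrative (the symbol names here, including ggml_threadpool_create/release, are made up and not the actual API): ggml-thread.cpp would keep the core in C++ and expose only a thin C surface to ggml.c.

```cpp
#include <thread>
#include <vector>

// Core lives in C++ (this would be ggml-thread.cpp in the proposal).
struct ggml_threadpool_impl {
    std::vector<std::thread> workers; // std::thread instead of pthread wrappers
    // std::atomic / std::condition_variable state would live here as well
};

extern "C" {

// Opaque handle and thin C API visible to ggml.c (names are hypothetical).
typedef struct ggml_threadpool_impl * ggml_threadpool_t;

ggml_threadpool_t ggml_threadpool_create(int n_threads) {
    auto * tp = new ggml_threadpool_impl();
    tp->workers.reserve(n_threads);
    // workers would be started here and parked until graph_compute dispatches work
    return tp;
}

void ggml_threadpool_release(ggml_threadpool_t tp) {
    delete tp; // stopping/joining the workers would happen in the destructor
}

} // extern "C"
```

The C side of ggml would then only ever see the opaque handle and the extern "C" functions, while all the synchronization primitives stay in C++.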
We could do this as the followup to this current Thread Pool V2 version. Please see the question/suggestion above.
The deadlock also seems to be fixed here. I think we can merge if there aren't any significant performance regressions. I will do a more in-depth review in the following days; so far I have only looked at the performance. Using C++ would be good as long as the public ggml interface remains compatible with C; in the future we will probably continue porting parts of ggml to C++.
As an extra data point: I'm not seeing a performance regression on this branch on my EPYC system. I'm seeing a single-digit percentage speedup vs master, in fact.
@ggerganov @slaren Do you have any more suggestions/comments/concerns regarding this PR? I would suggest we merge it in and create issues to track the BLAS/BLIS improvements and/or the move to C++ synchronization primitives.
Not critical, but I noticed that there is a performance regression with partial offloading (-ngl 10) with Metal, at least with small models:
scripts/compare-commits.sh master threadpool -m models/tinyllama-1.1b-intermediate-step-480k-1t.Q8_0.gguf -m models/llama-2-7b/ggml-model-Q4_0.gguf -t 4,8 -ngl 10
| CPU | Model | Model Size [GiB] | Threads | Test | t/s master | t/s threadpool | Speedup |
|---|---|---|---|---|---|---|---|
| M3 Max | llama 1B Q8_0 | 1.09 | 4 | pp512 | 1163.31 | 1067.73 | 0.92 |
| M3 Max | llama 1B Q8_0 | 1.09 | 4 | tg128 | 102.36 | 83.20 | 0.81 |
| M3 Max | llama 1B Q8_0 | 1.09 | 8 | pp512 | 1292.88 | 1184.32 | 0.92 |
| M3 Max | llama 1B Q8_0 | 1.09 | 8 | tg128 | 104.22 | 89.58 | 0.86 |
| M3 Max | llama 7B Q4_0 | 3.56 | 4 | pp512 | 185.83 | 185.25 | 1.00 |
| M3 Max | llama 7B Q4_0 | 3.56 | 4 | tg128 | 24.28 | 24.21 | 1.00 |
| M3 Max | llama 7B Q4_0 | 3.56 | 8 | pp512 | 194.85 | 191.47 | 0.98 |
| M3 Max | llama 7B Q4_0 | 3.56 | 8 | tg128 | 30.21 | 31.09 | 1.03 |