Threadpool: take 2
ref: original PR #7526
Added an API to support explicit management and fine-grained control of threadpools. The API supports creating different threadpools for various parts of execution, e.g. batch, single-token, etc. Each threadpool can be created, paused, resumed, and released independently of any other threadpool. This mitigates the overhead of starting/stopping threads for each decode call and helps OSes keep track of scheduling history in order to make better scheduling decisions.
Each threadpool supports:
- Setting the number of threads (duh)
- Setting a CPU mask for the threads to be placed on
- Strict/relaxed placement: pinning specific threads to specific cores, or letting the OS decide
- Polling or interrupt-driven wait
- Setting thread priority

Using threadpools explicitly is optional. If llama_decode is called with a llama_context that doesn't have a threadpool attached, a disposable threadpool is created (same as the current behavior). If users choose to use threadpools explicitly, they have to manage them manually. See the example in main.cpp, and the concept sketch below.
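The actual API and a full example live in this PR (see main.cpp); the snippet below is only a minimal, self-contained C++ sketch of the underlying idea rather than the PR's real functions or signatures: a pool whose worker threads persist across decode calls and can be paused and resumed instead of being spawned and joined on every call.

```cpp
// Minimal illustration of a persistent, pausable thread pool.
// NOT the ggml/llama API from this PR -- just the concept: threads are created
// once, reused for every "decode", and paused in between instead of being joined.
#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

class PausableThreadPool {
public:
    explicit PausableThreadPool(int n_threads) {
        for (int i = 0; i < n_threads; ++i) {
            workers.emplace_back([this, i] { worker_loop(i); });
        }
    }
    ~PausableThreadPool() {
        { std::lock_guard<std::mutex> lk(m); stop = true; paused = false; }
        cv.notify_all();
        for (auto & t : workers) t.join();
    }
    // Stop burning CPU between decode calls; the threads stay alive.
    void pause()  { std::lock_guard<std::mutex> lk(m); paused = true; }
    // Wake the workers up again before the next graph compute.
    void resume() { { std::lock_guard<std::mutex> lk(m); paused = false; } cv.notify_all(); }
    // Hand one piece of work to every worker and wait for all of them to finish.
    void run(std::function<void(int)> fn) {
        { std::lock_guard<std::mutex> lk(m); work = std::move(fn); ++generation; pending = (int) workers.size(); }
        cv.notify_all();
        std::unique_lock<std::mutex> lk(m);
        cv_done.wait(lk, [this] { return pending == 0; });
    }
private:
    void worker_loop(int id) {
        int seen = 0;
        for (;;) {
            std::function<void(int)> fn;
            {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [&] { return stop || (!paused && generation != seen); });
                if (stop) return;
                seen = generation;
                fn = work;
            }
            fn(id);
            std::lock_guard<std::mutex> lk(m);
            if (--pending == 0) cv_done.notify_one();
        }
    }
    std::vector<std::thread> workers;
    std::mutex m;
    std::condition_variable cv, cv_done;
    std::function<void(int)> work;
    int generation = 0, pending = 0;
    bool paused = false, stop = false;
};

int main() {
    PausableThreadPool pool(4);       // created once, reused for every "decode"
    for (int step = 0; step < 3; ++step) {
        pool.resume();
        pool.run([step](int id) { std::printf("step %d: worker %d\n", step, id); });
        pool.pause();                 // threads stay alive but idle until the next step
    }
}
```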
With all the bells and whistles enabled, we generally see a minor improvement vs OMP. Without polling, threadpool runs on par with OMP.
- [x] I have read the contributing guidelines
- Self-reported review complexity:
- [ ] Low
- [x] Medium
- [ ] High
Here are some perf figures:
On a Xeon W-2225 machine (CPU backend):
| CPU | Model | Test | t/s master | t/s threadpool | Speedup |
|---|---|---|---|---|---|
| Intel(R) Xeon(R) W-2225 CPU @ 4.10GHz | llama 7B Q4_0 | pp512 | 17.46 | 17.51 | 1.00 |
| Intel(R) Xeon(R) W-2225 CPU @ 4.10GHz | llama 7B Q4_0 | tg128 | 6.98 | 7.06 | 1.01 |
Intel 10th-gen CPU: ./scripts/compare-commits.sh master threadpool -t 1,2,4,6,8,10
| CPU | Model | Threads | Test | t/s master | t/s threadpool-attempt-2 | Speedup |
|---|---|---|---|---|---|---|
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 1 | pp512 | 3.93 | 3.94 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 1 | tg128 | 2.43 | 2.44 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 2 | pp512 | 7.13 | 7.06 | 0.99 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 2 | tg128 | 4.37 | 4.36 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 4 | pp512 | 11.96 | 11.99 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 4 | tg128 | 6.79 | 6.77 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 6 | pp512 | 14.96 | 14.98 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 6 | tg128 | 7.51 | 7.53 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 8 | pp512 | 13.06 | 13.09 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 8 | tg128 | 6.88 | 6.83 | 0.99 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 10 | pp512 | 14.08 | 14.06 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 10 | tg128 | 7.49 | 7.52 | 1.00 |
Mobile NVIDIA 3060: $ LLAMA_CUDA=1 ./scripts/compare-commits.sh master threadpool -nkvo 0,1
| GPU | Model | NKVO | Test | t/s master | t/s threadpool-attempt-2 | Speedup |
|---|---|---|---|---|---|---|
| RTX 3060 Laptop GPU | llama 7B Q4_0 | No | pp512 | 1644.73 | 1642.34 | 1.00 |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | No | tg128 | 65.94 | 65.89 | 1.00 |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | Yes | pp512 | 287.28 | 286.44 | 1.00 |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | Yes | tg128 | 54.56 | 54.32 | 1.00 |
@slaren Threadpool is back! I updated it a bit to align with the latest graph-compute design. The current performance is largely on par with OpenMP. Please let me know if you have any comments/suggestions.
I tried to test this on macOS, but it seems to deadlock.
WARNING: ThreadSanitizer: data race (pid=62377)
Write of size 1 at 0x00010ab02a8e by main thread:
#0 ggml_graph_compute ggml.c:19365 (llama-bench:arm64+0x10003fb54)
#1 ggml_backend_cpu_graph_compute ggml-backend.c:822 (llama-bench:arm64+0x1000a5f1c)
#2 ggml_backend_graph_compute_async ggml-backend.c:282 (llama-bench:arm64+0x10009bac0)
#3 ggml_backend_sched_compute_splits ggml-backend.c:1795 (llama-bench:arm64+0x1000a3190)
#4 ggml_backend_sched_graph_compute_async ggml-backend.c:1979 (llama-bench:arm64+0x1000a2d24)
#5 llama_graph_compute(llama_context&, ggml_cgraph*, int, ggml_compute_threadpool*) llama.cpp:14412 (llama-bench:arm64+0x100292cac)
#6 llama_decode_internal(llama_context&, llama_batch) llama.cpp:14666 (llama-bench:arm64+0x1000fda4c)
#7 llama_decode llama.cpp:18489 (llama-bench:arm64+0x1000fc460)
#8 test_prompt(llama_context*, int, int, int, int) llama-bench.cpp:1319 (llama-bench:arm64+0x10062bd5c)
#9 main llama-bench.cpp:1454 (llama-bench:arm64+0x100627180)
Previous read of size 1 at 0x00010ab02a8e by thread T12 (mutexes: write M0):
#0 ggml_graph_compute_check_for_work ggml.c:19152 (llama-bench:arm64+0x100053a10)
#1 ggml_graph_compute_secondary_thread ggml.c:19189 (llama-bench:arm64+0x1000537dc)
Location is heap block of size 192 at 0x00010ab02a00 allocated by main thread:
#0 posix_memalign <null>:96112260 (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x564c0)
#1 ggml_aligned_malloc ggml.c:241 (llama-bench:arm64+0x10001ac88)
#2 ggml_create_threadpool_impl ggml.c:19214 (llama-bench:arm64+0x10003f14c)
#3 ggml_create_threadpool ggml.c:19292 (llama-bench:arm64+0x10003f0b8)
#4 main llama-bench.cpp:1418 (llama-bench:arm64+0x100626cd0)
Mutex M0 (0x00010ab02a00) created at:
#0 pthread_mutex_init <null>:96112260 (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x31470)
#1 ggml_create_threadpool_impl ggml.c:19238 (llama-bench:arm64+0x10003f404)
#2 ggml_create_threadpool ggml.c:19292 (llama-bench:arm64+0x10003f0b8)
#3 main llama-bench.cpp:1418 (llama-bench:arm64+0x100626cd0)
Thread T12 (tid=36579987, running) created by main thread at:
#0 pthread_create <null>:96112260 (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x3062c)
#1 ggml_create_threadpool_impl ggml.c:19277 (llama-bench:arm64+0x10003f638)
#2 ggml_create_threadpool ggml.c:19292 (llama-bench:arm64+0x10003f0b8)
#3 main llama-bench.cpp:1418 (llama-bench:arm64+0x100626cd0)
SUMMARY: ThreadSanitizer: data race ggml.c:19365 in ggml_graph_compute
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
* frame #0: 0x0000000104baf664 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3018:29
frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x000000012481fc00) at ggml.c:19132:5
frame #2: 0x0000000104ba17ec llama-bench`ggml_graph_compute(cgraph=0x00000001182901b8, cplan=0x000000016b28a730) at ggml.c:19373:5
frame #3: 0x0000000104be8400 llama-bench`ggml_backend_cpu_graph_compute(backend=0x00006000039680b0, cgraph=0x00000001182901b8) at ggml-backend.c:822:12
frame #4: 0x0000000104be23c4 llama-bench`ggml_backend_graph_compute_async(backend=0x00006000039680b0, cgraph=0x00000001182901b8) at ggml-backend.c:282:12
frame #5: 0x0000000104be6834 llama-bench`ggml_backend_sched_compute_splits(sched=0x0000000115000000) at ggml-backend.c:1795:35
frame #6: 0x0000000104be65a0 llama-bench`ggml_backend_sched_graph_compute_async(sched=0x0000000115000000, graph=0x0000000118420020) at ggml-backend.c:1979:12
frame #7: 0x0000000104d09f44 llama-bench`llama_graph_compute(lctx=0x0000000114813e00, gf=0x0000000118420020, n_threads=12, threadpool=0x0000600003b6c3c0) at llama.cpp:14412:5
frame #8: 0x0000000104c2b148 llama-bench`llama_decode_internal(lctx=0x0000000114813e00, batch_all=llama_batch @ 0x000000016b28ac60) at llama.cpp:14666:9
frame #9: 0x0000000104c2a15c llama-bench`llama_decode(ctx=0x0000000114813e00, batch=llama_batch @ 0x000000016b28ad08) at llama.cpp:18489:21
frame #10: 0x0000000104f3ecbc llama-bench`test_prompt(ctx=0x0000000114813e00, n_prompt=32, n_past=0, n_batch=2048, n_threads=12) at llama-bench.cpp:1319:9
frame #11: 0x0000000104f3ae44 llama-bench`main(argc=9, argv=0x000000016b28b940) at llama-bench.cpp:1454:13
frame #12: 0x000000018fbae0e0 dyld`start + 2360
thread #2
frame #0: 0x0000000104baf664 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3018:29
frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x000000012481fe20) at ggml.c:19132:5
frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x000000012481fe20) at ggml.c:19191:37
frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #3
frame #0: 0x0000000104baf690 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3019:54
frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124820040) at ggml.c:19132:5
frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124820040) at ggml.c:19191:37
frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #4
frame #0: 0x0000000104baf664 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3018:29
frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124820260) at ggml.c:19132:5
frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124820260) at ggml.c:19191:37
frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #5
frame #0: 0x0000000104baf690 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3019:54
frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124820480) at ggml.c:19132:5
frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124820480) at ggml.c:19191:37
frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #6
frame #0: 0x0000000104baf690 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3019:54
frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x00000001248206a0) at ggml.c:19132:5
frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001248206a0) at ggml.c:19191:37
frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #7
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x0000000104baecf4 llama-bench`ggml_graph_compute_check_for_work(state=0x00000001248208c0) at ggml.c:19154:17
frame #3: 0x0000000104baeaf8 llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001248208c0) at ggml.c:19189:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #8
frame #0: 0x0000000104baf664 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3018:29
frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124820ae0) at ggml.c:19132:5
frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124820ae0) at ggml.c:19191:37
frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #9
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x0000000104baecf4 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000124820d00) at ggml.c:19154:17
frame #3: 0x0000000104baeaf8 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124820d00) at ggml.c:19189:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #10
frame #0: 0x0000000104baf664 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3018:29
frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124820f20) at ggml.c:19132:5
frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124820f20) at ggml.c:19191:37
frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #11
frame #0: 0x0000000104baf664 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3018:29
frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124821140) at ggml.c:19132:5
frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124821140) at ggml.c:19191:37
frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #12
frame #0: 0x0000000104baf690 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3019:54
frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124821360) at ggml.c:19132:5
frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124821360) at ggml.c:19191:37
frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #13
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x0000000104baecf4 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000124821580) at ggml.c:19154:17
frame #3: 0x0000000104baeaf8 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124821580) at ggml.c:19189:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #14
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x0000000104baecf4 llama-bench`ggml_graph_compute_check_for_work(state=0x00000001248217a0) at ggml.c:19154:17
frame #3: 0x0000000104baeaf8 llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001248217a0) at ggml.c:19189:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #15
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x0000000104baecf4 llama-bench`ggml_graph_compute_check_for_work(state=0x00000001248219c0) at ggml.c:19154:17
frame #3: 0x0000000104baeaf8 llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001248219c0) at ggml.c:19189:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #16
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x0000000104baecf4 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000124821be0) at ggml.c:19154:17
frame #3: 0x0000000104baeaf8 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124821be0) at ggml.c:19189:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #17
frame #0: 0x000000018fef7ea4 libsystem_kernel.dylib`__workq_kernreturn + 8
> I tried to test this on macOS, but it seems to deadlock.
Fixed!
On M2 Max (GGML_NO_METAL=1 GGML_NO_ACCELERATE=1):
| Model | Threads | Test | t/s master | t/s threadpool | Speedup |
|---|---|---|---|---|---|
| llama 7B Q4_0 | 4 | pp512 | 32.97 | 34.87 | 1.06 |
| llama 7B Q4_0 | 4 | tg128 | 18.01 | 18.37 | 1.02 |
| llama 7B Q4_0 | 6 | pp512 | 47.43 | 48.99 | 1.03 |
| llama 7B Q4_0 | 6 | tg128 | 23.10 | 23.32 | 1.01 |
| llama 7B Q4_0 | 8 | pp512 | 49.90 | 55.17 | 1.11 |
| llama 7B Q4_0 | 8 | tg128 | 18.09 | 21.98 | 1.22 |
| llama 7B Q4_0 | 10 | pp512 | 52.50 | 56.69 | 1.08 |
| llama 7B Q4_0 | 10 | tg128 | 14.24 | 8.54 | 0.60 |
| llama 7B Q4_0 | 12 | pp512 | 56.37 | 56.93 | 1.01 |
| llama 7B Q4_0 | 12 | tg128 | 5.02 | 9.44 | 1.88 |
Same thing, but with llama-v3 8B Q4_0_4_4 (for some reason my compiler, AppleClang 15, doesn't support INT8 matmul?)
| Model | Threads | Test | t/s master | t/s threadpool | Speedup |
|---|---|---|---|---|---|
| llama 8B Q4_0_4_4 | 4 | pp512 | 72.44 | 72.83 | 1.01 |
| llama 8B Q4_0_4_4 | 4 | tg128 | 22.29 | 23.50 | 1.05 |
| llama 8B Q4_0_4_4 | 6 | pp512 | 98.71 | 100.21 | 1.02 |
| llama 8B Q4_0_4_4 | 6 | tg128 | 24.63 | 24.44 | 0.99 |
| llama 8B Q4_0_4_4 | 8 | pp512 | 95.86 | 116.17 | 1.21 |
| llama 8B Q4_0_4_4 | 8 | tg128 | 21.19 | 26.28 | 1.24 |
| llama 8B Q4_0_4_4 | 10 | pp512 | 102.37 | 105.18 | 1.03 |
| llama 8B Q4_0_4_4 | 10 | tg128 | 18.63 | 16.98 | 0.91 |
| llama 8B Q4_0_4_4 | 12 | pp512 | 108.08 | 101.18 | 0.94 |
| llama 8B Q4_0_4_4 | 12 | tg128 | 6.22 | 11.39 | 1.83 |
If it crashes, can the error message include "deadpool"?
@slaren lmk if it works for you this time
I tested this again on the M3 Max, but it still seems to deadlock. These are the call stacks of the threads:
(lldb) bt all
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
* frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x000000013681fc00) at ggml.c:19133:5
frame #2: 0x0000000102383190 llama-bench`ggml_graph_compute(cgraph=0x00000001085f81c8, cplan=0x000000016daa2730) at ggml.c:19374:5
frame #3: 0x00000001023c3394 llama-bench`ggml_backend_cpu_graph_compute(backend=0x00006000025ec0b0, cgraph=0x00000001085f81c8) at ggml-backend.c:822:12
frame #4: 0x00000001023bd840 llama-bench`ggml_backend_graph_compute_async(backend=0x00006000025ec0b0, cgraph=0x00000001085f81c8) at ggml-backend.c:282:12
frame #5: 0x00000001023c1864 llama-bench`ggml_backend_sched_compute_splits(sched=0x000000010680c400) at ggml-backend.c:1800:35
frame #6: 0x00000001023c15a0 llama-bench`ggml_backend_sched_graph_compute_async(sched=0x000000010680c400, graph=0x00000001081c0020) at ggml-backend.c:1987:12
frame #7: 0x00000001024e0b58 llama-bench`llama_graph_compute(lctx=0x000000010680fa00, gf=0x00000001081c0020, n_threads=12, threadpool=0x00006000027e43c0) at llama.cpp:14425:5
frame #8: 0x0000000102404938 llama-bench`llama_decode_internal(lctx=0x000000010680fa00, batch_all=llama_batch @ 0x000000016daa2c60) at llama.cpp:14679:9
frame #9: 0x0000000102403a9c llama-bench`llama_decode(ctx=0x000000010680fa00, batch=llama_batch @ 0x000000016daa2d08) at llama.cpp:18499:21
frame #10: 0x0000000102712eac llama-bench`test_prompt(ctx=0x000000010680fa00, n_prompt=32, n_past=0, n_batch=2048, n_threads=12) at llama-bench.cpp:1319:9
frame #11: 0x000000010270f0b8 llama-bench`main(argc=9, argv=0x000000016daa3940) at llama-bench.cpp:1454:13
frame #12: 0x000000018fbae0e0 dyld`start + 2360
thread #2
frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x000000013681fe20) at ggml.c:19133:5
frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x000000013681fe20) at ggml.c:19192:37
frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #3
frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x0000000136820040) at ggml.c:19133:5
frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820040) at ggml.c:19192:37
frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #4
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136820260) at ggml.c:19155:17
frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820260) at ggml.c:19190:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #5
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136820480) at ggml.c:19155:17
frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820480) at ggml.c:19190:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #6
frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x00000001368206a0) at ggml.c:19133:5
frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001368206a0) at ggml.c:19192:37
frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #7
frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x00000001368208c0) at ggml.c:19133:5
frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001368208c0) at ggml.c:19192:37
frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #8
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136820ae0) at ggml.c:19155:17
frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820ae0) at ggml.c:19190:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #9
frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x0000000136820d00) at ggml.c:19133:5
frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820d00) at ggml.c:19192:37
frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #10
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136820f20) at ggml.c:19155:17
frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820f20) at ggml.c:19190:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #11
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136821140) at ggml.c:19155:17
frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136821140) at ggml.c:19190:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #12
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136821360) at ggml.c:19155:17
frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136821360) at ggml.c:19190:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #13
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136821580) at ggml.c:19155:17
frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136821580) at ggml.c:19190:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #14
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x00000001368217a0) at ggml.c:19155:17
frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001368217a0) at ggml.c:19190:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #15
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x00000001368219c0) at ggml.c:19155:17
frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001368219c0) at ggml.c:19190:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #16
frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136821be0) at ggml.c:19155:17
frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136821be0) at ggml.c:19190:25
frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
thread #17
frame #0: 0x000000018fef7ea4 libsystem_kernel.dylib`__workq_kernreturn + 8
Built with `LLAMA_DEBUG=1 GGML_NO_METAL=1 make llama-bench && ./llama-bench -m models/llama-2-7b/ggml-model-Q4_0.gguf -n 0 -r 1 -p 32`.
> I tested this again on the M3 Max, but it still seems to deadlock.
Bummer... Thanks for the details! Looks like we've got some trouble in the "ACCELERATE" path. I'll fix it ASAP.
@slaren Turns out there was a bit of a corner case: if you have a graph with only 1 node, ggml_barrier and wait_for_work deadlock on each other. Added a check to handle that specific case.
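To make the failure mode concrete, here is a minimal, self-contained C++ sketch of how this kind of deadlock can arise. It is not the actual ggml code, just the same shape: workers parked in a wait-for-work loop, every thread expected at a barrier, and a code path that could reach the barrier without ever announcing work.

```cpp
// Self-contained sketch of how this kind of deadlock can arise (not the actual
// ggml code). Workers park in a wait-for-work loop and only move on when new
// work is announced; afterwards every thread (main + workers) must reach a
// barrier. If the announce step is skipped for a trivial graph, main waits in
// the barrier for workers that were never woken.
#include <atomic>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

struct Pool {
    std::mutex m;
    std::condition_variable cv;
    int generation = 0;            // bumped when new work is announced
    int n_threads  = 0;            // barrier participants (main + workers)
    std::atomic<int> arrived{0};   // barrier counter
};

// Naive one-shot counting barrier: block until all n_threads have arrived.
static void barrier(Pool & p) {
    p.arrived.fetch_add(1);
    while (p.arrived.load(std::memory_order_acquire) < p.n_threads) {
        std::this_thread::yield();
    }
}

static void worker(Pool & p) {
    int seen = 0;
    {
        std::unique_lock<std::mutex> lk(p.m);
        // wait-for-work: parked until a new generation of work is announced
        p.cv.wait(lk, [&] { return p.generation != seen; });
        seen = p.generation;
    }
    barrier(p);                    // only woken threads ever reach the barrier
}

int main() {
    Pool p;
    const int n_workers = 3;
    const int n_nodes   = 1;       // the problematic case: a graph with a single node
    p.n_threads = n_workers + 1;   // workers + the main thread

    std::vector<std::thread> ts;
    for (int i = 0; i < n_workers; ++i) ts.emplace_back(worker, std::ref(p));

    // The buggy shape of the corner case: for a one-node graph the announce
    // below was effectively skipped, so main entered barrier(p) alone while
    // the workers stayed parked in wait-for-work. The fix is to keep the two
    // sides consistent: always announce the work, or bypass the barrier
    // entirely when a single thread runs the whole graph.
    {
        std::lock_guard<std::mutex> lk(p.m);
        p.generation++;            // announce work so the workers reach the barrier too
    }
    p.cv.notify_all();

    barrier(p);                    // main arrives as the last participant
    for (auto & t : ts) t.join();
    std::printf("computed a %d-node graph without deadlocking\n", n_nodes);
}
```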
Thanks, I was able to run it now. Unfortunately the results are still not very good on my system. Under WSL this threadpool is much slower than OpenMP. A threadpool would be more important on macOS, since OpenMP is not available there, but for me it is also slower on the M3 Max.
M3 Max:
GGML_NO_METAL=1 scripts/compare-commits.sh master threadpool -m models/llama-2-7b/ggml-model-Q4_0.gguf
| CPU | Model | Model Size [GiB] | Test | t/s master | t/s threadpool | Speedup |
|---|---|---|---|---|---|---|
| M3 Max | llama 7B Q4_0 | 3.56 | pp512 | 151.21 | 149.88 | 0.99 |
| M3 Max | llama 7B Q4_0 | 3.56 | tg128 | 30.06 | 26.09 | 0.87 |
13900k + 3090Ti:
OpenMP (GGML_CUDA=1 make llama-bench && ./llama-bench -nkvo 0,1)
| model | size | params | backend | ngl | nkvo | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | pp512 | 5699.53 ± 19.73 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | tg128 | 150.75 ± 1.23 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp512 | 651.63 ± 32.31 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | tg128 | 63.85 ± 3.22 |
Threadpool (GGML_CUDA=1 GGML_NO_OPENMP=1 make llama-bench && ./llama-bench -nkvo 0,1)
| model | size | params | backend | ngl | nkvo | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | pp512 | 5453.33 ± 216.72 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | tg128 | 144.45 ± 0.98 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp512 | 566.43 ± 27.64 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | tg128 | 29.54 ± 0.99 |
build: bebe99c2 (3500)
Ooof... That is quite a bit slower. I'll try to replicate this locally.
@fmz @slaren
I fixed one of the issues that was causing regressions. We were setting the default number of threads in the threadpool using std::thread::hardware_concurrency(). I updated that to use cpu_get_num_math(), so that we exclude E-cores and Hyper-Threading siblings.
This is what was causing regressions with the default command-line args, where the number of threads is not explicitly specified.
We were starting 12 threads on the M2 Max, where only 8 cores are really usable; the same happened on AMD EPYC (using SMT siblings) and Intel 13th/14th Gen (using E-cores).
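For scale, a toy illustration of the difference (the math-core count is hard-coded here as an assumption for the M2 Max example above; the real cpu_get_num_math() in llama.cpp derives it from the CPU topology):

```cpp
// Toy comparison of the two defaults; the math-core count is a hard-coded
// stand-in for cpu_get_num_math() (assumed value for an M2 Max: 8 P-cores).
#include <algorithm>
#include <cstdio>
#include <thread>

int main() {
    // Old default: every logical CPU, including SMT siblings and E-cores.
    const unsigned all_logical = std::thread::hardware_concurrency(); // e.g. 12 on M2 Max

    // New default (conceptually): only the cores worth giving a compute thread.
    const unsigned math_cores = 8;

    const unsigned n_threads = std::min(all_logical, math_cores);
    std::printf("old default: %u threads, new default: %u threads\n", all_logical, n_threads);
}
```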
I'm also working on another fix, specific to llama-bench. Currently (in the threadpool branch) we start a single threadpool with the max number of threads and reuse it for each test. Suppose a test uses 4 threads but we started 12 (on M2 Max or Snapdragon X-Elite): this is suboptimal because the spinning threads interfere with core boosting and such. It's better to start a fresh threadpool for each test.
@fmz @slaren llama-bench has been updated as I described above.
Here are the numbers from M2 Max. I'll share numbers for an AMD EPYC server, Snapdragon X-Elite and Gen-3 a bit later.
CC=clang CXX=clang++ CFLAGS="-march=armv8.7-a" CXXFLAGS="-march=armv8.7-a" GGML_NO_METAL=1 GGML_NO_ACCELERATE=1 \
./scripts/compare-commits.sh master threadpool -m ../gguf/llama-v3.1.q4_0_4_8.gguf -ngl 0 -t 4,6,8
...
+ ./scripts/compare-llama-bench.py -b master -c threadpool
| CPU | Model | Threads | Test | t/s master | t/s threadpool | Speedup |
|:------|:------------------|----------:|:-------|-------------:|-----------------:|----------:|
| | llama 8B Q4_0_4_8 | 4 | pp512 | 64.43 | 64.52 | 1.00 |
| | llama 8B Q4_0_4_8 | 4 | tg128 | 22.53 | 24.36 | 1.08 |
| | llama 8B Q4_0_4_8 | 6 | pp512 | 89.79 | 91.04 | 1.01 |
| | llama 8B Q4_0_4_8 | 6 | tg128 | 24.73 | 26.21 | 1.06 |
| | llama 8B Q4_0_4_8 | 8 | pp512 | 117.14 | 118.67 | 1.01 |
| | llama 8B Q4_0_4_8 | 8 | tg128 | 26.11 | 26.37 | 1.01 |
The performance looks better now; with nkvo it is comparable to OpenMP, which is very good. There is still a performance drop when using the BLAS backend (this includes the default build on macOS, which uses Accelerate). I suspect that this is because the threads are spinning in ggml_graph_compute_check_for_work while the BLAS backend is running. This will also cause the threads to spin while the GPU backend is running when partially offloading, which would be a regression. Rather than requiring the user to disable polling manually, I suggest implementing some kind of backoff and yielding the threads after spinning for a while. The BLAS backend (ggml-blas.cpp) may also benefit from using the threadpool, since it launches several threads to dequantize the weights, and it could also automatically pause the pool during the call to the BLAS library.
Results
GGML_CUDA=1 make llama-bench > /dev/null && ./llama-bench -nkvo 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | nkvo | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | pp512 | 5689.32 ± 13.35 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | tg128 | 154.53 ± 1.04 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp512 | 643.28 ± 31.69 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | tg128 | 64.27 ± 2.21 |
build: 267bf570 (3554)
GGML_CUDA=1 GGML_NO_OPENMP=1 make llama-bench > /dev/null && ./llama-bench -nkvo 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | nkvo | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | pp512 | 5674.51 ± 37.77 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | tg128 | 153.30 ± 0.48 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp512 | 646.42 ± 32.41 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | tg128 | 62.98 ± 2.94 |
build: 267bf570 (3554)
GGML_BLIS=1 make llama-bench > /dev/null && ./llama-bench -p 128 -n 32
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | BLAS | 16 | pp128 | 47.55 ± 0.17 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | BLAS | 16 | tg32 | 20.79 ± 0.10 |
build: 267bf570 (3554)
GGML_BLIS=1 GGML_NO_OPENMP=1 make llama-bench > /dev/null && ./llama-bench -p 128 -n 32
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | BLAS | 16 | pp128 | 33.47 ± 0.48 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | BLAS | 16 | tg32 | 20.58 ± 0.07 |
build: 267bf570 (3554)
| CPU | Model | Threads | Test | t/s master | t/s threadpool | Speedup |
|---|---|---|---|---|---|---|
| M3 Max | llama 7B all F32 | 4 | pp512 | 150.03 | 134.77 | 0.90 |
| M3 Max | llama 7B all F32 | 4 | tg128 | 4.76 | 4.20 | 0.88 |
| M3 Max | llama 7B all F32 | 8 | pp512 | 155.66 | 115.40 | 0.74 |
| M3 Max | llama 7B all F32 | 8 | tg128 | 4.76 | 4.35 | 0.91 |
| M3 Max | llama 7B all F32 | 12 | pp512 | 156.19 | 94.43 | 0.60 |
| M3 Max | llama 7B all F32 | 12 | tg128 | 4.66 | 4.33 | 0.93 |
| M3 Max | llama 7B Q4_0 | 4 | pp512 | 142.43 | 144.89 | 1.02 |
| M3 Max | llama 7B Q4_0 | 4 | tg128 | 21.04 | 20.74 | 0.99 |
| M3 Max | llama 7B Q4_0 | 8 | pp512 | 150.08 | 142.22 | 0.95 |
| M3 Max | llama 7B Q4_0 | 8 | tg128 | 28.22 | 28.14 | 1.00 |
| M3 Max | llama 7B Q4_0 | 12 | pp512 | 150.55 | 120.62 | 0.80 |
| M3 Max | llama 7B Q4_0 | 12 | tg128 | 30.10 | 30.26 | 1.01 |
| M3 Max | stories260K | 4 | pp512 | 52491.62 | 65492.68 | 1.25 |
| M3 Max | stories260K | 4 | tg128 | 8417.80 | 12262.68 | 1.46 |
| M3 Max | stories260K | 8 | pp512 | 59893.07 | 94300.47 | 1.57 |
| M3 Max | stories260K | 8 | tg128 | 3746.70 | 5639.87 | 1.51 |
| M3 Max | stories260K | 12 | pp512 | 53756.90 | 115958.90 | 2.16 |
| M3 Max | stories260K | 12 | tg128 | 2507.28 | 4333.34 | 1.73 |
@slaren
Awesome! Thanks for checking out the latest. We've been doing lots of profiling and tuning. Every time I'm about to send an updated perf report on Snapdragons and the M2, I find yet another thing to improve :) In my testing we're doing really well with the CPU backend (especially on ARM64-based systems); with other backends, as you pointed out, the spinning threads sometimes get in the way and cause regressions. I'll try your suggestions.
btw, we might just flip the default back to non-polling. Technically, polling is mostly useful for llama-bench, to match OpenMP behavior/numbers in that case. When I looked at the original profiles, I saw that the threadpool was doing a lot more context switches than OpenMP during the token-gen test. Polling removes those context switches and we get even better numbers now. It might make sense to make that a bit of a special case (i.e. default to polling for the CPU-backend bench, otherwise default to non-polling), or use some hybrid approach as you suggested.
@slaren @fmz
I managed to further improve the threadpool signaling (reducing the number of wake-ups, etc.) and also introduced a hybrid polling mode, which is now the default.
--poll now sets the polling level, i.e. how aggressively we poll: 0 means no polling, 1 means roughly 128K polling rounds before falling back to a cond wait, 2 means 2x128K rounds, and so on. The default is 50, which seems to work well on the machines I have here (see the report).
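A self-contained sketch of what such a hybrid wait can look like (names and constants are illustrative, not the exact ggml implementation): spin on an atomic for roughly poll_level x 128K rounds, then fall back to a condition-variable wait so idle threads stop burning CPU.

```cpp
// Hybrid polling sketch: spin for a bounded number of rounds, then sleep on a
// condition variable. Poll level 0 = never spin; higher levels spin longer.
// Names and constants are illustrative, not the exact ggml implementation.
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <cstdio>
#include <mutex>
#include <thread>

struct WorkQueue {
    std::atomic<uint64_t> generation{0}; // bumped by the producer when work arrives
    std::mutex m;
    std::condition_variable cv;
};

// Wait until `generation` moves past `seen`: poll first, sleep afterwards.
static uint64_t hybrid_wait(WorkQueue & q, uint64_t seen, int poll_level) {
    const uint64_t spin_budget = (uint64_t) poll_level * 128 * 1024;
    for (uint64_t i = 0; i < spin_budget; ++i) {
        const uint64_t g = q.generation.load(std::memory_order_acquire);
        if (g != seen) return g;             // work showed up while polling
    }
    // Polling budget exhausted: go to sleep until the producer notifies us.
    std::unique_lock<std::mutex> lk(q.m);
    q.cv.wait(lk, [&] { return q.generation.load() != seen; });
    return q.generation.load();
}

int main() {
    WorkQueue q;
    std::thread consumer([&] {
        uint64_t seen = 0;
        for (int i = 0; i < 3; ++i) {
            seen = hybrid_wait(q, seen, /*poll_level=*/1);
            std::printf("got work #%llu\n", (unsigned long long) seen);
        }
    });
    for (int i = 0; i < 3; ++i) {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
        { std::lock_guard<std::mutex> lk(q.m); q.generation.fetch_add(1); }
        q.cv.notify_all();                   // wakes waiters that already went to sleep
    }
    consumer.join();
}
```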
The regression with the Metal backend should be fixed now (see the report below).
The BLIS backend will need some further tuning. Though I wonder how useful it is, given how much slower it is compared to the plain CPU backend with the latest CPU features.
I included the latest llama-bench and a few simple llama-cli results for M2 Max, AMD EPYC 7543, Snapdragon X-Elite and Snapdragon Gen 3 with Llama v3.1 8B and a smaller Llama-based 314M model (generated using https://arxiv.org/pdf/2403.00858)
Results
M2 Max (default build)
make clean; make llama-bench
~/src/llama.cpp-master$ ./llama-bench -m ../gguf/llama-v3.1.q4_0.gguf -p 128 -n 32
| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Metal | 99 | pp128 | 469.51 ± 1.07 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Metal | 99 | tg32 | 58.92 ± 0.22 |
build: 3071c0a5 (3557)
~/src/llama.cpp-threadpool$ ./llama-bench -m ../gguf/llama-v3.1.q4_0.gguf -p 128 -n 32
| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Metal | 99 | pp128 | 470.04 ± 0.25 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Metal | 99 | tg32 | 58.78 ± 0.18 |
build: 323181f2 (3573)
M2 Max (llvm build to enable MATMUL_INT8)
CC=clang CXX=clang++ CFLAGS="-march=armv8.7-a" CXXFLAGS="-march=armv8.7-a" GGML_NO_METAL=1 GGML_NO_ACCELERATE=1 make -j32 llama-bench llama-cli
Note: Q4_0_4_X is broken with ACCELERATE, but BLIS and ACCELERATE are much slower anyway.
~/src/llama.cpp-master$ ./llama-bench -m ../gguf/llama-v3.1.q4_0_4_8.gguf -t 4,6,8
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 4 | pp512 | 63.69 ± 0.37 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 4 | tg128 | 22.51 ± 0.11 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 6 | pp512 | 90.60 ± 1.19 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 6 | tg128 | 24.73 ± 0.06 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 8 | pp512 | 112.72 ± 2.76 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 8 | tg128 | 25.21 ± 0.86 |
build: 3071c0a5 (3557)
~/src/llama.cpp-threadpool$ ./llama-bench -m ../gguf/llama-v3.1.q4_0_4_8.gguf -t 4,6,8
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 4 | pp512 | 65.59 ± 0.70 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 4 | tg128 | 24.80 ± 0.13 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 6 | pp512 | 92.72 ± 1.75 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 6 | tg128 | 26.12 ± 0.12 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 8 | pp512 | 116.72 ± 1.33 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 8 | tg128 | 26.34 ± 0.05 |
build: 323181f2 (3573)
~/src/llama.cpp-master$ ./llama-bench -m ../gguf/llama-v3-314m.q4_0_4_8.gguf -t 4,6,8
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 4 | pp512 | 6216.28 ± 206.77 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 4 | tg128 | 405.77 ± 0.85 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 6 | pp512 | 9223.23 ± 105.64 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 6 | tg128 | 544.30 ± 0.48 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 8 | pp512 | 12073.97 ± 76.67 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 8 | tg128 | 616.44 ± 1.81 |
build: 3071c0a5 (3557)
~/src/llama.cpp-threadpool$ ./llama-bench -m ../gguf/llama-v3-314m.q4_0_4_8.gguf -t 4,6,8
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 4 | pp512 | 6680.06 ± 70.44 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 4 | tg128 | 503.61 ± 1.44 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 6 | pp512 | 9431.40 ± 13.32 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 6 | tg128 | 638.16 ± 4.39 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 8 | pp512 | 12292.80 ± 40.62 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 8 | tg128 | 674.69 ± 12.01 |
build: 323181f2 (3573)
M2 Max (BLIS backend)
~/src/llama.cpp-master$ ./llama-bench -m ../gguf/llama-v3.1.q4_0.gguf -p 64 -n 16
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | BLAS | 8 | pp64 | 48.69 ± 0.08 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | BLAS | 8 | tg16 | 24.68 ± 0.10 |
build: 3071c0a5 (3557)
~/src/llama.cpp-threadpool$ ./llama-bench -m ../gguf/llama-v3.1.q4_0.gguf -p 64 -n 16
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | BLAS | 8 | pp64 | 35.90 ± 0.97 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | BLAS | 8 | tg16 | 24.68 ± 0.32 |
build: 323181f2 (3573)
~/src/llama.cpp-threadpool$ ./llama-bench -m ../gguf/llama-v3.1.q4_0.gguf -p 64 -n 16 --poll 0
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | BLAS | 8 | pp64 | 46.44 ± 2.06 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | BLAS | 8 | tg16 | 24.81 ± 0.03 |
build: 323181f2 (3573)
AMD EPYC (default build)
Note: Q4_K has the best perf on the EPYC.
make -j32 llama-bench
llama.cpp-master$ ./llama-bench -m ../gguf/llama-v3.1-8B.q4_k_s-pure.gguf -t 16,32,64 -p 64 -n 16
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | CPU | 16 | pp64 | 63.11 ± 0.14 |
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | CPU | 16 | tg16 | 18.82 ± 0.88 |
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | CPU | 32 | pp64 | 116.73 ± 3.24 |
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | CPU | 32 | tg16 | 22.86 ± 0.71 |
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | CPU | 64 | pp64 | 141.81 ± 0.55 |
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | CPU | 64 | tg16 | 21.13 ± 0.34 |
build: 3071c0a5 (3557)
GGML_NO_OPENMP=1 make -j32 llama-bench
llama.cpp-threadpool$ ./llama-bench -m ../gguf/llama-v3.1-8B.q4_k_s-pure.gguf -t 16,32,64 -p 64 -n 16
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | CPU | 16 | pp64 | 62.82 ± 0.96 |
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | CPU | 16 | tg16 | 16.09 ± 0.28 |
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | CPU | 32 | pp64 | 110.30 ± 1.07 |
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | CPU | 32 | tg16 | 19.15 ± 0.82 |
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | CPU | 64 | pp64 | 122.25 ± 5.91 |
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | CPU | 64 | tg16 | 19.21 ± 0.47 |
build: 323181f2 (3573)
A real use case does much better:
llama.cpp-master$ ./llama-cli -m ../gguf/llama-v3.1-8B.q4_k_s-pure.gguf -t 32 --seed 42 -p 'what is the most popular cookie in the world? (prease be brief)' -n 40
...
what is the most popular cookie in the world? (prease be brief) It is chocolate chip cookie. It is a classic favorite and one of the most beloved cookies.
What is the most popular cookie in the world?
According to various sources, including Google Trends and online baking
llama_print_timings: load time = 4399.32 ms
llama_print_timings: sample time = 3.48 ms / 40 runs ( 0.09 ms per token, 11490.95 tokens per second)
llama_print_timings: prompt eval time = 211.88 ms / 16 tokens ( 13.24 ms per token, 75.51 tokens per second)
llama_print_timings: eval time = 2267.88 ms / 39 runs ( 58.15 ms per token, 17.20 tokens per second)
llama_print_timings: total time = 2489.99 ms / 55 tokens
llama.cpp-threadpool$ ./llama-cli -m ../gguf/llama-v3.1-8B.q4_k_s-pure.gguf -t 32 --seed 42 -p 'what is the most popular cookie in the world? (prease be brief)' -n 40
...
what is the most popular cookie in the world? (prease be brief) It is chocolate chip cookie. It is a classic favorite and one of the most beloved cookies.
What is the most popular cookie in the world?
According to various sources, including Google Trends and online baking
llama_print_timings: load time = 4250.79 ms
llama_print_timings: sample time = 2.92 ms / 40 runs ( 0.07 ms per token, 13708.02 tokens per second)
llama_print_timings: prompt eval time = 203.73 ms / 16 tokens ( 12.73 ms per token, 78.54 tokens per second)
llama_print_timings: eval time = 2072.78 ms / 39 runs ( 53.15 ms per token, 18.82 tokens per second)
llama_print_timings: total time = 2285.96 ms / 55 tokens
AMD EPYC (BLIS backend)
llama.cpp-master$ ./llama-bench -m ../gguf/llama-v3.1-8B.q4_k_s-pure.gguf -t 32 -p 64 -n 16
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | BLAS | 32 | pp64 | 29.85 ± 0.25 |
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | BLAS | 32 | tg16 | 19.71 ± 0.63 |
build: 3071c0a5 (3557)
llama.cpp-threadpool$ ./llama-bench -m ../gguf/llama-v3.1-8B.q4_k_s-pure.gguf -t 32 -p 64 -n 16
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | BLAS | 32 | pp64 | 21.50 ± 1.53 |
| llama 8B Q4_K - Small | 4.21 GiB | 8.03 B | BLAS | 32 | tg16 | 17.49 ± 0.17 |
Snapdragon X-Elite (default llvm-windows build)
llama.cpp-master ./build-arm64-windows-llvm-release/bin/llama-bench.exe -m ../gguf/llama-v3.1.q4_0_4_8.gguf -t 4,6,8,10,12
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 4 | pp512 | 70.06 ± 0.06 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 4 | tg128 | 21.20 ± 0.11 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 6 | pp512 | 100.08 ± 1.78 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 6 | tg128 | 21.84 ± 0.15 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 8 | pp512 | 125.61 ± 1.52 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 8 | tg128 | 19.30 ± 3.68 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 10 | pp512 | 144.20 ± 5.02 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 10 | tg128 | 22.60 ± 0.21 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 12 | pp512 | 175.83 ± 5.59 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 12 | tg128 | 9.83 ± 7.34 |
build: 3071c0a5 (3557)
~/src/llama.cpp-threadpool ./build-arm64-windows-llvm-release/bin/llama-bench.exe -m ../gguf/llama-v3.1.q4_0_4_8.gguf -t 4,6,8,10,12
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 4 | pp512 | 70.38 ± 0.22 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 4 | tg128 | 21.71 ± 0.17 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 6 | pp512 | 100.66 ± 1.74 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 6 | tg128 | 23.14 ± 0.20 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 8 | pp512 | 126.16 ± 2.03 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 8 | tg128 | 22.91 ± 0.09 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 10 | pp512 | 151.48 ± 3.02 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 10 | tg128 | 23.61 ± 0.22 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 12 | pp512 | 185.21 ± 2.95 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 12 | tg128 | 21.82 ± 1.44 |
build: 323181f2 (3573)
llama.cpp-master ./build-arm64-windows-llvm-release/bin/llama-bench.exe -m ../gguf/llama-v3-314m.q4_0_4_8.gguf -t 4,6,8,10,12
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 4 | pp512 | 5352.91 ± 17.93 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 4 | tg128 | 345.11 ± 1.41 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 6 | pp512 | 7660.97 ± 77.94 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 6 | tg128 | 405.47 ± 5.90 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 8 | pp512 | 9699.85 ± 62.08 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 8 | tg128 | 438.34 ± 5.46 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 10 | pp512 | 11651.56 ± 158.28 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 10 | tg128 | 436.98 ± 4.93 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 12 | pp512 | 12893.53 ± 943.14 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 12 | tg128 | 408.23 ± 7.20 |
build: 3071c0a5 (3557)
~/src/llama.cpp-threadpool ./build-arm64-windows-llvm-release/bin/llama-bench.exe -m ../gguf/llama-v3-314m.q4_0_4_8.gguf -t 4,6,8,10,12
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 4 | pp512 | 5378.80 ± 8.04 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 4 | tg128 | 360.03 ± 1.04 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 6 | pp512 | 7747.20 ± 62.34 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 6 | tg128 | 530.54 ± 4.81 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 8 | pp512 | 9895.07 ± 49.93 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 8 | tg128 | 626.93 ± 4.11 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 10 | pp512 | 11836.86 ± 111.19 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 10 | tg128 | 667.55 ± 6.24 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 12 | pp512 | 13231.19 ± 524.42 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 12 | tg128 | 653.97 ± 21.23 |
build: 323181f2 (3573)
llama.cpp-master
./build-arm64-windows-llvm-release/bin/llama-cli.exe --no-mmap -m ../gguf/llama-v3.1.q4_0_4_8.gguf -tb 10 -t 6 --ctx-size 2048 --seed 42 -p '<|begin_of_text|> what is the most popular cookie in the world? (please be brief)' -n 64
...
what is the most popular cookie in the world? (please be brief)
The chocolate chip cookie is the most popular cookie in the world. (according to various sources)
Note: This answer is brief because the question asks for a brief response.
Let me know if you'd like me to elaborate!
(If you'd like more information, I'd be happy to provide some
llama_print_timings: load time = 1164.20 ms
llama_print_timings: sample time = 3.58 ms / 64 runs ( 0.06 ms per token, 17892.09 tokens per second)
llama_print_timings: prompt eval time = 90.19 ms / 16 tokens ( 5.64 ms per token, 177.41 tokens per second)
llama_print_timings: eval time = 3038.12 ms / 63 runs ( 48.22 ms per token, 20.74 tokens per second)
llama_print_timings: total time = 3141.14 ms / 79 tokens
llama.cpp-threadpool
./build-arm64-windows-llvm-release/bin/llama-cli.exe --no-mmap -m ../gguf/llama-v3.1.q4_0_4_8.gguf -tb 10 -t 6 --ctx-size 2048 --seed 42 -p '<|begin_of_text|> what is the most popular cookie in the world? (please be brief)' -n 64
...
what is the most popular cookie in the world? (please be brief)
The chocolate chip cookie is the most popular cookie in the world. (according to various sources)
Note: This answer is brief because the question asks for a brief response.
Let me know if you'd like me to elaborate!
(If you'd like more information, I'd be happy to provide some
llama_print_timings: load time = 1168.97 ms
llama_print_timings: sample time = 3.74 ms / 64 runs ( 0.06 ms per token, 17103.15 tokens per second)
llama_print_timings: prompt eval time = 87.27 ms / 16 tokens ( 5.45 ms per token, 183.33 tokens per second)
llama_print_timings: eval time = 2940.82 ms / 63 runs ( 46.68 ms per token, 21.42 tokens per second)
llama_print_timings: total time = 3042.01 ms / 79 tokens
Snapdragon Gen 3 (Galaxy S24 Ultra)
Default Android NDK build using the following CMake preset
{
"name": "arm64-android",
"cacheVariables": {
"ANDROID_ABI": "arm64-v8a",
"ANDROID_PLATFORM": "android-31",
"CMAKE_TOOLCHAIN_FILE": "$env{NDK}/build/cmake/android.toolchain.cmake",
"CMAKE_C_FLAGS": "-march=armv8.7a -fvectorize -ffp-model=fast -fno-finite-math-only -flto -D_GNU_SOURCE",
"CMAKE_CXX_FLAGS": "-march=armv8.7a -fvectorize -ffp-model=fast -fno-finite-math-only -flto -D_GNU_SOURCE",
"CMAKE_C_FLAGS_RELEASE": "-O3 -DNDEBUG",
"CMAKE_CXX_FLAGS_RELEASE": "-O3 -DNDEBUG"
}
}
The threadpool branch is built with -D GGML_OPENMP=OFF.
adb shell "cd /data/local/tmp/lmcp; ./run-bench.sh master llama-v3.1.q4_0_4_8.gguf -t 4,6 -p 32 -n 16"
export 'LD_LIBRARY_PATH=/data/local/tmp/lmcp/master'
taskset fc ./master/llama-bench --mmap 0 --n-gpu-layers 0 -m llama-v3.1.q4_0_4_8.gguf -t 4,6 -p 32 -n 16
| model | size | params | backend | threads | mmap | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 4 | 0 | pp32 | 40.43 ± 1.08 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 4 | 0 | tg16 | 10.29 ± 0.13 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 6 | 0 | pp32 | 48.27 ± 0.24 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 6 | 0 | tg16 | 10.23 ± 0.13 |
adb shell "cd /data/local/tmp/lmcp; ./run-bench.sh threadpool llama-v3.1.q4_0_4_8.gguf -t 4,6 -p 32 -n 16"`
export 'LD_LIBRARY_PATH=/data/local/tmp/lmcp/threadpool'
taskset fc ./threadpool/llama-bench --mmap 0 --n-gpu-layers 0 -m llama-v3.1.q4_0_4_8.gguf -t 4,6 -p 32 -n 16
| model | size | params | backend | threads | mmap | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 4 | 0 | pp32 | 40.53 ± 0.77 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 4 | 0 | tg16 | 10.59 ± 0.16 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 6 | 0 | pp32 | 48.69 ± 0.31 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 6 | 0 | tg16 | 10.40 ± 0.13 |
adb shell "cd /data/local/tmp/lmcp; ./run-bench.sh master llama-v3-314m.q4_0_4_8.gguf -t 4,6 -p 32 -n 16"`
export 'LD_LIBRARY_PATH=/data/local/tmp/lmcp/master'
taskset fc ./master/llama-bench --mmap 0 --n-gpu-layers 0 -m llama-v3-314m.q4_0_4_8.gguf -t 4,6 -p 32 -n 16
| model | size | params | backend | threads | mmap | test | t/s |
|---|---|---|---|---|---|---|---|
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 4 | 0 | pp32 | 4006.52 ± 72.66 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 4 | 0 | tg16 | 303.58 ± 5.96 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 6 | 0 | pp32 | 5065.32 ± 82.08 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 6 | 0 | tg16 | 302.77 ± 20.62 |
adb shell "cd /data/local/tmp/lmcp; ./run-bench.sh threadpool llama-v3-314m.q4_0_4_8.gguf -t 4,6 -p 32 -n 16"`
export 'LD_LIBRARY_PATH=/data/local/tmp/lmcp/threadpool'
taskset fc ./threadpool/llama-bench --mmap 0 --n-gpu-layers 0 -m llama-v3-314m.q4_0_4_8.gguf -t 4,6 -p 32 -n 16
| model | size | params | backend | threads | mmap | test | t/s |
|---|---|---|---|---|---|---|---|
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 4 | 0 | pp32 | 4097.12 ± 22.80 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 4 | 0 | tg16 | 312.81 ± 1.88 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 6 | 0 | pp32 | 5127.89 ± 35.28 |
| llama ?B Q4_0_4_8 | 200.79 MiB | 314.06 M | CPU | 6 | 0 | tg16 | 337.77 ± 6.57 |
Edit: I totally forgot that GGML_OPENMP is disabled only for CMake builds... so the numbers below are OpenMP-only. (Interesting that there is any change at all...)
@slaren @max-krasnyansky latest CUDA numbers:
Stories260K: $ ./scripts/compare-llama-bench.py -b master -c threadpool
| GPU | Model | NKVO | Test | t/s master | t/s threadpool | Speedup |
|---|---|---|---|---|---|---|
| RTX 3060 Laptop GPU | llama ?B all F32 (guessed) | No | pp512 | 199949.37 | 199425.82 | 1.00 |
| RTX 3060 Laptop GPU | llama ?B all F32 (guessed) | No | tg128 | 2472.27 | 2585.31 | 1.05 |
| RTX 3060 Laptop GPU | llama ?B all F32 (guessed) | Yes | pp512 | 12503.69 | 12627.24 | 1.01 |
| RTX 3060 Laptop GPU | llama ?B all F32 (guessed) | Yes | tg128 | 1632.84 | 1642.70 | 1.01 |
Llama v2 7B:
| GPU | Model | NKVO | Test | t/s master | t/s threadpool | Speedup |
|---|---|---|---|---|---|---|
| RTX 3060 Laptop GPU | llama 7B Q4_0 | No | pp512 | 1654.38 | 1658.14 | 1.00 |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | No | tg128 | 66.71 | 66.82 | 1.00 |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | Yes | pp512 | 288.97 | 295.97 | 1.02 |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | Yes | tg128 | 54.52 | 54.90 | 1.01 |
This is with OpenMP disabled on the threadpool branch:
$ LLAMA_CUDA=1 ./scripts/compare-commits.sh master threadpool -nkvo 0,1 -m models/7B/llama7b.gguf
| GPU | Model | NKVO | Test | t/s master | t/s threadpool | Speedup |
|---|---|---|---|---|---|---|
| RTX 3060 Laptop GPU | llama 7B Q4_0 | No | pp512 | 1657.23 | 1659.78 | 1.00 |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | No | tg128 | 66.77 | 66.24 | 0.99 |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | Yes | pp512 | 302.17 | 301.35 | 1.00 |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | Yes | tg128 | 55.02 | 54.87 | 1.00 |
Can confirm it's slightly worse on stories260K:
| GPU | Model | NKVO | Test | t/s master | t/s threadpool | Speedup |
|---|---|---|---|---|---|---|
| RTX 3060 Laptop GPU | llama ?B all F32 (guessed) | No | pp512 | 199562.93 | 193086.43 | 0.97 |
| RTX 3060 Laptop GPU | llama ?B all F32 (guessed) | No | tg128 | 2517.85 | 2399.51 | 0.95 |
| RTX 3060 Laptop GPU | llama ?B all F32 (guessed) | Yes | pp512 | 12702.00 | 12819.76 | 1.01 |
| RTX 3060 Laptop GPU | llama ?B all F32 (guessed) | Yes | tg128 | 1646.64 | 1628.08 | 0.99 |
@slaren
@fmz and I worked on further improvements (removing special cases, reducing branches, etc) and at this point it seems like it should be good to merge.
I believe the BLAS/BLIS backend might need further work. I took a look at it and realized that ggml-blas.cpp wants a generic threadpool that executes arbitrary functions, while the threadpool we've added so far is designed specifically for graph_compute. It's of course possible to update it and make it more generic, assuming there is interest in updating the BLAS/BLIS backend. From my testing it seems to be generally much slower, so I'm not sure how much we want to invest in it. Perhaps we can just add a check in make/cmake that the BLAS backend requires OpenMP for now?
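For context, this is roughly the shape of the "generic" interface ggml-blas.cpp would want (a standalone sketch with made-up names, not this PR's API): a persistent set of workers that can run an arbitrary chunked function, e.g. dequantizing weight rows before the BLAS call, instead of being hard-wired to graph_compute.

```cpp
#include <condition_variable>
#include <cstdint>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Persistent workers that can execute any chunked task (illustrative sketch).
class generic_pool {
public:
    explicit generic_pool(int n_threads) {
        for (int i = 0; i < n_threads; i++) {
            workers.emplace_back([this, i] { worker_loop(i); });
        }
    }
    ~generic_pool() {
        { std::lock_guard<std::mutex> lock(mtx); stop = true; }
        cv.notify_all();
        for (auto & w : workers) { w.join(); }
    }
    // Run fn(worker_index, n_workers) on every worker and wait for completion.
    void run(const std::function<void(int, int)> & fn) {
        std::unique_lock<std::mutex> lock(mtx);
        task = &fn;
        pending = (int) workers.size();
        generation++;
        cv.notify_all();
        done_cv.wait(lock, [this] { return pending == 0; });
        task = nullptr;
    }
private:
    void worker_loop(int idx) {
        uint64_t seen = 0;
        while (true) {
            const std::function<void(int, int)> * fn = nullptr;
            {
                std::unique_lock<std::mutex> lock(mtx);
                cv.wait(lock, [&] { return stop || generation != seen; });
                if (stop) { return; }
                seen = generation;
                fn   = task;
            }
            (*fn)(idx, (int) workers.size());
            std::lock_guard<std::mutex> lock(mtx);
            if (--pending == 0) { done_cv.notify_one(); }
        }
    }
    std::vector<std::thread>              workers;
    std::mutex                            mtx;
    std::condition_variable               cv, done_cv;
    const std::function<void(int, int)> * task       = nullptr;
    int                                   pending    = 0;
    uint64_t                              generation = 0;
    bool                                  stop       = false;
};
```

ggml-blas.cpp could then call something like pool.run(...) for the dequantization step, reusing the same parked workers instead of spawning std::future tasks per matmul.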
Perf numbers on the Snapdragons and the M2 are a bit better but overall similar to what I shared above; the perf profiles are looking cleaner though (things like total branches, missed branches, etc).
Here is a fresh run from a Ryzen 9 3950X + RTX 3080 on Ubuntu 22.04, testing the nkvo scenario that had regressions before.
GGML_CUDA=1 GGML_NO_OPENMP=1 make -j16 llama-cli llama-bench
llama.cpp-master$ nice -20 ./llama-bench -m ../gguf/llama-v3.1.q4_0.gguf -ngl 0,99 -nkvo 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | nkvo | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 0 | 0 | pp512 | 1090.70 ± 22.64 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 0 | 0 | tg128 | 10.01 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 0 | 1 | pp512 | 527.11 ± 0.31 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 0 | 1 | tg128 | 10.03 ± 0.00 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 99 | 0 | pp512 | 4384.25 ± 13.19 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 99 | 0 | tg128 | 110.16 ± 0.12 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 99 | 1 | pp512 | 798.64 ± 6.35 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 99 | 1 | tg128 | 95.40 ± 0.15 |
build: 06943a69 (3581)
GGML_CUDA=1 GGML_NO_OPENMP=1 make -j16 llama-cli llama-bench
llama.cpp-threadpool$ nice -20 ./llama-bench -m ../gguf/llama-v3.1.q4_0.gguf -ngl 0,99 -nkvo 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | nkvo | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 0 | 0 | pp512 | 1114.06 ± 0.45 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 0 | 0 | tg128 | 9.97 ± 0.02 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 0 | 1 | pp512 | 534.33 ± 0.28 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 0 | 1 | tg128 | 9.99 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 99 | 0 | pp512 | 4369.85 ± 8.77 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 99 | 0 | tg128 | 109.93 ± 0.11 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 99 | 1 | pp512 | 824.77 ± 6.40 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA | 99 | 1 | tg128 | 96.29 ± 0.22 |
build: 9cd5a61d (3599)
I see that one of the server tests failed in the CI. I just ran the same thing locally and can't reproduce the failure. Will keep an eye on it.
The BLAS backend is still important at least in macOS because Accelerate is significantly faster. OpenMP is also not available in macOS. In my opinion this is the most important use case, because other platforms have access to good implementations of OpenMP.
I think I hit a deadlock when testing with LLAMA_DEBUG=1 GGML_BLIS=1 GGML_NO_OPENMP=1 GGML_NO_LLAMAFILE=1:
./llama-bench -m models/stories260K.gguf -r 10 -t 16
Note: 16 other threads idling in OpenMP (from BLIS) omitted.
Thread 16 (Thread 0x79f5dfbed6c0 (LWP 22390) "llama-bench"):
#0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335
#1 __cpu_relax () at ggml/src/ggml.c:3060
#2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142
#3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabf428) at ggml/src/ggml.c:19206
#4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabf428) at ggml/src/ggml.c:19296
#5 0x000079f5e8a97b5a in start_thread (arg=
Thread 15 (Thread 0x79f5e03ee6c0 (LWP 22389) "llama-bench"):
#0 0x00005799cc04587a in __cpu_relax () at ggml/src/ggml.c:3061
#1 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142
#2 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabf200) at ggml/src/ggml.c:19206
#3 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabf200) at ggml/src/ggml.c:19296
#4 0x000079f5e8a97b5a in start_thread (arg=
Thread 14 (Thread 0x79f5e0bef6c0 (LWP 22388) "llama-bench"):
#0 ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3139
#1 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabefd8) at ggml/src/ggml.c:19206
#2 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabefd8) at ggml/src/ggml.c:19296
#3 0x000079f5e8a97b5a in start_thread (arg=
Thread 13 (Thread 0x79f5e13f06c0 (LWP 22387) "llama-bench"):
#0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335
#1 __cpu_relax () at ggml/src/ggml.c:3060
#2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142
#3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabedb0) at ggml/src/ggml.c:19206
#4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabedb0) at ggml/src/ggml.c:19296
#5 0x000079f5e8a97b5a in start_thread (arg=
Thread 12 (Thread 0x79f5e1bf16c0 (LWP 22386) "llama-bench"):
#0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335
#1 __cpu_relax () at ggml/src/ggml.c:3060
#2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142
#3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabeb88) at ggml/src/ggml.c:19206
#4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabeb88) at ggml/src/ggml.c:19296
#5 0x000079f5e8a97b5a in start_thread (arg=
Thread 11 (Thread 0x79f5e23f26c0 (LWP 22385) "llama-bench"):
#0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335
#1 __cpu_relax () at ggml/src/ggml.c:3060
#2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142
#3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabe960) at ggml/src/ggml.c:19206
#4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabe960) at ggml/src/ggml.c:19296
#5 0x000079f5e8a97b5a in start_thread (arg=
Thread 10 (Thread 0x79f5e2bf36c0 (LWP 22384) "llama-bench"):
#0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335
#1 __cpu_relax () at ggml/src/ggml.c:3060
#2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142
#3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabe738) at ggml/src/ggml.c:19206
#4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabe738) at ggml/src/ggml.c:19296
#5 0x000079f5e8a97b5a in start_thread (arg=
Thread 9 (Thread 0x79f5e33f46c0 (LWP 22383) "llama-bench"):
#0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335
#1 __cpu_relax () at ggml/src/ggml.c:3060
#2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142
#3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabe510) at ggml/src/ggml.c:19206
#4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabe510) at ggml/src/ggml.c:19296
#5 0x000079f5e8a97b5a in start_thread (arg=
Thread 8 (Thread 0x79f5e3bf56c0 (LWP 22382) "llama-bench"):
#0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335
#1 __cpu_relax () at ggml/src/ggml.c:3060
#2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142
#3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabe2e8) at ggml/src/ggml.c:19206
#4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabe2e8) at ggml/src/ggml.c:19296
#5 0x000079f5e8a97b5a in start_thread (arg=
Thread 7 (Thread 0x79f5e43f66c0 (LWP 22381) "llama-bench"):
#0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335
#1 __cpu_relax () at ggml/src/ggml.c:3060
#2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142
#3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabe0c0) at ggml/src/ggml.c:19206
#4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabe0c0) at ggml/src/ggml.c:19296
#5 0x000079f5e8a97b5a in start_thread (arg=
Thread 6 (Thread 0x79f5e4bf76c0 (LWP 22380) "llama-bench"):
#0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335
#1 __cpu_relax () at ggml/src/ggml.c:3060
#2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142
#3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabde98) at ggml/src/ggml.c:19206
#4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabde98) at ggml/src/ggml.c:19296
#5 0x000079f5e8a97b5a in start_thread (arg=
Thread 5 (Thread 0x79f5e53f86c0 (LWP 22379) "llama-bench"):
#0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335
#1 __cpu_relax () at ggml/src/ggml.c:3060
#2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142
#3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabdc70) at ggml/src/ggml.c:19206
#4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabdc70) at ggml/src/ggml.c:19296
#5 0x000079f5e8a97b5a in start_thread (arg=
Thread 4 (Thread 0x79f5e5bf96c0 (LWP 22378) "llama-bench"):
#0 0x00005799cc04587a in __cpu_relax () at ggml/src/ggml.c:3061
#1 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142
#2 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabda48) at ggml/src/ggml.c:19206
#3 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabda48) at ggml/src/ggml.c:19296
#4 0x000079f5e8a97b5a in start_thread (arg=
Thread 3 (Thread 0x79f5e63fa6c0 (LWP 22377) "llama-bench"):
#0 0x00005799cc04587a in __cpu_relax () at ggml/src/ggml.c:3061
#1 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142
#2 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabd820) at ggml/src/ggml.c:19206
#3 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabd820) at ggml/src/ggml.c:19296
#4 0x000079f5e8a97b5a in start_thread (arg=
Thread 2 (Thread 0x79f5e6bfb6c0 (LWP 22376) "llama-bench"):
#0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335
#1 __cpu_relax () at ggml/src/ggml.c:3060
#2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142
#3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabd5f8) at ggml/src/ggml.c:19206
#4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabd5f8) at ggml/src/ggml.c:19296
#5 0x000079f5e8a97b5a in start_thread (arg=
Thread 1 (Thread 0x79f5e9dc4c40 (LWP 22375) "llama-bench"):
#0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335
#1 __cpu_relax () at ggml/src/ggml.c:3060
#2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142
#3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabd3d0) at ggml/src/ggml.c:19206
#4 0x00005799cc0813a2 in ggml_graph_compute (cgraph=0x5799cda66998, cplan=0x7ffc9d6a28a0) at ggml/src/ggml.c:19502
#5 0x00005799cc092a1a in ggml_backend_cpu_graph_compute (backend=0x5799cda872b0, cgraph=0x5799cda66998) at ggml/src/ggml-backend.c:817
#6 0x00005799cc091807 in ggml_backend_graph_compute_async (backend=0x5799cda872b0, cgraph=0x5799cda66998) at ggml/src/ggml-backend.c:282
#7 0x00005799cc0966b4 in ggml_backend_sched_compute_splits (sched=0x5799cda5f370) at ggml/src/ggml-backend.c:1805
#8 0x00005799cc0972c8 in ggml_backend_sched_graph_compute_async (sched=0x5799cda5f370, graph=0x79f5e82df030) at ggml/src/ggml-backend.c:1992
#9 0x00005799cc1295a8 in llama_graph_compute (lctx=..., gf=0x79f5e82df030, n_threads=16, threadpool=0x5799cda65990) at src/llama.cpp:14527
#10 0x00005799cc12a344 in llama_decode_internal (lctx=..., batch_all=...) at src/llama.cpp:14781
#11 0x00005799cc1382f1 in llama_decode (ctx=0x5799cda5fe80, batch=...) at src/llama.cpp:18600
#12 0x00005799cc354d9b in test_prompt (ctx=0x5799cda5fe80, n_prompt=512, n_past=0, n_batch=2048, n_threads=16) at examples/llama-bench/llama-bench.cpp:1349
#13 0x00005799cc3558a6 in main (argc=7, argv=0x7ffc9d6a3798) at examples/llama-bench/llama-bench.cpp:1485
@slaren
The BLAS backend is still important at least in macOS because Accelerate is significantly faster. OpenMP is also not available in macOS. In my opinion this is the most important use case, because other platforms have access to good implementations of OpenMP.
Got it. Makes sense for the BLAS then.
For other platforms there are several other advantages of using a dedicated threadpool vs OpenMP: things like the ability to specify affinity masks, priorities, etc. per llama.cpp/ggml instance. With OpenMP those settings are global to the process, i.e. if an app that links to libllama.so/libggml.so uses OpenMP for other stuff (say it links some other lib that uses OpenMP), then the settings conflict with each other. There are other things too, like being able to reuse threadpools between llama_ctx instances, reduced dependencies, etc.
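To illustrate the kind of per-thread control being referred to, here is a minimal Linux-only sketch (not code from this PR; the function name and the core/priority choices are just examples) of pinning a worker to a specific core and adjusting its priority, which a dedicated threadpool can do per worker while OpenMP settings stay process-wide:

```cpp
#include <pthread.h>
#include <sched.h>
#include <sys/resource.h>
#include <thread>

// Pin the calling thread to a single core and set its nice value.
// With the Linux/NPTL implementation, setpriority(PRIO_PROCESS, 0, ...)
// affects only the calling thread, so each worker can get its own priority.
static void place_worker(int cpu, int nice_value) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask); // "strict" placement
    setpriority(PRIO_PROCESS, 0, nice_value);
}

int main() {
    // Hypothetical example: two workers pinned to cores 0 and 1.
    std::thread t0([] { place_worker(0, 0); /* ... graph compute ... */ });
    std::thread t1([] { place_worker(1, 0); /* ... graph compute ... */ });
    t0.join();
    t1.join();
    return 0;
}
```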
I think I hit a deadlock when testing with
LLAMA_DEBUG=1 GGML_BLIS=1 GGML_NO_OPENMP=1 GGML_NO_LLAMAFILE=1:
./llama-bench -m models/stories260K.gguf -r 10 -t 16
Oh. Odd. I thought I tested that use-case. Will follow up asap.
@slaren
Sorry for not catching this earlier. The timing had to be just right to trigger that race.
I reproduced it with
while true; do ./llama-bench -m ../gguf/stories260K.gguf -r 10 -t 16; done on my Ryzen 9 system.
Fixed it, and let that loop run for a couple of hours to make sure.
Do you think we can merge this more or less as is and then work on extending things to accommodate the BLAS backend as a follow-up? See above for the general benefits vs OpenMP, and it speeds things up on ARM64 CPUs (see my report above). It'd be good to get Windows-on-ARM and Android releases going with the threadpool enabled. We'll definitely follow up on BLAS, and there are further ideas as well (reusing temporary pools, etc).
@slaren
Another quick question: ggml-blas.cpp is C++ and uses C++11 features like std::future when OpenMP is disabled.
Would it be OK to do Thread Pool V3 in C++? We can add some extern "C" APIs to call from ggml.c, but it'd be nice if the core threadpool logic were in C++ (with clean std::atomic, std::thread, ...). This way we could remove the pthread wrappers and such; we'd still need a few OS-specific functions for the CPU affinity and priority stuff, but the core bits would just be clean C++11.
We'd create ggml-thread.cpp and implement all the threading/CPU/NUMA-related stuff there, again with some extern "C" APIs for the rest of GGML. A rough sketch of that split is below.
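Purely illustrative (the symbol names here, including ggml_threadpool_create/release, are made up and not the actual API): ggml-thread.cpp would keep the core in C++ and expose only a thin C surface to ggml.c.

```cpp
#include <thread>
#include <vector>

// Core lives in C++ (this would be ggml-thread.cpp in the proposal).
struct ggml_threadpool_impl {
    std::vector<std::thread> workers; // std::thread instead of pthread wrappers
    // std::atomic / std::condition_variable state would live here as well
};

extern "C" {

// Opaque handle and thin C API visible to ggml.c (names are hypothetical).
typedef struct ggml_threadpool_impl * ggml_threadpool_t;

ggml_threadpool_t ggml_threadpool_create(int n_threads) {
    auto * tp = new ggml_threadpool_impl();
    tp->workers.reserve(n_threads);
    // workers would be started here and parked until graph_compute dispatches work
    return tp;
}

void ggml_threadpool_release(ggml_threadpool_t tp) {
    delete tp; // stopping/joining the workers would happen in the destructor
}

} // extern "C"
```

The C side of ggml would then only ever see the opaque handle and the extern "C" functions, while all the synchronization primitives stay in C++.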
We could do this as the followup to this current Thread Pool V2 version. Please see the question/suggestion above.
The deadlock also seems to be fixed here. I think we can merge if there aren't any significant performance regressions. I will do a more in-depth review in the following days; so far I have only looked at the performance. Using C++ would be good as long as the public ggml interface remains compatible with C; in the future we will probably continue porting parts of ggml to C++.
As an extra data point: I'm not seeing a performance regression on this branch on my EPYC system. I'm seeing a single-digit percentage speedup vs master, in fact.
@ggerganov @slaren Do you have any more suggestions/comments/concerns regarding this PR? I would suggest we merge it in and create issues to track the BLAS/BLIS improvements and/or the move to C++ synchronization primitives.
Not critical, but I noticed that there is a performance regression with partial offloading (-ngl 10) with Metal, at least with small models:
scripts/compare-commits.sh master threadpool -m models/tinyllama-1.1b-intermediate-step-480k-1t.Q8_0.gguf -m models/llama-2-7b/ggml-model-Q4_0.gguf -t 4,8 -ngl 10
| CPU | Model | Model Size [GiB] | Threads | Test | t/s master | t/s threadpool | Speedup |
|---|---|---|---|---|---|---|---|
| M3 Max | llama 1B Q8_0 | 1.09 | 4 | pp512 | 1163.31 | 1067.73 | 0.92 |
| M3 Max | llama 1B Q8_0 | 1.09 | 4 | tg128 | 102.36 | 83.20 | 0.81 |
| M3 Max | llama 1B Q8_0 | 1.09 | 8 | pp512 | 1292.88 | 1184.32 | 0.92 |
| M3 Max | llama 1B Q8_0 | 1.09 | 8 | tg128 | 104.22 | 89.58 | 0.86 |
| M3 Max | llama 7B Q4_0 | 3.56 | 4 | pp512 | 185.83 | 185.25 | 1.00 |
| M3 Max | llama 7B Q4_0 | 3.56 | 4 | tg128 | 24.28 | 24.21 | 1.00 |
| M3 Max | llama 7B Q4_0 | 3.56 | 8 | pp512 | 194.85 | 191.47 | 0.98 |
| M3 Max | llama 7B Q4_0 | 3.56 | 8 | tg128 | 30.21 | 31.09 | 1.03 |