
Threadpool: take 2

Open fmz opened this issue 1 year ago • 17 comments

ref: original PR #7526

Added an API to support explicit management and fine-grained control of threadpools. The API supports creating different threadpools for various parts of execution, e.g. batch, single-token, etc. Each threadpool can be created, paused, resumed, and released independently of any other threadpool. This mitigates the overhead of starting/stopping threads for each decode call and helps OSes keep track of scheduling history in order to make better scheduling decisions.

Each threadpool supports:

- Setting the number of threads (duh)
- Setting a CPU mask for threads to be placed on
- Strict/relaxed placement: pinning specific threads to specific cores, or letting the OS decide
- Polling or interrupt-driven wait
- Setting thread priority

Using threadpools explicitly is optional. If llama_decode is called with a llama_context that doesn't have a threadpool attached, a disposable threadpool is created (same as the current behavior). If users choose to use threadpools explicitly, they have to manage them manually. See the example in main.cpp and the rough sketch below.
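As a rough illustration of the intended flow, here is a minimal sketch. Apart from ggml_create_threadpool / ggml_compute_threadpool, the parameter fields and the attach/release calls below are illustrative assumptions, not the exact API; main.cpp in the PR has the authoritative example.

```cpp
// Sketch only: field names and the attach/release calls are assumptions for
// illustration, not the PR's exact API.

// Describe the pool: thread count, CPU mask, placement, polling, priority.
ggml_threadpool_params tpp = {};
tpp.n_threads  = 8;       // number of worker threads
tpp.strict_cpu = true;    // pin each thread to a specific core from the mask
tpp.poll       = true;    // spin (poll) for work instead of sleeping on a condvar
tpp.prio       = 0;       // scheduling priority

// Separate pools for prompt (batch) processing and single-token generation.
ggml_compute_threadpool * tp_batch = ggml_create_threadpool(&tpp);
ggml_compute_threadpool * tp_gen   = ggml_create_threadpool(&tpp);

// Hypothetical attach call: without an attached pool, llama_decode falls back to
// a disposable per-call threadpool (the current behavior).
llama_attach_threadpool(ctx, tp_gen, tp_batch);

// ... call llama_decode(ctx, batch) as usual; pause/resume the pools between requests ...

// Release the pools when done (hypothetical name).
ggml_release_threadpool(tp_batch);
ggml_release_threadpool(tp_gen);
```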

With all the bells and whistles enabled, we generally see a minor improvement vs OMP. Without polling, threadpool runs on par with OMP.

fmz avatar Jul 24 '24 15:07 fmz

Here are some perf figures:

On W-2225 Xeon machine: CPU backend:

| CPU | Model | Test | t/s master | t/s threadpool | Speedup |
|:----|:------|:-----|-----------:|---------------:|--------:|
| Intel(R) Xeon(R) W-2225 CPU @ 4.10GHz | llama 7B Q4_0 | pp512 | 17.46 | 17.51 | 1.00 |
| Intel(R) Xeon(R) W-2225 CPU @ 4.10GHz | llama 7B Q4_0 | tg128 | 6.98 | 7.06 | 1.01 |

Intel 10th-gen CPU: ./scripts/compare-commits.sh master threadpool -t 1,2,4,6,8,10

| CPU | Model | Threads | Test | t/s master | t/s threadpool-attempt-2 | Speedup |
|:----|:------|--------:|:-----|-----------:|-------------------------:|--------:|
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 1 | pp512 | 3.93 | 3.94 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 1 | tg128 | 2.43 | 2.44 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 2 | pp512 | 7.13 | 7.06 | 0.99 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 2 | tg128 | 4.37 | 4.36 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 4 | pp512 | 11.96 | 11.99 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 4 | tg128 | 6.79 | 6.77 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 6 | pp512 | 14.96 | 14.98 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 6 | tg128 | 7.51 | 7.53 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 8 | pp512 | 13.06 | 13.09 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 8 | tg128 | 6.88 | 6.83 | 0.99 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 10 | pp512 | 14.08 | 14.06 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | 10 | tg128 | 7.49 | 7.52 | 1.00 |

Mobile NVIDIA 3060: $ LLAMA_CUDA=1 ./scripts/compare-commits.sh master threadpool -nkvo 0,1

| GPU | Model | NKVO | Test | t/s master | t/s threadpool-attempt-2 | Speedup |
|:----|:------|:-----|:-----|-----------:|-------------------------:|--------:|
| RTX 3060 Laptop GPU | llama 7B Q4_0 | No | pp512 | 1644.73 | 1642.34 | 1.00 |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | No | tg128 | 65.94 | 65.89 | 1.00 |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | Yes | pp512 | 287.28 | 286.44 | 1.00 |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | Yes | tg128 | 54.56 | 54.32 | 1.00 |

fmz avatar Jul 24 '24 16:07 fmz

@slaren Threadpool is back! I updated it a bit to align with the latest graph-compute design. The current performance is largely on par with OpenMP. Please let me know if you have any comments/suggestions.

fmz avatar Jul 26 '24 16:07 fmz

I tried to test this on macOS, but it seems to deadlock.

WARNING: ThreadSanitizer: data race (pid=62377)
  Write of size 1 at 0x00010ab02a8e by main thread:
    #0 ggml_graph_compute ggml.c:19365 (llama-bench:arm64+0x10003fb54)
    #1 ggml_backend_cpu_graph_compute ggml-backend.c:822 (llama-bench:arm64+0x1000a5f1c)
    #2 ggml_backend_graph_compute_async ggml-backend.c:282 (llama-bench:arm64+0x10009bac0)
    #3 ggml_backend_sched_compute_splits ggml-backend.c:1795 (llama-bench:arm64+0x1000a3190)
    #4 ggml_backend_sched_graph_compute_async ggml-backend.c:1979 (llama-bench:arm64+0x1000a2d24)
    #5 llama_graph_compute(llama_context&, ggml_cgraph*, int, ggml_compute_threadpool*) llama.cpp:14412 (llama-bench:arm64+0x100292cac)
    #6 llama_decode_internal(llama_context&, llama_batch) llama.cpp:14666 (llama-bench:arm64+0x1000fda4c)
    #7 llama_decode llama.cpp:18489 (llama-bench:arm64+0x1000fc460)
    #8 test_prompt(llama_context*, int, int, int, int) llama-bench.cpp:1319 (llama-bench:arm64+0x10062bd5c)
    #9 main llama-bench.cpp:1454 (llama-bench:arm64+0x100627180)

  Previous read of size 1 at 0x00010ab02a8e by thread T12 (mutexes: write M0):
    #0 ggml_graph_compute_check_for_work ggml.c:19152 (llama-bench:arm64+0x100053a10)
    #1 ggml_graph_compute_secondary_thread ggml.c:19189 (llama-bench:arm64+0x1000537dc)

  Location is heap block of size 192 at 0x00010ab02a00 allocated by main thread:
    #0 posix_memalign <null>:96112260 (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x564c0)
    #1 ggml_aligned_malloc ggml.c:241 (llama-bench:arm64+0x10001ac88)
    #2 ggml_create_threadpool_impl ggml.c:19214 (llama-bench:arm64+0x10003f14c)
    #3 ggml_create_threadpool ggml.c:19292 (llama-bench:arm64+0x10003f0b8)
    #4 main llama-bench.cpp:1418 (llama-bench:arm64+0x100626cd0)

  Mutex M0 (0x00010ab02a00) created at:
    #0 pthread_mutex_init <null>:96112260 (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x31470)
    #1 ggml_create_threadpool_impl ggml.c:19238 (llama-bench:arm64+0x10003f404)
    #2 ggml_create_threadpool ggml.c:19292 (llama-bench:arm64+0x10003f0b8)
    #3 main llama-bench.cpp:1418 (llama-bench:arm64+0x100626cd0)

  Thread T12 (tid=36579987, running) created by main thread at:
    #0 pthread_create <null>:96112260 (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x3062c)
    #1 ggml_create_threadpool_impl ggml.c:19277 (llama-bench:arm64+0x10003f638)
    #2 ggml_create_threadpool ggml.c:19292 (llama-bench:arm64+0x10003f0b8)
    #3 main llama-bench.cpp:1418 (llama-bench:arm64+0x100626cd0)

SUMMARY: ThreadSanitizer: data race ggml.c:19365 in ggml_graph_compute
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x0000000104baf664 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3018:29
    frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x000000012481fc00) at ggml.c:19132:5
    frame #2: 0x0000000104ba17ec llama-bench`ggml_graph_compute(cgraph=0x00000001182901b8, cplan=0x000000016b28a730) at ggml.c:19373:5
    frame #3: 0x0000000104be8400 llama-bench`ggml_backend_cpu_graph_compute(backend=0x00006000039680b0, cgraph=0x00000001182901b8) at ggml-backend.c:822:12
    frame #4: 0x0000000104be23c4 llama-bench`ggml_backend_graph_compute_async(backend=0x00006000039680b0, cgraph=0x00000001182901b8) at ggml-backend.c:282:12
    frame #5: 0x0000000104be6834 llama-bench`ggml_backend_sched_compute_splits(sched=0x0000000115000000) at ggml-backend.c:1795:35
    frame #6: 0x0000000104be65a0 llama-bench`ggml_backend_sched_graph_compute_async(sched=0x0000000115000000, graph=0x0000000118420020) at ggml-backend.c:1979:12
    frame #7: 0x0000000104d09f44 llama-bench`llama_graph_compute(lctx=0x0000000114813e00, gf=0x0000000118420020, n_threads=12, threadpool=0x0000600003b6c3c0) at llama.cpp:14412:5
    frame #8: 0x0000000104c2b148 llama-bench`llama_decode_internal(lctx=0x0000000114813e00, batch_all=llama_batch @ 0x000000016b28ac60) at llama.cpp:14666:9
    frame #9: 0x0000000104c2a15c llama-bench`llama_decode(ctx=0x0000000114813e00, batch=llama_batch @ 0x000000016b28ad08) at llama.cpp:18489:21
    frame #10: 0x0000000104f3ecbc llama-bench`test_prompt(ctx=0x0000000114813e00, n_prompt=32, n_past=0, n_batch=2048, n_threads=12) at llama-bench.cpp:1319:9
    frame #11: 0x0000000104f3ae44 llama-bench`main(argc=9, argv=0x000000016b28b940) at llama-bench.cpp:1454:13
    frame #12: 0x000000018fbae0e0 dyld`start + 2360
  thread #2
    frame #0: 0x0000000104baf664 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3018:29
    frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x000000012481fe20) at ggml.c:19132:5
    frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x000000012481fe20) at ggml.c:19191:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #3
    frame #0: 0x0000000104baf690 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3019:54
    frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124820040) at ggml.c:19132:5
    frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124820040) at ggml.c:19191:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #4
    frame #0: 0x0000000104baf664 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3018:29
    frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124820260) at ggml.c:19132:5
    frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124820260) at ggml.c:19191:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #5
    frame #0: 0x0000000104baf690 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3019:54
    frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124820480) at ggml.c:19132:5
    frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124820480) at ggml.c:19191:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #6
    frame #0: 0x0000000104baf690 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3019:54
    frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x00000001248206a0) at ggml.c:19132:5
    frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001248206a0) at ggml.c:19191:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #7
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x0000000104baecf4 llama-bench`ggml_graph_compute_check_for_work(state=0x00000001248208c0) at ggml.c:19154:17
    frame #3: 0x0000000104baeaf8 llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001248208c0) at ggml.c:19189:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #8
    frame #0: 0x0000000104baf664 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3018:29
    frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124820ae0) at ggml.c:19132:5
    frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124820ae0) at ggml.c:19191:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #9
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x0000000104baecf4 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000124820d00) at ggml.c:19154:17
    frame #3: 0x0000000104baeaf8 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124820d00) at ggml.c:19189:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #10
    frame #0: 0x0000000104baf664 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3018:29
    frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124820f20) at ggml.c:19132:5
    frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124820f20) at ggml.c:19191:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #11
    frame #0: 0x0000000104baf664 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3018:29
    frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124821140) at ggml.c:19132:5
    frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124821140) at ggml.c:19191:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #12
    frame #0: 0x0000000104baf690 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3019:54
    frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124821360) at ggml.c:19132:5
    frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124821360) at ggml.c:19191:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #13
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x0000000104baecf4 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000124821580) at ggml.c:19154:17
    frame #3: 0x0000000104baeaf8 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124821580) at ggml.c:19189:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #14
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x0000000104baecf4 llama-bench`ggml_graph_compute_check_for_work(state=0x00000001248217a0) at ggml.c:19154:17
    frame #3: 0x0000000104baeaf8 llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001248217a0) at ggml.c:19189:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #15
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x0000000104baecf4 llama-bench`ggml_graph_compute_check_for_work(state=0x00000001248219c0) at ggml.c:19154:17
    frame #3: 0x0000000104baeaf8 llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001248219c0) at ggml.c:19189:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #16
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x0000000104baecf4 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000124821be0) at ggml.c:19154:17
    frame #3: 0x0000000104baeaf8 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124821be0) at ggml.c:19189:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #17
    frame #0: 0x000000018fef7ea4 libsystem_kernel.dylib`__workq_kernreturn + 8

slaren avatar Jul 26 '24 18:07 slaren

I tried to test this on macOS, but it seems to deadlock.

Fixed!

fmz avatar Jul 26 '24 20:07 fmz

On M2 Max (GGML_NO_METAL=1 GGML_NO_ACCELERATE=1):

| CPU | Model | Threads | Test | t/s master | t/s threadpool | Speedup |
|:----|:------|--------:|:-----|-----------:|---------------:|--------:|
|  | llama 7B Q4_0 | 4 | pp512 | 32.97 | 34.87 | 1.06 |
|  | llama 7B Q4_0 | 4 | tg128 | 18.01 | 18.37 | 1.02 |
|  | llama 7B Q4_0 | 6 | pp512 | 47.43 | 48.99 | 1.03 |
|  | llama 7B Q4_0 | 6 | tg128 | 23.10 | 23.32 | 1.01 |
|  | llama 7B Q4_0 | 8 | pp512 | 49.90 | 55.17 | 1.11 |
|  | llama 7B Q4_0 | 8 | tg128 | 18.09 | 21.98 | 1.22 |
|  | llama 7B Q4_0 | 10 | pp512 | 52.50 | 56.69 | 1.08 |
|  | llama 7B Q4_0 | 10 | tg128 | 14.24 | 8.54 | 0.60 |
|  | llama 7B Q4_0 | 12 | pp512 | 56.37 | 56.93 | 1.01 |
|  | llama 7B Q4_0 | 12 | tg128 | 5.02 | 9.44 | 1.88 |

fmz avatar Jul 26 '24 20:07 fmz

Same thing, but with llama-v3 8B Q4_0_4_4 (for some reason my compiler, AppleClang 15, doesn't support INT8 matmul?):

| CPU | Model | Threads | Test | t/s master | t/s threadpool | Speedup |
|:----|:------|--------:|:-----|-----------:|---------------:|--------:|
|  | llama 8B Q4_0_4_4 | 4 | pp512 | 72.44 | 72.83 | 1.01 |
|  | llama 8B Q4_0_4_4 | 4 | tg128 | 22.29 | 23.50 | 1.05 |
|  | llama 8B Q4_0_4_4 | 6 | pp512 | 98.71 | 100.21 | 1.02 |
|  | llama 8B Q4_0_4_4 | 6 | tg128 | 24.63 | 24.44 | 0.99 |
|  | llama 8B Q4_0_4_4 | 8 | pp512 | 95.86 | 116.17 | 1.21 |
|  | llama 8B Q4_0_4_4 | 8 | tg128 | 21.19 | 26.28 | 1.24 |
|  | llama 8B Q4_0_4_4 | 10 | pp512 | 102.37 | 105.18 | 1.03 |
|  | llama 8B Q4_0_4_4 | 10 | tg128 | 18.63 | 16.98 | 0.91 |
|  | llama 8B Q4_0_4_4 | 12 | pp512 | 108.08 | 101.18 | 0.94 |
|  | llama 8B Q4_0_4_4 | 12 | tg128 | 6.22 | 11.39 | 1.83 |

fmz avatar Jul 26 '24 21:07 fmz

If it crashes, can the error message include "deadpool"?

oldgithubman avatar Jul 26 '24 21:07 oldgithubman

@slaren lmk if it works for you this time

fmz avatar Jul 29 '24 13:07 fmz

I tested this again on the M3 Max, but it still seems to deadlock. These are the call stacks of the threads:

(lldb) bt all
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
    frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x000000013681fc00) at ggml.c:19133:5
    frame #2: 0x0000000102383190 llama-bench`ggml_graph_compute(cgraph=0x00000001085f81c8, cplan=0x000000016daa2730) at ggml.c:19374:5
    frame #3: 0x00000001023c3394 llama-bench`ggml_backend_cpu_graph_compute(backend=0x00006000025ec0b0, cgraph=0x00000001085f81c8) at ggml-backend.c:822:12
    frame #4: 0x00000001023bd840 llama-bench`ggml_backend_graph_compute_async(backend=0x00006000025ec0b0, cgraph=0x00000001085f81c8) at ggml-backend.c:282:12
    frame #5: 0x00000001023c1864 llama-bench`ggml_backend_sched_compute_splits(sched=0x000000010680c400) at ggml-backend.c:1800:35
    frame #6: 0x00000001023c15a0 llama-bench`ggml_backend_sched_graph_compute_async(sched=0x000000010680c400, graph=0x00000001081c0020) at ggml-backend.c:1987:12
    frame #7: 0x00000001024e0b58 llama-bench`llama_graph_compute(lctx=0x000000010680fa00, gf=0x00000001081c0020, n_threads=12, threadpool=0x00006000027e43c0) at llama.cpp:14425:5
    frame #8: 0x0000000102404938 llama-bench`llama_decode_internal(lctx=0x000000010680fa00, batch_all=llama_batch @ 0x000000016daa2c60) at llama.cpp:14679:9
    frame #9: 0x0000000102403a9c llama-bench`llama_decode(ctx=0x000000010680fa00, batch=llama_batch @ 0x000000016daa2d08) at llama.cpp:18499:21
    frame #10: 0x0000000102712eac llama-bench`test_prompt(ctx=0x000000010680fa00, n_prompt=32, n_past=0, n_batch=2048, n_threads=12) at llama-bench.cpp:1319:9
    frame #11: 0x000000010270f0b8 llama-bench`main(argc=9, argv=0x000000016daa3940) at llama-bench.cpp:1454:13
    frame #12: 0x000000018fbae0e0 dyld`start + 2360
  thread #2
    frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
    frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x000000013681fe20) at ggml.c:19133:5
    frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x000000013681fe20) at ggml.c:19192:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #3
    frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
    frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x0000000136820040) at ggml.c:19133:5
    frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820040) at ggml.c:19192:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #4
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136820260) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820260) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #5
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136820480) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820480) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #6
    frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
    frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x00000001368206a0) at ggml.c:19133:5
    frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001368206a0) at ggml.c:19192:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #7
    frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
    frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x00000001368208c0) at ggml.c:19133:5
    frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001368208c0) at ggml.c:19192:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #8
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136820ae0) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820ae0) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #9
    frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
    frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x0000000136820d00) at ggml.c:19133:5
    frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820d00) at ggml.c:19192:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #10
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136820f20) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820f20) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #11
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136821140) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136821140) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #12
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136821360) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136821360) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #13
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136821580) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136821580) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #14
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x00000001368217a0) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001368217a0) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #15
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x00000001368219c0) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001368219c0) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #16
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136821be0) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136821be0) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #17
    frame #0: 0x000000018fef7ea4 libsystem_kernel.dylib`__workq_kernreturn + 8

Built with LLAMA_DEBUG=1 GGML_NO_METAL=1 make llama-bench && ./llama-bench -m models/llama-2-7b/ggml-model-Q4_0.gguf -n 0 -r 1 -p 32.

slaren avatar Jul 31 '24 14:07 slaren

I tested this again on the M3 Max, but it still seems to deadlock. [...]

Bummer... Thanks for the details! Looks like we've got some trouble in the ACCELERATE path. I'll fix it ASAP.

fmz avatar Jul 31 '24 15:07 fmz

I tested this again on the M3 Max, but it still seems to deadlock. [...]

@slaren It turns out there was a bit of a corner case: with a graph that has only one node, ggml_barrier and wait_for_work deadlock on each other. Added a check to handle that specific case.
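Roughly, the guard amounts to something like this (a conceptual sketch, not the actual patch; the helper names are hypothetical):

```cpp
// Conceptual sketch, not the actual patch: if there is nothing to hand off to the
// secondary threads, compute on the calling thread and skip the pool entirely, so
// the main thread cannot sit in ggml_barrier while the workers sit in
// ggml_graph_compute_check_for_work waiting for a kick that never comes.
if (cgraph->n_nodes <= 1 || cplan->n_threads <= 1) {
    run_graph_on_calling_thread(cgraph, cplan);   // hypothetical helper
} else {
    kick_secondary_threads(threadpool);           // hypothetical helper
    run_graph_as_worker(threadpool, /*ith=*/0);   // main thread joins the barriers
}
```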

fmz avatar Jul 31 '24 16:07 fmz

Thanks, I was able to run it now. Unfortunately the results are still not very good on my system. Under WSL this threadpool is much slower than OpenMP. A threadpool would be more important on macOS, since OpenMP is not available there, but for me it is also slower on the M3 Max.

M3 Max: GGML_NO_METAL=1 scripts/compare-commits.sh master threadpool -m models/llama-2-7b/ggml-model-Q4_0.gguf

| CPU | Model | Model Size [GiB] | Test | t/s master | t/s threadpool | Speedup |
|:----|:------|-----------------:|:-----|-----------:|---------------:|--------:|
| M3 Max | llama 7B Q4_0 | 3.56 | pp512 | 151.21 | 149.88 | 0.99 |
| M3 Max | llama 7B Q4_0 | 3.56 | tg128 | 30.06 | 26.09 | 0.87 |

13900k + 3090Ti: OpenMP (GGML_CUDA=1 make llama-bench && ./llama-bench -nkvo 0,1)

| model | size | params | backend | ngl | nkvo | test | t/s |
|:------|-----:|-------:|:--------|----:|-----:|:-----|----------------:|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | pp512 | 5699.53 ± 19.73 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | tg128 | 150.75 ± 1.23 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp512 | 651.63 ± 32.31 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | tg128 | 63.85 ± 3.22 |

Threadpool (GGML_CUDA=1 GGML_NO_OPENMP=1 make llama-bench && ./llama-bench -nkvo 0,1)

| model | size | params | backend | ngl | nkvo | test | t/s |
|:------|-----:|-------:|:--------|----:|-----:|:-----|-----------------:|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | pp512 | 5453.33 ± 216.72 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | tg128 | 144.45 ± 0.98 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp512 | 566.43 ± 27.64 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | tg128 | 29.54 ± 0.99 |

build: bebe99c2 (3500)

slaren avatar Aug 01 '24 17:08 slaren

Thanks, I was able to run it now. Unfortunately the results are still not very good on my system. [...]

Ooof... That is quite a bit slower. I'll try to replicate this locally.

fmz avatar Aug 01 '24 17:08 fmz

@fmz @slaren I fixed one of the issues that was causing regressions. We were setting the default number of threads in the threadpool using std::thread::hardware_concurrency(). I updated that to use cpu_get_num_math(), which excludes E-cores and Hyper-Threading siblings. This is what was causing regressions with the default command-line args, where the number of threads is not explicitly specified: we were starting 12 threads on M2 Max, where only 8 cores are really usable, and likewise on AMD EPYC (using HT siblings) and Intel 13th/14th Gen (using E-cores).
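In code terms the change is roughly this (a sketch of the idea, not the exact diff):

```cpp
// Default thread count: count only "math" (performance) cores instead of all
// logical CPUs, so E-cores and hyper-threading siblings are excluded by default.
int32_t n_threads = cpu_get_num_math();
// was (roughly): int32_t n_threads = std::thread::hardware_concurrency();
```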

I'm also working on another fix which is specific to llama-bench. Currently (in the threadpool branch) we start a single threadpool with the maximum number of threads and reuse it for each test. Suppose a test uses 4 threads but we started 12 (on M2 Max or Snapdragon X-Elite): this is suboptimal because the spinning threads interfere with core boosting and scheduling. It's better to start a fresh threadpool for each test.

max-krasnyansky avatar Aug 03 '24 23:08 max-krasnyansky

@fmz @slaren llama-bench has been updated as I described above.

Here are the numbers from M2 Max. I'll share numbers for an AMD EPYC server, Snapdragon X-Elite and Gen-3 a bit later.

CC=clang CXX=clang++ CFLAGS="-march=armv8.7-a" CXXFLAGS="-march=armv8.7-a" GGML_NO_METAL=1 GGML_NO_ACCELERATE=1 \
    ./scripts/compare-commits.sh master threadpool -m ../gguf/llama-v3.1.q4_0_4_8.gguf -ngl 0 -t 4,6,8
...
+ ./scripts/compare-llama-bench.py -b master -c threadpool
| CPU   | Model             |   Threads | Test   |   t/s master |   t/s threadpool |   Speedup |
|:------|:------------------|----------:|:-------|-------------:|-----------------:|----------:|
|       | llama 8B Q4_0_4_8 |         4 | pp512  |        64.43 |            64.52 |      1.00 |
|       | llama 8B Q4_0_4_8 |         4 | tg128  |        22.53 |            24.36 |      1.08 |
|       | llama 8B Q4_0_4_8 |         6 | pp512  |        89.79 |            91.04 |      1.01 |
|       | llama 8B Q4_0_4_8 |         6 | tg128  |        24.73 |            26.21 |      1.06 |
|       | llama 8B Q4_0_4_8 |         8 | pp512  |       117.14 |           118.67 |      1.01 |
|       | llama 8B Q4_0_4_8 |         8 | tg128  |        26.11 |            26.37 |      1.01 |

max-krasnyansky avatar Aug 04 '24 01:08 max-krasnyansky

The performance looks better now; with nkvo it is comparable to OpenMP, which is very good. There is still a performance drop when using the BLAS backend (this includes the default build on macOS, which uses Accelerate). I suspect that this is because the threads are spinning in ggml_graph_compute_check_for_work while the BLAS backend is running. This will also cause the threads to spin while the GPU backend is running when partially offloading, which would be a regression. Rather than requiring the user to disable polling manually, I suggest implementing some kind of backoff and yielding the threads after spinning for a while. The BLAS backend (ggml-blas.cpp) may also benefit from using the threadpool, since it launches several threads to dequantize the weights, and it could also automatically pause the pool during the call to the BLAS library.
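For reference, a minimal sketch (my own illustration, not code from this PR) of the spin-then-yield-then-wait pattern being suggested:

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>

// Spin for a bounded number of rounds, yielding periodically, then fall back to a
// condition-variable wait so idle workers stop burning CPU while another backend
// (BLAS, Metal/CUDA, ...) is doing the work.
void wait_for_work(std::atomic<bool> & has_work, std::mutex & m, std::condition_variable & cv) {
    for (int i = 0; i < 100000; i++) {
        if (has_work.load(std::memory_order_relaxed)) return;
        if (i % 1024 == 0) std::this_thread::yield();  // back off a little while spinning
    }
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [&] { return has_work.load(std::memory_order_relaxed); });
}
```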

Results

GGML_CUDA=1 make llama-bench > /dev/null && ./llama-bench -nkvo 0,1

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes

| model | size | params | backend | ngl | nkvo | test | t/s |
|:------|-----:|-------:|:--------|----:|-----:|:-----|----------------:|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | pp512 | 5689.32 ± 13.35 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | tg128 | 154.53 ± 1.04 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp512 | 643.28 ± 31.69 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | tg128 | 64.27 ± 2.21 |

build: 267bf570 (3554)

GGML_CUDA=1 GGML_NO_OPENMP=1 make llama-bench > /dev/null && ./llama-bench -nkvo 0,1

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes

| model | size | params | backend | ngl | nkvo | test | t/s |
|:------|-----:|-------:|:--------|----:|-----:|:-----|----------------:|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | pp512 | 5674.51 ± 37.77 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | tg128 | 153.30 ± 0.48 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp512 | 646.42 ± 32.41 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | tg128 | 62.98 ± 2.94 |

build: 267bf570 (3554)

GGML_BLIS=1 make llama-bench > /dev/null && ./llama-bench -p 128 -n 32

| model | size | params | backend | threads | test | t/s |
|:------|-----:|-------:|:--------|--------:|:-----|--------------:|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | BLAS | 16 | pp128 | 47.55 ± 0.17 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | BLAS | 16 | tg32 | 20.79 ± 0.10 |

build: 267bf570 (3554)

GGML_BLIS=1 GGML_NO_OPENMP=1 make llama-bench > /dev/null && ./llama-bench -p 128 -n 32

| model | size | params | backend | threads | test | t/s |
|:------|-----:|-------:|:--------|--------:|:-----|--------------:|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | BLAS | 16 | pp128 | 33.47 ± 0.48 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | BLAS | 16 | tg32 | 20.58 ± 0.07 |

build: 267bf570 (3554)

| CPU | Model | Threads | Test | t/s master | t/s threadpool | Speedup |
|:----|:------|--------:|:-----|-----------:|---------------:|--------:|
| M3 Max | llama 7B all F32 | 4 | pp512 | 150.03 | 134.77 | 0.90 |
| M3 Max | llama 7B all F32 | 4 | tg128 | 4.76 | 4.20 | 0.88 |
| M3 Max | llama 7B all F32 | 8 | pp512 | 155.66 | 115.40 | 0.74 |
| M3 Max | llama 7B all F32 | 8 | tg128 | 4.76 | 4.35 | 0.91 |
| M3 Max | llama 7B all F32 | 12 | pp512 | 156.19 | 94.43 | 0.60 |
| M3 Max | llama 7B all F32 | 12 | tg128 | 4.66 | 4.33 | 0.93 |
| M3 Max | llama 7B Q4_0 | 4 | pp512 | 142.43 | 144.89 | 1.02 |
| M3 Max | llama 7B Q4_0 | 4 | tg128 | 21.04 | 20.74 | 0.99 |
| M3 Max | llama 7B Q4_0 | 8 | pp512 | 150.08 | 142.22 | 0.95 |
| M3 Max | llama 7B Q4_0 | 8 | tg128 | 28.22 | 28.14 | 1.00 |
| M3 Max | llama 7B Q4_0 | 12 | pp512 | 150.55 | 120.62 | 0.80 |
| M3 Max | llama 7B Q4_0 | 12 | tg128 | 30.10 | 30.26 | 1.01 |
| M3 Max | stories260K | 4 | pp512 | 52491.62 | 65492.68 | 1.25 |
| M3 Max | stories260K | 4 | tg128 | 8417.80 | 12262.68 | 1.46 |
| M3 Max | stories260K | 8 | pp512 | 59893.07 | 94300.47 | 1.57 |
| M3 Max | stories260K | 8 | tg128 | 3746.70 | 5639.87 | 1.51 |
| M3 Max | stories260K | 12 | pp512 | 53756.90 | 115958.90 | 2.16 |
| M3 Max | stories260K | 12 | tg128 | 2507.28 | 4333.34 | 1.73 |

slaren avatar Aug 08 '24 22:08 slaren

The performance looks better now, with nkvo it is comparable to OpenMP, which is very good. [...]

@slaren

Awesome! Thanks for checking out the latest. We've been doing lots of profiling and tuning; every time I'm about to send an updated perf report on Snapdragons and M2, I find yet another thing to improve :) In my testing we're doing really well with the CPU backend (especially on the ARM64-based systems); with other backends, as you pointed out, the spinning threads get in the way at times and cause regressions. I'll try your suggestions.

BTW, we might just flip the default back to non-polling. Technically, polling is mainly useful for llama-bench, to match OpenMP behavior/numbers in that case. When I looked at the original profiles, I saw that the threadpool was doing a lot more context switches than OpenMP during the token-gen test. Polling removes those context switches, and we get even better numbers now. It might make sense to make that a bit of a special case (i.e. default to polling for the CPU-backend bench, otherwise default to non-polling) or to use some hybrid approach as you suggested.

max-krasnyansky avatar Aug 09 '24 00:08 max-krasnyansky

@slaren @fmz

I managed to further improve the threadpool signaling (reducing the number of wake-ups, etc.) and also introduced a hybrid polling mode, which is now the default. --poll now sets the polling level, which is basically how aggressively we poll: 0 means no polling, 1 means around 128K polling rounds followed by cond.wait, 2 means 2x128K rounds, and so on. The default is 50, which seems to work well on the machines I have here (see the report).
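In other words, the polling level just scales the spin budget before the fallback to cond.wait, roughly like this (illustrative only, not the exact implementation):

```cpp
// Illustration of the --poll semantics described above: level 0 means no spinning
// at all; each level adds roughly 128K polling rounds before the worker gives up
// and waits on the condition variable.
long spin_budget(int poll_level) {
    return (long) poll_level * 128 * 1024;   // default level 50 -> ~6.5M rounds
}
```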

The regression with the Metal backend should be fixed now (see the report below).

The BLIS backend will need some further tuning, though I wonder how useful it is given how much slower it is than the plain CPU backend with the latest CPU features.

I included the latest llama-bench and a few simple llama-cli results for M2 Max, AMD EPYC 7543, Snapdragon X-Elite and Snapdragon Gen 3 with Llama v3.1 8B and a smaller Llama-based 314M model (generated using https://arxiv.org/pdf/2403.00858).

Results

M2 Max (default build)

make clean; make llama-bench

~/src/llama.cpp-master$ ./llama-bench -m ../gguf/llama-v3.1.q4_0.gguf -p 128 -n 32

model size params backend ngl test t/s
llama 8B Q4_0 4.33 GiB 8.03 B Metal 99 pp128 469.51 ± 1.07
llama 8B Q4_0 4.33 GiB 8.03 B Metal 99 tg32 58.92 ± 0.22

build: 3071c0a5 (3557)

~/src/llama.cpp-threadpool$ ./llama-bench -m ../gguf/llama-v3.1.q4_0.gguf -p 128 -n 32

model size params backend ngl test t/s
llama 8B Q4_0 4.33 GiB 8.03 B Metal 99 pp128 470.04 ± 0.25
llama 8B Q4_0 4.33 GiB 8.03 B Metal 99 tg32 58.78 ± 0.18

build: 323181f2 (3573)

M2 Max (llvm build to enable MATMUL_INT8)

CC=clang CXX=clang++ CFLAGS="-march=armv8.7-a" CXXFLAGS="-march=armv8.7-a" GGML_NO_METAL=1 GGML_NO_ACCELERATE=1 make -j32 llama-bench llama-cli

  • Note: Q4_0_4_X is broken with ACCELERATE, but BLIS and ACCELERATE are much slower anyway

~/src/llama.cpp-master$ ./llama-bench -m ../gguf/llama-v3.1.q4_0_4_8.gguf -t 4,6,8

model size params backend threads test t/s
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 4 pp512 63.69 ± 0.37
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 4 tg128 22.51 ± 0.11
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 6 pp512 90.60 ± 1.19
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 6 tg128 24.73 ± 0.06
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 8 pp512 112.72 ± 2.76
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 8 tg128 25.21 ± 0.86

build: 3071c0a5 (3557)

~/src/llama.cpp-threadpool$ ./llama-bench -m ../gguf/llama-v3.1.q4_0_4_8.gguf -t 4,6,8

model size params backend threads test t/s
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 4 pp512 65.59 ± 0.70
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 4 tg128 24.80 ± 0.13
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 6 pp512 92.72 ± 1.75
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 6 tg128 26.12 ± 0.12
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 8 pp512 116.72 ± 1.33
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 8 tg128 26.34 ± 0.05

build: 323181f2 (3573)

~/src/llama.cpp-master$ ./llama-bench -m ../gguf/llama-v3-314m.q4_0_4_8.gguf -t 4,6,8

model size params backend threads test t/s
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 4 pp512 6216.28 ± 206.77
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 4 tg128 405.77 ± 0.85
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 6 pp512 9223.23 ± 105.64
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 6 tg128 544.30 ± 0.48
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 8 pp512 12073.97 ± 76.67
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 8 tg128 616.44 ± 1.81

build: 3071c0a5 (3557)

~/src/llama.cpp-threadpool$ ./llama-bench -m ../gguf/llama-v3-314m.q4_0_4_8.gguf -t 4,6,8

model size params backend threads test t/s
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 4 pp512 6680.06 ± 70.44
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 4 tg128 503.61 ± 1.44
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 6 pp512 9431.40 ± 13.32
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 6 tg128 638.16 ± 4.39
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 8 pp512 12292.80 ± 40.62
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 8 tg128 674.69 ± 12.01

build: 323181f2 (3573)

M2 Max (BLIS backend)

~/src/llama.cpp-master$ ./llama-bench -m ../gguf/llama-v3.1.q4_0.gguf -p 64 -n 16

model size params backend threads test t/s
llama 8B Q4_0 4.33 GiB 8.03 B BLAS 8 pp64 48.69 ± 0.08
llama 8B Q4_0 4.33 GiB 8.03 B BLAS 8 tg16 24.68 ± 0.10

build: 3071c0a5 (3557)

~/src/llama.cpp-threadpool$ ./llama-bench -m ../gguf/llama-v3.1.q4_0.gguf -p 64 -n 16

model size params backend threads test t/s
llama 8B Q4_0 4.33 GiB 8.03 B BLAS 8 pp64 35.90 ± 0.97
llama 8B Q4_0 4.33 GiB 8.03 B BLAS 8 tg16 24.68 ± 0.32

build: 323181f2 (3573)

~/src/llama.cpp-threadpool$ ./llama-bench -m ../gguf/llama-v3.1.q4_0.gguf -p 64 -n 16 --poll 0

model size params backend threads test t/s
llama 8B Q4_0 4.33 GiB 8.03 B BLAS 8 pp64 46.44 ± 2.06
llama 8B Q4_0 4.33 GiB 8.03 B BLAS 8 tg16 24.81 ± 0.03

build: 323181f2 (3573)

AMD EPYC (default build)

  • Note: Q4_K has the best perf on the EPYC

make -j32 llama-bench

llama.cpp-master$ ./llama-bench -m ../gguf/llama-v3.1-8B.q4_k_s-pure.gguf -t 16,32,64 -p 64 -n 16

model size params backend threads test t/s
llama 8B Q4_K - Small 4.21 GiB 8.03 B CPU 16 pp64 63.11 ± 0.14
llama 8B Q4_K - Small 4.21 GiB 8.03 B CPU 16 tg16 18.82 ± 0.88
llama 8B Q4_K - Small 4.21 GiB 8.03 B CPU 32 pp64 116.73 ± 3.24
llama 8B Q4_K - Small 4.21 GiB 8.03 B CPU 32 tg16 22.86 ± 0.71
llama 8B Q4_K - Small 4.21 GiB 8.03 B CPU 64 pp64 141.81 ± 0.55
llama 8B Q4_K - Small 4.21 GiB 8.03 B CPU 64 tg16 21.13 ± 0.34

build: 3071c0a5 (3557)

GGML_NO_OPENMP=1 make -j32 llama-bench

llama.cpp-threadpool$ ./llama-bench -m ../gguf/llama-v3.1-8B.q4_k_s-pure.gguf -t 16,32,64 -p 64 -n 16

model size params backend threads test t/s
llama 8B Q4_K - Small 4.21 GiB 8.03 B CPU 16 pp64 62.82 ± 0.96
llama 8B Q4_K - Small 4.21 GiB 8.03 B CPU 16 tg16 16.09 ± 0.28
llama 8B Q4_K - Small 4.21 GiB 8.03 B CPU 32 pp64 110.30 ± 1.07
llama 8B Q4_K - Small 4.21 GiB 8.03 B CPU 32 tg16 19.15 ± 0.82
llama 8B Q4_K - Small 4.21 GiB 8.03 B CPU 64 pp64 122.25 ± 5.91
llama 8B Q4_K - Small 4.21 GiB 8.03 B CPU 64 tg16 19.21 ± 0.47

build: 323181f2 (3573)

A real use case does much better.

llama.cpp-master$ ./llama-cli -m ../gguf/llama-v3.1-8B.q4_k_s-pure.gguf -t 32 --seed 42 -p 'what is the most popular cookie in the world? (prease be brief)' -n 40
...
what is the most popular cookie in the world? (prease be brief) It is chocolate chip cookie. It is a classic favorite and one of the most beloved cookies.
What is the most popular cookie in the world?
According to various sources, including Google Trends and online baking
llama_print_timings:        load time =    4399.32 ms
llama_print_timings:      sample time =       3.48 ms /    40 runs   (    0.09 ms per token, 11490.95 tokens per second)
llama_print_timings: prompt eval time =     211.88 ms /    16 tokens (   13.24 ms per token,    75.51 tokens per second)
llama_print_timings:        eval time =    2267.88 ms /    39 runs   (   58.15 ms per token,    17.20 tokens per second)
llama_print_timings:       total time =    2489.99 ms /    55 tokens
llama.cpp-threadpool$ ./llama-cli -m ../gguf/llama-v3.1-8B.q4_k_s-pure.gguf -t 32 --seed 42 -p 'what is the most popular cookie in the world? (prease be brief)' -n 40
...
what is the most popular cookie in the world? (prease be brief) It is chocolate chip cookie. It is a classic favorite and one of the most beloved cookies.
What is the most popular cookie in the world?
According to various sources, including Google Trends and online baking
llama_print_timings:        load time =    4250.79 ms
llama_print_timings:      sample time =       2.92 ms /    40 runs   (    0.07 ms per token, 13708.02 tokens per second)
llama_print_timings: prompt eval time =     203.73 ms /    16 tokens (   12.73 ms per token,    78.54 tokens per second)
llama_print_timings:        eval time =    2072.78 ms /    39 runs   (   53.15 ms per token,    18.82 tokens per second)
llama_print_timings:       total time =    2285.96 ms /    55 tokens

AMD EPYC (BLIS backend)

llama.cpp-master$ ./llama-bench -m ../gguf/llama-v3.1-8B.q4_k_s-pure.gguf -t 32 -p 64 -n 16

model size params backend threads test t/s
llama 8B Q4_K - Small 4.21 GiB 8.03 B BLAS 32 pp64 29.85 ± 0.25
llama 8B Q4_K - Small 4.21 GiB 8.03 B BLAS 32 tg16 19.71 ± 0.63

build: 3071c0a5 (3557)

llama.cpp-threadpool$ ./llama-bench -m ../gguf/llama-v3.1-8B.q4_k_s-pure.gguf -t 32 -p 64 -n 16

model size params backend threads test t/s
llama 8B Q4_K - Small 4.21 GiB 8.03 B BLAS 32 pp64 21.50 ± 1.53
llama 8B Q4_K - Small 4.21 GiB 8.03 B BLAS 32 tg16 17.49 ± 0.17

Snapdragon X-Elite (default llvm-windows build)

llama.cpp-master ./build-arm64-windows-llvm-release/bin/llama-bench.exe -m ../gguf/llama-v3.1.q4_0_4_8.gguf -t 4,6,8,10,12

model size params backend threads test t/s
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 4 pp512 70.06 ± 0.06
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 4 tg128 21.20 ± 0.11
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 6 pp512 100.08 ± 1.78
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 6 tg128 21.84 ± 0.15
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 8 pp512 125.61 ± 1.52
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 8 tg128 19.30 ± 3.68
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 10 pp512 144.20 ± 5.02
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 10 tg128 22.60 ± 0.21
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 12 pp512 175.83 ± 5.59
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 12 tg128 9.83 ± 7.34

build: 3071c0a5 (3557)

~/src/llama.cpp-threadpool ./build-arm64-windows-llvm-release/bin/llama-bench.exe -m ../gguf/llama-v3.1.q4_0_4_8.gguf -t 4,6,8,10,12

model size params backend threads test t/s
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 4 pp512 70.38 ± 0.22
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 4 tg128 21.71 ± 0.17
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 6 pp512 100.66 ± 1.74
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 6 tg128 23.14 ± 0.20
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 8 pp512 126.16 ± 2.03
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 8 tg128 22.91 ± 0.09
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 10 pp512 151.48 ± 3.02
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 10 tg128 23.61 ± 0.22
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 12 pp512 185.21 ± 2.95
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 12 tg128 21.82 ± 1.44

build: 323181f2 (3573)

llama.cpp-master ./build-arm64-windows-llvm-release/bin/llama-bench.exe -m ../gguf/llama-v3-314m.q4_0_4_8.gguf -t 4,6,8,10,12

model size params backend threads test t/s
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 4 pp512 5352.91 ± 17.93
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 4 tg128 345.11 ± 1.41
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 6 pp512 7660.97 ± 77.94
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 6 tg128 405.47 ± 5.90
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 8 pp512 9699.85 ± 62.08
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 8 tg128 438.34 ± 5.46
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 10 pp512 11651.56 ± 158.28
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 10 tg128 436.98 ± 4.93
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 12 pp512 12893.53 ± 943.14
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 12 tg128 408.23 ± 7.20

build: 3071c0a5 (3557)

~/src/llama.cpp-threadpool ./build-arm64-windows-llvm-release/bin/llama-bench.exe -m ../gguf/llama-v3-314m.q4_0_4_8.gguf -t 4,6,8,10,12

model size params backend threads test t/s
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 4 pp512 5378.80 ± 8.04
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 4 tg128 360.03 ± 1.04
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 6 pp512 7747.20 ± 62.34
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 6 tg128 530.54 ± 4.81
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 8 pp512 9895.07 ± 49.93
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 8 tg128 626.93 ± 4.11
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 10 pp512 11836.86 ± 111.19
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 10 tg128 667.55 ± 6.24
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 12 pp512 13231.19 ± 524.42
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 12 tg128 653.97 ± 21.23

build: 323181f2 (3573)

llama.cpp-master

./build-arm64-windows-llvm-release/bin/llama-cli.exe --no-mmap -m ../gguf/llama-v3.1.q4_0_4_8.gguf -tb 10 -t 6 --ctx-size 2048 --seed 42 -p '<|begin_of_text|> what is the most popular cookie in the world? (please be brief)' -n 64
...
what is the most popular cookie in the world? (please be brief)
The chocolate chip cookie is the most popular cookie in the world. (according to various sources)
Note: This answer is brief because the question asks for a brief response.
Let me know if you'd like me to elaborate!
(If you'd like more information, I'd be happy to provide some
llama_print_timings:        load time =    1164.20 ms
llama_print_timings:      sample time =       3.58 ms /    64 runs   (    0.06 ms per token, 17892.09 tokens per second)
llama_print_timings: prompt eval time =      90.19 ms /    16 tokens (    5.64 ms per token,   177.41 tokens per second)
llama_print_timings:        eval time =    3038.12 ms /    63 runs   (   48.22 ms per token,    20.74 tokens per second)
llama_print_timings:       total time =    3141.14 ms /    79 tokens

llama.cpp-threadpool

./build-arm64-windows-llvm-release/bin/llama-cli.exe --no-mmap -m ../gguf/llama-v3.1.q4_0_4_8.gguf -tb 10 -t 6 --ctx-size 2048 --seed 42 -p '<|begin_of_text|> what is the most popular cookie in the world? (please be brief)' -n 64
...
 what is the most popular cookie in the world? (please be brief)
The chocolate chip cookie is the most popular cookie in the world. (according to various sources)
Note: This answer is brief because the question asks for a brief response.
Let me know if you'd like me to elaborate!
(If you'd like more information, I'd be happy to provide some
llama_print_timings:        load time =    1168.97 ms
llama_print_timings:      sample time =       3.74 ms /    64 runs   (    0.06 ms per token, 17103.15 tokens per second)
llama_print_timings: prompt eval time =      87.27 ms /    16 tokens (    5.45 ms per token,   183.33 tokens per second)
llama_print_timings:        eval time =    2940.82 ms /    63 runs   (   46.68 ms per token,    21.42 tokens per second)
llama_print_timings:       total time =    3042.01 ms /    79 tokens

Snapdragon Gen 3 (Galaxy S24 Ultra)

Default Android NDK build using the following CMake preset

    {
        "name": "arm64-android",
        "cacheVariables": {     
            "ANDROID_ABI":      "arm64-v8a",
            "ANDROID_PLATFORM": "android-31",
            "CMAKE_TOOLCHAIN_FILE": "$env{NDK}/build/cmake/android.toolchain.cmake",
            "CMAKE_C_FLAGS":   "-march=armv8.7a -fvectorize -ffp-model=fast -fno-finite-math-only -flto -D_GNU_SOURCE",
            "CMAKE_CXX_FLAGS": "-march=armv8.7a -fvectorize -ffp-model=fast -fno-finite-math-only -flto -D_GNU_SOURCE",
            "CMAKE_C_FLAGS_RELEASE":   "-O3 -DNDEBUG",
            "CMAKE_CXX_FLAGS_RELEASE": "-O3 -DNDEBUG"
        }
    }

threadpool branch is built with -D GGML_OPENMP=OFF

adb shell "cd /data/local/tmp/lmcp; ./run-bench.sh master llama-v3.1.q4_0_4_8.gguf -t 4,6 -p 32 -n 16"
export 'LD_LIBRARY_PATH=/data/local/tmp/lmcp/master'
taskset fc ./master/llama-bench --mmap 0 --n-gpu-layers 0 -m llama-v3.1.q4_0_4_8.gguf -t 4,6 -p 32 -n 16
model size params backend threads mmap test t/s
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 4 0 pp32 40.43 ± 1.08
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 4 0 tg16 10.29 ± 0.13
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 6 0 pp32 48.27 ± 0.24
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 6 0 tg16 10.23 ± 0.13
adb shell "cd /data/local/tmp/lmcp; ./run-bench.sh threadpool llama-v3.1.q4_0_4_8.gguf -t 4,6 -p 32 -n 16"`
export 'LD_LIBRARY_PATH=/data/local/tmp/lmcp/threadpool'
taskset fc ./threadpool/llama-bench --mmap 0 --n-gpu-layers 0 -m llama-v3.1.q4_0_4_8.gguf -t 4,6 -p 32 -n 16
model size params backend threads mmap test t/s
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 4 0 pp32 40.53 ± 0.77
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 4 0 tg16 10.59 ± 0.16
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 6 0 pp32 48.69 ± 0.31
llama 8B Q4_0_4_8 4.33 GiB 8.03 B CPU 6 0 tg16 10.40 ± 0.13
adb shell "cd /data/local/tmp/lmcp; ./run-bench.sh master llama-v3-314m.q4_0_4_8.gguf -t 4,6 -p 32 -n 16"`
export 'LD_LIBRARY_PATH=/data/local/tmp/lmcp/master'
taskset fc ./master/llama-bench --mmap 0 --n-gpu-layers 0 -m llama-v3-314m.q4_0_4_8.gguf -t 4,6 -p 32 -n 16
model size params backend threads mmap test t/s
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 4 0 pp32 4006.52 ± 72.66
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 4 0 tg16 303.58 ± 5.96
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 6 0 pp32 5065.32 ± 82.08
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 6 0 tg16 302.77 ± 20.62
adb shell "cd /data/local/tmp/lmcp; ./run-bench.sh threadpool llama-v3-314m.q4_0_4_8.gguf -t 4,6 -p 32 -n 16"`
export 'LD_LIBRARY_PATH=/data/local/tmp/lmcp/threadpool'
taskset fc ./threadpool/llama-bench --mmap 0 --n-gpu-layers 0 -m llama-v3-314m.q4_0_4_8.gguf -t 4,6 -p 32 -n 16
model size params backend threads mmap test t/s
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 4 0 pp32 4097.12 ± 22.80
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 4 0 tg16 312.81 ± 1.88
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 6 0 pp32 5127.89 ± 35.28
llama ?B Q4_0_4_8 200.79 MiB 314.06 M CPU 6 0 tg16 337.77 ± 6.57

max-krasnyansky avatar Aug 12 '24 06:08 max-krasnyansky

Edit: I totally forgot that GGML_OPENMP is disabled only for cmake builds... so the numbers below are OpenMP only. (Interesting that there is any change at all...)

@slaren @max-krasnyansky latest CUDA numbers:

Stories260K: $ ./scripts/compare-llama-bench.py -b master -c threadpool

GPU Model NKVO Test t/s master t/s threadpool Speedup
RTX 3060 Laptop GPU llama ?B all F32 (guessed) No pp512 199949.37 199425.82 1.00
RTX 3060 Laptop GPU llama ?B all F32 (guessed) No tg128 2472.27 2585.31 1.05
RTX 3060 Laptop GPU llama ?B all F32 (guessed) Yes pp512 12503.69 12627.24 1.01
RTX 3060 Laptop GPU llama ?B all F32 (guessed) Yes tg128 1632.84 1642.70 1.01

llamav2 7B:

GPU Model NKVO Test t/s master t/s threadpool Speedup
RTX 3060 Laptop GPU llama 7B Q4_0 No pp512 1654.38 1658.14 1.00
RTX 3060 Laptop GPU llama 7B Q4_0 No tg128 66.71 66.82 1.00
RTX 3060 Laptop GPU llama 7B Q4_0 Yes pp512 288.97 295.97 1.02
RTX 3060 Laptop GPU llama 7B Q4_0 Yes tg128 54.52 54.90 1.01

fmz avatar Aug 12 '24 21:08 fmz

This is with OpenMP disabled on the threadpool branch:

$ LLAMA_CUDA=1 ./scripts/compare-commits.sh master threadpool -nkvo 0,1 -m models/7B/llama7b.gguf

GPU Model NKVO Test t/s master t/s threadpool Speedup
RTX 3060 Laptop GPU llama 7B Q4_0 No pp512 1657.23 1659.78 1.00
RTX 3060 Laptop GPU llama 7B Q4_0 No tg128 66.77 66.24 0.99
RTX 3060 Laptop GPU llama 7B Q4_0 Yes pp512 302.17 301.35 1.00
RTX 3060 Laptop GPU llama 7B Q4_0 Yes tg128 55.02 54.87 1.00

fmz avatar Aug 12 '24 23:08 fmz

Can confirm it's slightly worse on stories260K:

GPU Model NKVO Test t/s master t/s threadpool Speedup
RTX 3060 Laptop GPU llama ?B all F32 (guessed) No pp512 199562.93 193086.43 0.97
RTX 3060 Laptop GPU llama ?B all F32 (guessed) No tg128 2517.85 2399.51 0.95
RTX 3060 Laptop GPU llama ?B all F32 (guessed) Yes pp512 12702.00 12819.76 1.01
RTX 3060 Laptop GPU llama ?B all F32 (guessed) Yes tg128 1646.64 1628.08 0.99

fmz avatar Aug 13 '24 00:08 fmz

@slaren

@fmz and I worked on further improvements (removing special cases, reducing branches, etc) and at this point it seems like it should be good to merge.

I believe the BLAS/BLIS backend might need further work. I took a look at it and realized that ggml-blas.cpp wants a generic threadpool that executes arbitrary functions. The threadpool we've added so far is designed specifically for graph_compute. It's of course possible to update it and make it more generic, assuming there is interest in updating the BLAS/BLIS backend. From my testing it seems to be generally much slower, so I'm not sure how much we want to invest in it. Perhaps we can just add a check in the make/cmake builds that the BLAS backend requires OpenMP for now?

Perf numbers on the Snapdragons and the M2 are a bit better but overall similar to what I shared above; the perf profiles are looking cleaner though, in terms of things like total branches, missed branches, etc.

Here is a fresh run from a Ryzen 9 3950X + RTX 3080 on Ubuntu 22.04, testing the nkvo scenario that you had regressions with before.

GGML_CUDA=1 GGML_NO_OPENMP=1 make -j16 llama-cli llama-bench
llama.cpp-master$ nice -20 ./llama-bench -m ../gguf/llama-v3.1.q4_0.gguf -ngl 0,99 -nkvo 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
model size params backend ngl nkvo test t/s
llama 8B Q4_0 4.33 GiB 8.03 B CUDA 0 0 pp512 1090.70 ± 22.64
llama 8B Q4_0 4.33 GiB 8.03 B CUDA 0 0 tg128 10.01 ± 0.01
llama 8B Q4_0 4.33 GiB 8.03 B CUDA 0 1 pp512 527.11 ± 0.31
llama 8B Q4_0 4.33 GiB 8.03 B CUDA 0 1 tg128 10.03 ± 0.00
llama 8B Q4_0 4.33 GiB 8.03 B CUDA 99 0 pp512 4384.25 ± 13.19
llama 8B Q4_0 4.33 GiB 8.03 B CUDA 99 0 tg128 110.16 ± 0.12
llama 8B Q4_0 4.33 GiB 8.03 B CUDA 99 1 pp512 798.64 ± 6.35
llama 8B Q4_0 4.33 GiB 8.03 B CUDA 99 1 tg128 95.40 ± 0.15

build: 06943a69 (3581)

GGML_CUDA=1 GGML_NO_OPENMP=1 make -j16 llama-cli llama-bench
llama.cpp-threadpool$ nice -20 ./llama-bench -m ../gguf/llama-v3.1.q4_0.gguf -ngl 0,99 -nkvo 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
model size params backend ngl nkvo test t/s
llama 8B Q4_0 4.33 GiB 8.03 B CUDA 0 0 pp512 1114.06 ± 0.45
llama 8B Q4_0 4.33 GiB 8.03 B CUDA 0 0 tg128 9.97 ± 0.02
llama 8B Q4_0 4.33 GiB 8.03 B CUDA 0 1 pp512 534.33 ± 0.28
llama 8B Q4_0 4.33 GiB 8.03 B CUDA 0 1 tg128 9.99 ± 0.01
llama 8B Q4_0 4.33 GiB 8.03 B CUDA 99 0 pp512 4369.85 ± 8.77
llama 8B Q4_0 4.33 GiB 8.03 B CUDA 99 0 tg128 109.93 ± 0.11
llama 8B Q4_0 4.33 GiB 8.03 B CUDA 99 1 pp512 824.77 ± 6.40
llama 8B Q4_0 4.33 GiB 8.03 B CUDA 99 1 tg128 96.29 ± 0.22

build: 9cd5a61d (3599)

I see that one of the server tests failed in the CI. I just ran the same thing locally and can't reproduce the failure. Will keep an eye on it.

max-krasnyansky avatar Aug 14 '24 02:08 max-krasnyansky

The BLAS backend is still important, at least on macOS, because Accelerate is significantly faster. OpenMP is also not available on macOS. In my opinion this is the most important use case, because other platforms have access to good implementations of OpenMP.

I think I hit a deadlock when testing with LLAMA_DEBUG=1 GGML_BLIS=1 GGML_NO_OPENMP=1 GGML_NO_LLAMAFILE=1:

./llama-bench -m models/stories260K.gguf -r 10 -t 16

Note: 16 other threads idling in OpenMP (from BLIS) omitted.

Thread 16 (Thread 0x79f5dfbed6c0 (LWP 22390) "llama-bench"): #0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335 #1 __cpu_relax () at ggml/src/ggml.c:3060 #2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142 #3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabf428) at ggml/src/ggml.c:19206 #4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabf428) at ggml/src/ggml.c:19296 #5 0x000079f5e8a97b5a in start_thread (arg=) at ./nptl/pthread_create.c:444 #6 0x000079f5e8b285fc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 15 (Thread 0x79f5e03ee6c0 (LWP 22389) "llama-bench"): #0 0x00005799cc04587a in __cpu_relax () at ggml/src/ggml.c:3061 #1 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142 #2 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabf200) at ggml/src/ggml.c:19206 #3 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabf200) at ggml/src/ggml.c:19296 #4 0x000079f5e8a97b5a in start_thread (arg=) at ./nptl/pthread_create.c:444 #5 0x000079f5e8b285fc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 14 (Thread 0x79f5e0bef6c0 (LWP 22388) "llama-bench"): #0 ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3139 #1 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabefd8) at ggml/src/ggml.c:19206 #2 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabefd8) at ggml/src/ggml.c:19296 #3 0x000079f5e8a97b5a in start_thread (arg=) at ./nptl/pthread_create.c:444 #4 0x000079f5e8b285fc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 13 (Thread 0x79f5e13f06c0 (LWP 22387) "llama-bench"): #0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335 #1 __cpu_relax () at ggml/src/ggml.c:3060 #2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142 #3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabedb0) at ggml/src/ggml.c:19206 #4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabedb0) at ggml/src/ggml.c:19296 #5 0x000079f5e8a97b5a in start_thread (arg=) at ./nptl/pthread_create.c:444 #6 0x000079f5e8b285fc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 12 (Thread 0x79f5e1bf16c0 (LWP 22386) "llama-bench"): #0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335 #1 __cpu_relax () at ggml/src/ggml.c:3060 #2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142 #3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabeb88) at ggml/src/ggml.c:19206 #4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabeb88) at ggml/src/ggml.c:19296 #5 0x000079f5e8a97b5a in start_thread (arg=) at ./nptl/pthread_create.c:444 #6 0x000079f5e8b285fc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 11 (Thread 0x79f5e23f26c0 (LWP 22385) "llama-bench"): #0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335 #1 __cpu_relax () at ggml/src/ggml.c:3060 #2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142 #3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabe960) at ggml/src/ggml.c:19206 #4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabe960) at ggml/src/ggml.c:19296 #5 0x000079f5e8a97b5a in start_thread (arg=) at ./nptl/pthread_create.c:444 #6 0x000079f5e8b285fc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 10 (Thread 0x79f5e2bf36c0 (LWP 22384) "llama-bench"): #0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335 #1 __cpu_relax () at ggml/src/ggml.c:3060 #2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142 #3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabe738) at ggml/src/ggml.c:19206 #4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabe738) at ggml/src/ggml.c:19296 #5 0x000079f5e8a97b5a in start_thread (arg=) at ./nptl/pthread_create.c:444 #6 0x000079f5e8b285fc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 9 (Thread 0x79f5e33f46c0 (LWP 22383) "llama-bench"): #0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335 #1 __cpu_relax () at ggml/src/ggml.c:3060 #2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142 #3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabe510) at ggml/src/ggml.c:19206 #4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabe510) at ggml/src/ggml.c:19296 #5 0x000079f5e8a97b5a in start_thread (arg=) at ./nptl/pthread_create.c:444 #6 0x000079f5e8b285fc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 8 (Thread 0x79f5e3bf56c0 (LWP 22382) "llama-bench"): #0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335 #1 __cpu_relax () at ggml/src/ggml.c:3060 #2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142 #3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabe2e8) at ggml/src/ggml.c:19206 #4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabe2e8) at ggml/src/ggml.c:19296 #5 0x000079f5e8a97b5a in start_thread (arg=) at ./nptl/pthread_create.c:444 #6 0x000079f5e8b285fc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 7 (Thread 0x79f5e43f66c0 (LWP 22381) "llama-bench"): #0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335 #1 __cpu_relax () at ggml/src/ggml.c:3060 #2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142 #3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabe0c0) at ggml/src/ggml.c:19206 #4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabe0c0) at ggml/src/ggml.c:19296 #5 0x000079f5e8a97b5a in start_thread (arg=) at ./nptl/pthread_create.c:444 #6 0x000079f5e8b285fc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 6 (Thread 0x79f5e4bf76c0 (LWP 22380) "llama-bench"): #0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335 #1 __cpu_relax () at ggml/src/ggml.c:3060 #2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142 #3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabde98) at ggml/src/ggml.c:19206 #4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabde98) at ggml/src/ggml.c:19296 #5 0x000079f5e8a97b5a in start_thread (arg=) at ./nptl/pthread_create.c:444 #6 0x000079f5e8b285fc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 5 (Thread 0x79f5e53f86c0 (LWP 22379) "llama-bench"): #0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335 #1 __cpu_relax () at ggml/src/ggml.c:3060 #2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142 #3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabdc70) at ggml/src/ggml.c:19206 #4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabdc70) at ggml/src/ggml.c:19296 #5 0x000079f5e8a97b5a in start_thread (arg=) at ./nptl/pthread_create.c:444 #6 0x000079f5e8b285fc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 4 (Thread 0x79f5e5bf96c0 (LWP 22378) "llama-bench"): #0 0x00005799cc04587a in __cpu_relax () at ggml/src/ggml.c:3061 #1 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142 #2 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabda48) at ggml/src/ggml.c:19206 #3 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabda48) at ggml/src/ggml.c:19296 #4 0x000079f5e8a97b5a in start_thread (arg=) at ./nptl/pthread_create.c:444 #5 0x000079f5e8b285fc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 3 (Thread 0x79f5e63fa6c0 (LWP 22377) "llama-bench"): #0 0x00005799cc04587a in __cpu_relax () at ggml/src/ggml.c:3061 #1 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142 #2 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabd820) at ggml/src/ggml.c:19206 #3 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabd820) at ggml/src/ggml.c:19296 #4 0x000079f5e8a97b5a in start_thread (arg=) at ./nptl/pthread_create.c:444 #5 0x000079f5e8b285fc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 2 (Thread 0x79f5e6bfb6c0 (LWP 22376) "llama-bench"): #0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335 #1 __cpu_relax () at ggml/src/ggml.c:3060 #2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142 #3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabd5f8) at ggml/src/ggml.c:19206 #4 0x00005799cc080ac4 in ggml_graph_compute_secondary_thread (data=0x5799cdabd5f8) at ggml/src/ggml.c:19296 #5 0x000079f5e8a97b5a in start_thread (arg=) at ./nptl/pthread_create.c:444 #6 0x000079f5e8b285fc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 1 (Thread 0x79f5e9dc4c40 (LWP 22375) "llama-bench"): #0 _mm_pause () at /usr/lib/gcc/x86_64-linux-gnu/12/include/xmmintrin.h:1335 #1 __cpu_relax () at ggml/src/ggml.c:3060 #2 0x00005799cc045960 in ggml_barrier (threadpool=0x5799cda65990) at ggml/src/ggml.c:3142 #3 0x00005799cc080791 in ggml_graph_compute_thread (data=0x5799cdabd3d0) at ggml/src/ggml.c:19206 #4 0x00005799cc0813a2 in ggml_graph_compute (cgraph=0x5799cda66998, cplan=0x7ffc9d6a28a0) at ggml/src/ggml.c:19502 #5 0x00005799cc092a1a in ggml_backend_cpu_graph_compute (backend=0x5799cda872b0, cgraph=0x5799cda66998) at ggml/src/ggml-backend.c:817 #6 0x00005799cc091807 in ggml_backend_graph_compute_async (backend=0x5799cda872b0, cgraph=0x5799cda66998) at ggml/src/ggml-backend.c:282 #7 0x00005799cc0966b4 in ggml_backend_sched_compute_splits (sched=0x5799cda5f370) at ggml/src/ggml-backend.c:1805 #8 0x00005799cc0972c8 in ggml_backend_sched_graph_compute_async (sched=0x5799cda5f370, graph=0x79f5e82df030) at ggml/src/ggml-backend.c:1992 #9 0x00005799cc1295a8 in llama_graph_compute (lctx=..., gf=0x79f5e82df030, n_threads=16, threadpool=0x5799cda65990) at src/llama.cpp:14527 #10 0x00005799cc12a344 in llama_decode_internal (lctx=..., batch_all=...) at src/llama.cpp:14781 #11 0x00005799cc1382f1 in llama_decode (ctx=0x5799cda5fe80, batch=...) at src/llama.cpp:18600 #12 0x00005799cc354d9b in test_prompt (ctx=0x5799cda5fe80, n_prompt=512, n_past=0, n_batch=2048, n_threads=16) at examples/llama-bench/llama-bench.cpp:1349 #13 0x00005799cc3558a6 in main (argc=7, argv=0x7ffc9d6a3798) at examples/llama-bench/llama-bench.cpp:1485

slaren avatar Aug 15 '24 18:08 slaren

@slaren

The BLAS backend is still important, at least on macOS, because Accelerate is significantly faster. OpenMP is also not available on macOS. In my opinion this is the most important use case, because other platforms have access to good implementations of OpenMP.

Got it. Makes sense for the BLAS then.

For other platforms there are several other advantages to using a dedicated threadpool vs OpenMP: things like the ability to specify affinity masks, priorities, etc. specific to individual llama.cpp/ggml instances. With OpenMP those settings are global per process, i.e. if an app that links libllama.so/libggml.so uses OpenMP for other stuff (say it links some other lib that uses OpenMP), then the settings conflict with each other. There are other benefits as well, like being able to reuse threadpools between llama_ctx instances, reduced dependencies, etc.
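For reference, the intended usage pattern looks roughly like this (a hedged sketch only; the identifiers below approximate the API added in this PR and may not match the final names — see the actual example in main.cpp):

    // hypothetical names based on the PR description; check main.cpp for the real API
    struct ggml_threadpool_params tpp = ggml_threadpool_params_default(/*n_threads*/ 8);
    // tpp.cpumask / tpp.strict_cpu would pin workers to specific cores (assumption)
    // tpp.prio would set the scheduling priority of the workers (assumption)

    struct ggml_threadpool * tp = ggml_threadpool_new(&tpp);

    // the same pool can be reused across contexts and across decode calls;
    // ctx_a / ctx_b are llama_context pointers created earlier (assumed for the sketch)
    llama_attach_threadpool(ctx_a, tp, /*threadpool_batch*/ NULL);
    llama_attach_threadpool(ctx_b, tp, /*threadpool_batch*/ NULL);

    // ... run llama_decode() on either context ...

    ggml_threadpool_free(tp);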

I think I hit a deadlock when testing with LLAMA_DEBUG=1 GGML_BLIS=1 GGML_NO_OPENMP=1 GGML_NO_LLAMAFILE=1:

./llama-bench -m models/stories260K.gguf -r 10 -t 16

Oh. Odd. I thought I tested that use-case. Will follow up asap.

max-krasnyansky avatar Aug 15 '24 18:08 max-krasnyansky

@slaren Sorry for not catching this earlier. The timing had to be just right to trigger that race. I reproduced it with while true; do ./llama-bench -m ../gguf/stories260K.gguf -r 10 -t 16; done on my Ryzen 9 system. Fixed it, and let that loop run for a couple of hours to make sure.

Do you think we can merge this more or less as is and then work on extending things to accommodate the BLAS backend as a follow-up? See above for the general benefits vs OpenMP, and it speeds things up on ARM64 CPUs (see my report above). It'd be good to get Windows-on-ARM and Android releases going with the threadpool enabled. We'll definitely follow up on BLAS, and there are further ideas as well (reusing temporary pools, etc.).

max-krasnyansky avatar Aug 15 '24 23:08 max-krasnyansky

@slaren Another quick question. ggml-blas.cpp is C++ and is using C++11 stuff like std::future when OpenMP is disabled.

Would it be OK to do Thread Pool V3 in C++? We can add some extern "C" APIs to call from ggml.c, but it'd be nice if the core threadpool logic were in C++ (with clean std::atomic, std::thread, ...). This way we could remove the pthread wrappers and such; we'd still need a few OS-specific functions for the CPU affinity and priority handling, but the core bits would just be clean C++11. We'd create ggml-thread.cpp and implement all the threading/CPU/NUMA related stuff in there, again with some extern "C" APIs for the rest of GGML.

We could do this as a follow-up to the current Thread Pool V2 version. Please see the question/suggestion above.
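To make the idea concrete, here is a very rough sketch of what a C++11 core plus a thin extern "C" surface could look like (all names are hypothetical; this is an illustration of the direction, not a proposed implementation):

    // hypothetical ggml-thread.cpp: pool core in C++11, exposed to ggml.c via extern "C"
    #include <atomic>
    #include <condition_variable>
    #include <cstdint>
    #include <mutex>
    #include <thread>
    #include <vector>

    struct ggml_cpp_threadpool {
        typedef void (*task_fn)(void * arg, int ith, int nth);

        explicit ggml_cpp_threadpool(int n) : n_threads(n) {
            for (int i = 1; i < n; i++) {
                workers.emplace_back(&ggml_cpp_threadpool::worker, this, i);
            }
        }

        ~ggml_cpp_threadpool() {
            { std::lock_guard<std::mutex> lock(mutex); stop = true; }
            cond.notify_all();
            for (auto & t : workers) { t.join(); }
        }

        // run fn(arg, ith, n_threads) on every thread (caller acts as thread 0) and wait
        void run(task_fn fn, void * arg) {
            {
                std::lock_guard<std::mutex> lock(mutex);
                cur_fn  = fn;
                cur_arg = arg;
                n_done.store(0, std::memory_order_relaxed);
                epoch++;
            }
            cond.notify_all();
            fn(arg, 0, n_threads);
            while (n_done.load(std::memory_order_acquire) < n_threads - 1) {
                std::this_thread::yield(); // the polling/backoff policy would plug in here
            }
        }

    private:
        void worker(int ith) {
            uint64_t seen = 0;
            while (true) {
                task_fn fn; void * arg;
                {
                    std::unique_lock<std::mutex> lock(mutex);
                    cond.wait(lock, [&] { return stop || epoch != seen; });
                    if (stop) { return; }
                    seen = epoch; fn = cur_fn; arg = cur_arg;
                }
                fn(arg, ith, n_threads);
                n_done.fetch_add(1, std::memory_order_release);
            }
        }

        int                      n_threads;
        std::vector<std::thread> workers;
        std::mutex               mutex;
        std::condition_variable  cond;
        uint64_t                 epoch   = 0;
        task_fn                  cur_fn  = nullptr;
        void *                   cur_arg = nullptr;
        std::atomic<int>         n_done{0};
        bool                     stop    = false;
    };

    // thin extern "C" surface for ggml.c
    extern "C" void ggml_cpp_tp_run(void * pool, void (*fn)(void *, int, int), void * arg) {
        static_cast<ggml_cpp_threadpool *>(pool)->run(fn, arg);
    }

A generic run-a-function interface along these lines would also cover the BLAS backend case mentioned earlier (dequantizing weights on the pool and pausing it during the BLAS call).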

max-krasnyansky avatar Aug 16 '24 00:08 max-krasnyansky

The deadlock also seems to be fixed here. I think we can merge if there aren't any significant performance regressions. I will do a more in-depth review in the following days; so far I have only looked at the performance. Using C++ would be good as long as the public ggml interface remains compatible with C; in the future we will probably continue porting parts of ggml to C++.

slaren avatar Aug 16 '24 01:08 slaren

As an extra data point: I'm not seeing a performance regression on this branch on my EPYC system. I'm seeing a single-digit percentage speedup vs master, in fact.

cpumaxx avatar Aug 19 '24 19:08 cpumaxx

@ggerganov @slaren Do you have any more suggestions/comments/concerns regarding this PR? I would suggest we merge it in and create issues to track BLAS/BLIS improvements and/or moving to C++ synchronization primitives.

fmz avatar Aug 23 '24 17:08 fmz

Not critical, but I noticed that there is a performance regression with partial offloading (-ngl 10) with Metal, at least with small models: scripts/compare-commits.sh master threadpool -m models/tinyllama-1.1b-intermediate-step-480k-1t.Q8_0.gguf -m models/llama-2-7b/ggml-model-Q4_0.gguf -t 4,8 -ngl 10

CPU Model Model Size [GiB] Threads Test t/s master t/s threadpool Speedup
M3 Max llama 1B Q8_0 1.09 4 pp512 1163.31 1067.73 0.92
M3 Max llama 1B Q8_0 1.09 4 tg128 102.36 83.20 0.81
M3 Max llama 1B Q8_0 1.09 8 pp512 1292.88 1184.32 0.92
M3 Max llama 1B Q8_0 1.09 8 tg128 104.22 89.58 0.86
M3 Max llama 7B Q4_0 3.56 4 pp512 185.83 185.25 1.00
M3 Max llama 7B Q4_0 3.56 4 tg128 24.28 24.21 1.00
M3 Max llama 7B Q4_0 3.56 8 pp512 194.85 191.47 0.98
M3 Max llama 7B Q4_0 3.56 8 tg128 30.21 31.09 1.03

slaren avatar Aug 24 '24 01:08 slaren