Hi LLAMA team,

I use llama tag b2581 on Windows ARM64 PC, the performance is more lower than previous tag b1500. Please refer to below detailed information. What is the reason? Please help on this issue.

Thanks a lot!

[Detailed information]

Command: main.exe -m llama-2-7b-chat.ggufv3.q4_0.bin --color --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1.1 -t 10

Prompt: I have 3 years of experience as a software developer. Now I got bored with coding and want to transition to another career. My education qualifications are B. Tech in computer science, and I am well-versed in understanding the business side of software as well. Suggest a list of career options that are easy for me to transition.

system_info: n_threads = 10 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |

Tag b1500 results: llama_print_timings: load time = 723.53 ms llama_print_timings: sample time = 925.29 ms / 624 runs ( 1.48 ms per token, 674.38 tokens per second) llama_print_timings: prompt eval time = 2583.12 ms / 91 tokens ( 28.39 ms per token, 35.23 tokens per second) llama_print_timings: eval time = 31693.17 ms / 625 runs ( 50.71 ms per token, 19.72 tokens per second) llama_print_timings: total time = 51797.58 ms

Tag b2581 results: llama_print_timings: load time = 963.25 ms llama_print_timings: sample time = 416.14 ms / 586 runs ( 0.71 ms per token, 1408.17 tokens per second) llama_print_timings: prompt eval time = 11847.94 ms / 94 tokens ( 126.04 ms per token, 7.93 tokens per second) llama_print_timings: eval time = 68542.50 ms / 585 runs ( 117.17 ms per token, 8.53 tokens per second) llama_print_timings: total time = 82696.57 ms / 679 tokens

Apr 01 '24 03:04 Billzhong2022

Which platform is that on? Snapdragon 8cx Gen 3 I presume?

Apr 01 '24 10:04 woachk

Hi LLAMA team,

I use platform https://www.qualcomm.com/products/mobile/snapdragon/pcs-and-tablets/snapdragon-x-elite.

Thanks!

Apr 02 '24 03:04 Billzhong2022

If you want any chance of getting this fixed, do a git bisect to identify the exact commit that caused performance regression and notify the corresponding dev.

Apr 02 '24 10:04 JohannesGaessler

Hi LLAMA team,

I tried to do "git bisect" to find root reason for it, but there're huge patches added between tag b1500 and b2581.

Can you please help check and analyze this issue?

Thanks!

Apr 03 '24 03:04 Billzhong2022

I tried to do "git bisect" to find root reason for it, but there're huge patches added between tag b1500 and b2581.

Download the model as the original weights and convert it from that at each git bisect iteration.

Can you please help check and analyze this issue?

As I said, if you cannot identify the commit that is causing the performance regression this has basically no chance of getting fixed. This issue seems to be hardware-specific so without the corresponding hardware it is impossible to find the bad commit. If there was a performance regression for the hardware of any of the devs they would have already reported it.

Apr 03 '24 08:04 JohannesGaessler

I get significantly lower perf than expected on 8cx Gen 3 too, and can also get access to X Elite hardware.

Apr 03 '24 10:04 woachk

Hi LLAMA team,

Is it related with FP16 or NEON feature?

Thanks!

Apr 04 '24 08:04 Billzhong2022

Hi LLAMA team,

Do you find more useful information for this issue?

Thanks!

Apr 07 '24 11:04 Billzhong2022

As I've said two times already: without a git bisect from one of the affected people nothing is going to happen.

Apr 07 '24 12:04 JohannesGaessler

Hi LLAMA team,

I've found the commit "780e24a" between tag b1951 & b1952 caused performance decreated on Windows ARM64 PC.

Please help on this issue continually.

Thanks!

[Commit]

commit 780e24a22eb595b705cbe8284771e9ceff1c4dd2 Author: Reinforce-II [email protected] Date: Mon Jan 22 21:15:08 2024 +0800

ggml : parallelize FP32 conversion when using BLAS (#5045)

* make GGML_TASK_INIT phase can be run in multithread

* multithreaded dequantize in mul_mat when using blas library

* minor fixes

* update outdated comment
* fix coding style

* simplify code

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>

==============================================================================

Please refer to below test results.

[Test results]

Command: main.exe -m llama-2-7b-chat.ggufv3.q4_0.bin --color --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1.1 -t 10

Prompt: I have 3 years of experience as a software developer. Now I got bored with coding and want to transition to another career. My education qualifications are B. Tech in computer science, and I am well-versed in understanding the business side of software as well. Suggest a list of career options that are easy for me to transition.

system_info: n_threads = 10 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |

Tag b1951 results: llama_print_timings: load time = 839.94 ms llama_print_timings: sample time = 895.91 ms / 568 runs ( 1.58 ms per token, 633.99 tokens per second) llama_print_timings: prompt eval time = 2770.04 ms / 93 tokens ( 29.79 ms per token, 33.57 tokens per second) llama_print_timings: eval time = 28958.15 ms / 568 runs ( 50.98 ms per token, 19.61 tokens per second) llama_print_timings: total time = 43121.95 ms / 661 tokens

Tag b1952 results: llama_print_timings: load time = 1381.51 ms llama_print_timings: sample time = 1033.81 ms / 552 runs ( 1.87 ms per token, 533.95 tokens per second) llama_print_timings: prompt eval time = 9733.86 ms / 93 tokens ( 104.67 ms per token, 9.55 tokens per second) llama_print_timings: eval time = 69562.43 ms / 552 runs ( 126.02 ms per token, 7.94 tokens per second) llama_print_timings: total time = 83113.79 ms / 645 tokens

Tag b2581 default results: llama_print_timings: load time = 963.25 ms llama_print_timings: sample time = 416.14 ms / 586 runs ( 0.71 ms per token, 1408.17 tokens per second) llama_print_timings: prompt eval time = 11847.94 ms / 94 tokens ( 126.04 ms per token, 7.93 tokens per second) llama_print_timings: eval time = 68542.50 ms / 585 runs ( 117.17 ms per token, 8.53 tokens per second) llama_print_timings: total time = 82696.57 ms / 679 tokens

Tag b2581 after removing commit 780e24a2 results: llama_print_timings: load time = 1055.60 ms llama_print_timings: sample time = 436.21 ms / 621 runs ( 0.70 ms per token, 1423.64 tokens per second) llama_print_timings: prompt eval time = 18649.48 ms / 94 tokens ( 198.40 ms per token, 5.04 tokens per second) llama_print_timings: eval time = 32446.21 ms / 620 runs ( 52.33 ms per token, 19.11 tokens per second) llama_print_timings: total time = 53483.55 ms / 714 tokens

Apr 08 '24 09:04 Billzhong2022

I am seeing virtually no difference with 6.6.19-1-MANJARO and a Ryzen 5950X.

Apr 08 '24 14:04 JohannesGaessler

@Billzhong2022 what if you limit the cores used through -C to the two "big" complexes?

Apr 08 '24 14:04 woachk

@Billzhong2022 could you provide those testing results

running on a single cluster, e.g. taskset -c 0-3 <...>, or something equivalent on windows
use -t 4~~, and set environemnt variable like OPENBLAS_NUM_THREADS=4 if you are using blas backend~~

Apr 08 '24 15:04 ReinForce-II

Hi LLAMA team,

Do you have patch to fix this issue?

Thanks!

Apr 09 '24 14:04 Billzhong2022

Hi LLAMA team,

Any update? Do you use MACRO "GGML_USE_OPENBLAS" for all your modified codes in commit https://github.com/ggerganov/llama.cpp/commit/780e24a22eb595b705cbe8284771e9ceff1c4dd2?

@ReinForce-II use -t 4: The performance is not good.

Thanks!

Apr 14 '24 08:04 Billzhong2022

Hi LLAMA team,

Any update? Do you use MACRO "GGML_USE_OPENBLAS" for all your modified codes in commit 780e24a?

@ReinForce-II use -t 4: The performance is not good.

Thanks!

How about 1.? You can run cmd /c start /b /affinity 0xf main.exe -t 4 ... in powershell, better if you can run cmd /c start /b /affinity 0x1e main.exe -t 4 ... additionally

thanks.

Apr 15 '24 03:04 ReinForce-II

Hi @ReinForce-II ,

How about 1.? You can run cmd /c start /b /affinity 0xf main.exe -t 4 ... in powershell, Answer: The performance is very very bad.

better if you can run cmd /c start /b /affinity 0x1e main.exe -t 4 ... additionally Answer: The performance is bad.

Please help on this issue with high priority.

Thanks!

Apr 15 '24 10:04 Billzhong2022

Hi @ReinForce-II ,

How about 1.? You can run cmd /c start /b /affinity 0xf main.exe -t 4 ... in powershell, Answer: The performance is very very bad.

better if you can run cmd /c start /b /affinity 0x1e main.exe -t 4 ... additionally Answer: The performance is bad.

Please help on this issue with high priority.

Thanks!

This issue might to be specific to qualcomm laptop platforms. I haven't been able to reproduce it on several arm-based hardwares such as ampere, rockchip, graviton. Sorry, but I currently have no access to either 8cx gen3 or x elite, so I have no idea about it now.

Apr 16 '24 04:04 ReinForce-II

For reference: the X Elite has 3 4-core clusters.

The cores across the 3 clusters are identical but the "lower power" cluster has 1/2 the link to fabric as the two other ones, and as such can only use half as much memory bandwidth.

For applications that expect somewhat uniform throughput between the cores, that can cause a breakdown in expected versus realised performance.

Apr 16 '24 23:04 woachk

Hi @ReinForce-II ,

But after reverting commit https://github.com/ggerganov/llama.cpp/commit/780e24a22eb595b705cbe8284771e9ceff1c4dd2, the performance is much better on platform https://www.qualcomm.com/products/mobile/snapdragon/pcs-and-tablets/snapdragon-x-elite. How to explain it?

Thanks!

Apr 22 '24 07:04 Billzhong2022

Hi @ReinForce-II ,

But after reverting commit 780e24a, the performance is much better on platform https://www.qualcomm.com/products/mobile/snapdragon/pcs-and-tablets/snapdragon-x-elite. How to explain it?

Thanks!

It might have something to do with more synchronize operations in the commit

Apr 22 '24 07:04 ReinForce-II

Hi @ReinForce-II ,

Ok. Is it fixed?

Thanks!

Apr 24 '24 07:04 Billzhong2022

Hi @ReinForce-II ,

Please help debug and fix this issue.

Thank you very much!

Apr 29 '24 10:04 Billzhong2022

Hi @ReinForce-II ,

Please help debug and fix this issue.

Thank you very much!

It would be great help if you can kindly provide some sampling results from snapdragon profiler otherwise, there's a higher likelihood of the process getting stuck here, until the x elite products become publicly availiable

Apr 29 '24 10:04 ReinForce-II

Hi, @Billzhong2022

Please take a look at 4ae60ad8 The commit is not specified for snapdragon device, but it might also alleviate your problem.

hope for your feedback.

May 14 '24 11:05 ReinForce-II

Hi @ReinForce-II ,

After applying patch 4ae60ad8, the llama2 performance is not good on Snapdragon X Elite device.

Please dig out this issue further.

Thanks!

Logs: llama_print_timings: load time = 3453.65 ms llama_print_timings: sample time = 372.97 ms / 658 runs ( 0.57 ms per token, 1764.23 tokens per second) llama_print_timings: prompt eval time = 15051.02 ms / 94 tokens ( 160.12 ms per token, 6.25 tokens per second) llama_print_timings: eval time = 115842.64 ms / 657 runs ( 176.32 ms per token, 5.67 tokens per second) llama_print_timings: total time = 143545.05 ms / 751 tokens

May 15 '24 08:05 Billzhong2022

Hi @ReinForce-II ,

Can you please refine commit https://github.com/ggerganov/llama.cpp/commit/780e24a22eb595b705cbe8284771e9ceff1c4dd2 directly?

Thank you very much!

May 16 '24 07:05 Billzhong2022

@ReinForce-II It seems the performance issue is due to the function 'ggml_compute_forward()' was called twice in function 'ggml_graph_compute_thread()'. It was just called once in previous code. May we avoid this? Thanks in advance! By the way, I haven't enable Blas feature.

May 19 '24 23:05 quic-zhanweiw

@ReinForce-II It seems the performance issue is due to the function 'ggml_compute_forward()' was called twice in function 'ggml_graph_compute_thread()'. It was just called once in previous code. May we avoid this? Thanks in advance! By the way, I haven't enable Blas feature.

In previous code, ggml_compute_forward is called once with GGML_TASK_INIT state, by only one of the threads, the remaining threads will spin waiting it returns. Then called once with GGML_TASK_COMPUTE state, by all threads. After, both GGML_TASK_INIT and GGML_TASK_COMPUTE are called by all threads. In GGML_TASK_INIT state, if you are running w/o BLAS feature, one of the threads will do the work, the remainging threads returns immediately and spin waiting it. I think there is no difference in this part.

May 20 '24 02:05 ReinForce-II

Thanks @ReinForce-II for your update! In the latest Llama.cpp code, after add the patch below, the performance issue disappeared. May you help check the reason?

diff --git a/ggml.c b/ggml.c
index b96a82a4..f71369f6 100644
--- a/ggml.c
+++ b/ggml.c
@@ -19494,13 +19494,13 @@ static thread_ret_t ggml_graph_compute_thread(void * data) {
 
                 params.nth = n_tasks;
 
-                if (n_tasks == 1) {
-                    /* INIT */
-                    if (GGML_OP_HAS_INIT[node->op]) {
-                        params.type = GGML_TASK_TYPE_INIT;
-                        ggml_compute_forward(&params, node);
-                    }
+                /* INIT */
+                if (GGML_OP_HAS_INIT[node->op]) {
+                    params.type = GGML_TASK_TYPE_INIT;
+                    ggml_compute_forward(&params, node);
+                }
 
+                if (n_tasks == 1) {
                     // TODO: maybe push node_n to the atomic but if other threads see n_tasks is 1,
                     // they do something more efficient than spinning (?)
                     params.type = GGML_TASK_TYPE_COMPUTE;
@@ -19524,10 +19524,10 @@ static thread_ret_t ggml_graph_compute_thread(void * data) {
             task_phase = GGML_TASK_TYPE_INIT;
             atomic_store(&state->shared->n_active,  n_threads);
             atomic_store(&state->shared->node_n,    node_n);
-            atomic_store(&state->shared->node_task, task_phase);
+            // atomic_store(&state->shared->node_task, task_phase);
         } else {
             ggml_graph_compute_thread_sync_node(&node_n,     state, false);
-            ggml_graph_compute_thread_sync_task(&task_phase, state, false);
+            // ggml_graph_compute_thread_sync_task(&task_phase, state, false);
         }
 
         // check if we should stop
@@ -19538,13 +19538,17 @@ static thread_ret_t ggml_graph_compute_thread(void * data) {
         const int n_tasks = ggml_get_n_tasks(node, n_threads, state->shared->n_threads);
 
         struct ggml_compute_params params = {
-            /*.type  =*/ GGML_TASK_TYPE_INIT,
+            /*.type  =*/ GGML_TASK_TYPE_COMPUTE,
             /*.ith   =*/ state->ith,
             /*.nth   =*/ n_tasks,
             /*.wsize =*/ cplan->work_size,
             /*.wdata =*/ cplan->work_data,
         };
 
+        if (state->ith < n_tasks) {
+            ggml_compute_forward(&params, node);
+        }
+/*
         if (state->ith < n_tasks) {
             if (GGML_OP_HAS_INIT[node->op]) {
                 ggml_compute_forward(&params, node);
@@ -19579,6 +19583,7 @@ static thread_ret_t ggml_graph_compute_thread(void * data) {
         else {
             ggml_graph_compute_thread_sync_task(&task_phase, state, false);
         }
+*/
     }
 
     return 0;

May 20 '24 02:05 quic-zhanweiw

Performance decreated between tag b1500 and b2581 on Windows ARM64 PC

[Commit]