
Very slow - any way to speed up?

Open luke-jr opened this issue 2 years ago • 22 comments

Per #10,

Noting that the processing time is considerably shorter than the length of speech,

Yet even using 64 threads, it's taking days to process minutes of audio on my POWER9.

Has something changed since #10, or is there something I am doing wrong?

luke-jr avatar Dec 21 '22 15:12 luke-jr

What is your platform?

RndyP avatar Dec 22 '22 15:12 RndyP

@luke-jr I'm not familiar with POWER9, but from a quick ChatGPT search, it seems this CPU has a RISC architecture:

[image: ChatGPT response describing the POWER9 as a RISC architecture]

Currently, whisper.cpp supports only x86 and ARM architectures. By "support" I mean that it uses the available SIMD instruction set to make the computation efficient. On other architectures, it falls back to non-SIMD computation, which is many times slower.

Adding support for Power ISA (or whatever the instruction set is called) should not be very difficult. The matrix multiplication routines in ggml.c need to be extended to support the respective instruction set and the corresponding compile flags added to the Makefile.
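
For example, a VSX path for the F32 dot product might look roughly like the sketch below (illustrative names only; the real ggml.c routines would have to follow the existing macro pattern, and the Makefile would need something like -mcpu=power9 for the ppc64le target, so treat this as an assumption rather than the actual implementation):

#include <altivec.h>

// Hypothetical VSX dot product: 4 floats per 128-bit vector, fused multiply-add.
static float ggml_vec_dot_f32_vsx(const int n, const float * x, const float * y) {
    const int n4 = n & ~3;                        // round down to a multiple of 4
    vector float sum = {0.0f, 0.0f, 0.0f, 0.0f};

    for (int i = 0; i < n4; i += 4) {
        const vector float vx = vec_xl(0, x + i); // unaligned 128-bit loads
        const vector float vy = vec_xl(0, y + i);
        sum = vec_madd(vx, vy, sum);              // sum += vx*vy
    }

    float s = vec_extract(sum, 0) + vec_extract(sum, 1) +
              vec_extract(sum, 2) + vec_extract(sum, 3);

    for (int i = n4; i < n; ++i) {                // scalar leftovers
        s += x[i]*y[i];
    }
    return s;
}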

I don't have experience with this architecture, so hopefully someone contributes. It will be very interesting to see what the performance is on these CPUs.

ggerganov avatar Dec 22 '22 15:12 ggerganov

Yeah, I'm not surprised it isn't optimised for PPC64, but I wouldn't expect it to be orders of magnitude slower either. Real-time to days is a huge difference. :/

luke-jr avatar Dec 22 '22 19:12 luke-jr

For example, on my Ryzen 9 5950X, removing the -mavx -mavx2 -mfma -mf16c flags makes the bench tool roughly 50x slower. Removing those flags gives a setup similar to what you have on PPC64: no SIMD, no F16C support.

SIMD can make a huge difference.

ggerganov avatar Dec 22 '22 20:12 ggerganov

ChatGPT is out-of-date regarding the Power ISA being proprietary. It is open source now, just like RISC-V. See https://openpowerfoundation.org/.

fitzsim avatar Dec 23 '22 14:12 fitzsim

After #320, ./main -m models/ggml-base.en.bin -f samples/jfk.wav takes 15.7 seconds.

Additional options               Time
-p 64                            77s
-t 64                            28s
-t 1                             59.6s
-t 16                            5.8s
-t 32                            5.1s
-m models/ggml-large.bin -t 32   111.1s

ChatGPT is out-of-date regarding the Power ISA being proprietary.

In my experience, ChatGPT tends to be wrong quite often.

luke-jr avatar Dec 23 '22 14:12 luke-jr

@fitzsim @luke-jr I am planning to merge a refactored version of the SIMD routines in ggml, which I think will make things easier to maintain in the future. The PR is pretty much ready in #324.

All instruction sets fit quite nicely in the proposed pattern, but I'm having a little trouble with the ppc64le stuff since I don't have a way to test it. So for the moment, I've special-cased it, which is not great.

If you are interested and have some free time, you can take a look at the implementation and see if you can fit it in the new pattern. Or at the very least - run a test and see that it still works after the changes.

Regarding the new performance: 5s for jfk.wav using base still seems quite a lot. Not sure why the performance is so bad.

ggerganov avatar Dec 23 '22 17:12 ggerganov

@ggerganov, sure, I'll try to fit the POWER9 optimizations into the main SIMD structure, some time after #324 lands in the master branch.

Agreed regarding 5s likely not being optimal. @luke-jr, can you add the whisper_print_timings lines to your table? They may contain hints about further optimization efforts.

fitzsim avatar Dec 23 '22 22:12 fitzsim

$ time ./main -m models/ggml-base.en.bin -f samples/jfk.wav -t 32
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: adding 1607 extra tokens
whisper_model_load: mem_required  =  506.00 MB
whisper_model_load: ggml ctx size =  140.60 MB
whisper_model_load: memory size   =   22.83 MB
whisper_model_load: model size    =  140.54 MB

system_info: n_threads = 32 / 64 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 32 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =   110.77 ms
whisper_print_timings:      mel time =    49.83 ms
whisper_print_timings:   sample time =     8.41 ms
whisper_print_timings:   encode time =  3631.16 ms / 605.19 ms per layer
whisper_print_timings:   decode time =  1374.97 ms / 229.16 ms per layer
whisper_print_timings:    total time =  5175.76 ms

real    0m5.187s
user    2m31.675s
sys     0m1.078s

luke-jr avatar Dec 23 '22 22:12 luke-jr

The remaining slowness seems to be in the short-to-fp32 conversion. Would it make sense to try a GGML_TYPE_F32 version of ggml-base.en.bin, to eliminate the conversion steps? Can someone outline steps for trying that?
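
For reference, when no F16C hardware support is available, the per-element conversion amounts to roughly the following bit manipulation (a generic half-to-float sketch, not the exact ggml code), and it has to run for every FP16 weight on every use:

#include <stdint.h>
#include <string.h>

// Convert an IEEE-754 half-precision value (stored in a uint16_t) to float.
static float fp16_to_fp32_scalar(uint16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    uint32_t exp  = (h >> 10) & 0x1f;
    uint32_t mant =  h        & 0x3ff;
    uint32_t bits;

    if (exp == 0) {
        if (mant == 0) {
            bits = sign;                                    // +/- zero
        } else {
            exp = 127 - 15 + 1;                             // renormalize a subnormal
            while ((mant & 0x400) == 0) { mant <<= 1; exp--; }
            bits = sign | (exp << 23) | ((mant & 0x3ff) << 13);
        }
    } else if (exp == 0x1f) {
        bits = sign | 0x7f800000 | (mant << 13);            // inf / NaN
    } else {
        bits = sign | ((exp - 15 + 127) << 23) | (mant << 13);
    }

    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

An F32 model would skip this conversion entirely, at the cost of roughly doubling the model size.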

fitzsim avatar Dec 31 '22 05:12 fitzsim

The steps are like this:

# we need this for the f32 conversion
git clone https://github.com/openai/whisper

# create f32 ggml model (assumes you have ~/.cache/whisper/base.en.pt downloaded from original repo)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
python3 models/convert-pt-to-ggml.py ~/.cache/whisper/base.en.pt ../whisper . 1

# use the new f32 model
make -j
./main -m ./ggml-model-f32.bin samples/jfk.wav

You need the following patch/hack in whisper.cpp to increase the memory buffers:

diff --git a/whisper.cpp b/whisper.cpp
index 84c2490..8709723 100644
--- a/whisper.cpp
+++ b/whisper.cpp
@@ -131,7 +131,7 @@ static const std::map<std::string, std::pair<int, std::string>> g_lang = {
     { "su",  { 98,  "sundanese",      } },
 };
 
-static const size_t MB = 1024*1024;
+static const size_t MB = 3*1024*1024;
 
 static const std::map<e_model, size_t> MEM_REQ_MODEL = {
     { MODEL_TINY,     74ull*MB },

ggerganov avatar Dec 31 '22 08:12 ggerganov

I used the Visual Studio performance profiler to see where all the CPU time is spent. Half the time is in the SIMD code here: [image: profiler output pointing at the SIMD dot-product code]

I reviewed the code for any obvious opportunities for speed-up. Nothing major, except I believe ax[] and ay[] are not necessary. You can write the summation like so:

sum[j] = GGML_F16_VEC_FMA(sum[j], GGML_F16_VEC_LOAD(x + i + j*GGML_F16_EPR), GGML_F16_VEC_LOAD(y + i + j*GGML_F16_EPR));

This didn't help the times though; I think the optimizing compiler figures this out on its own.

The other thing that stands out is this: [image: profiler output pointing at the busy-wait while loop] Not sure if this is an opportunity for improvement. Instead of looping with while, we might want to use an event?
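
For illustration, the summation form described above, written with raw intrinsics rather than ggml's macros (a sketch only; -mavx2 -mfma -mf16c assumed, and the exact loop structure in ggml.c differs):

#include <immintrin.h>
#include <stdint.h>

// FP16 dot product with the loads/conversions folded straight into the FMA,
// i.e. no ax[]/ay[] temporaries.
static float dot_f16_folded(const int n, const uint16_t * x, const uint16_t * y) {
    const int n8 = n & ~7;
    __m256 sum = _mm256_setzero_ps();

    for (int i = 0; i < n8; i += 8) {
        sum = _mm256_fmadd_ps(
            _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(x + i))),
            _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(y + i))),
            sum);
    }

    float buf[8];
    _mm256_storeu_ps(buf, sum);
    float s = buf[0] + buf[1] + buf[2] + buf[3] + buf[4] + buf[5] + buf[6] + buf[7];

    for (int i = n8; i < n; ++i) {
        s += _cvtsh_ss(x[i]) * _cvtsh_ss(y[i]);   // scalar leftovers via F16C
    }
    return s;
}

Whether this buys anything depends on the compiler; as noted, modern optimizers usually generate the same code for both forms.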

RndyP avatar Jan 02 '23 14:01 RndyP

Thanks for the model instructions @ggerganov.

With the FP32 model and #366 I get:

$ time ./main -t 32 -m ../fp32-model/ggml-model-f32.bin samples/jfk.wav
whisper_model_load: loading model from '../fp32-model/ggml-model-f32.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 0
whisper_model_load: type          = 2
whisper_model_load: adding 1607 extra tokens
whisper_model_load: mem_required  = 1518.00 MB
whisper_model_load: ggml ctx size =  276.98 MB
whisper_model_load: memory size   =   22.83 MB
whisper_model_load: model size    =  276.92 MB

system_info: n_threads = 32 / 64 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 32 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =   236.47 ms
whisper_print_timings:      mel time =    42.08 ms
whisper_print_timings:   sample time =     4.75 ms
whisper_print_timings:   encode time =  1945.92 ms / 324.32 ms per layer
whisper_print_timings:   decode time =   933.50 ms / 155.58 ms per layer
whisper_print_timings:    total time =  3163.23 ms

real	0m3.182s
user	1m17.748s
sys	0m0.607s

fitzsim avatar Jan 03 '23 06:01 fitzsim

@fitzsim Great work! Will take a look at the PRs in the following days and merge after I make sure the other platforms work correctly.

ggerganov avatar Jan 03 '23 20:01 ggerganov

Hi, I'm kind of agreeing with @RndyP.

I profiled it a few weeks ago and found out that you are using spin locks. I changed it to an event and used WaitForMultipleObjects (I'm on Windows). CPU usage did tame down, but I didn't bother to benchmark it at that time.

These are the bench results for commit afe2db0fe2950049a460aa02173aaeb8b4a78e02. The one with the event seems to perform better on my PC.

CPU: Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz 2.21 GHz

whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: adding 1607 extra tokens
whisper_model_load: mem_required  =  506.00 MB
whisper_model_load: ggml ctx size =  140.60 MB
whisper_model_load: memory size   =   22.83 MB
whisper_model_load: model size    =  140.54 MB

system_info: n_threads = 4 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |

Spinlock: All cores 100% CPU usage

whisper_print_timings:     load time =   312.28 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  2975.37 ms / 495.89 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3288.11 ms

whisper_print_timings:     load time =   285.14 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  2932.89 ms / 488.81 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3218.43 ms

whisper_print_timings:     load time =   267.65 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  2930.10 ms / 488.35 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3198.02 ms

whisper_print_timings:     load time =   270.98 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  2821.18 ms / 470.20 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3092.38 ms

Event: CPU usage tamed

whisper_print_timings:     load time =   308.21 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  2791.27 ms / 465.21 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3099.88 ms

whisper_print_timings:     load time =   268.62 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  2687.68 ms / 447.95 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  2956.58 ms

whisper_print_timings:     load time =   267.01 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  2727.19 ms / 454.53 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  2994.49 ms

whisper_print_timings:     load time =   294.01 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  2803.29 ms / 467.22 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3097.75 ms

whisper_print_timings:     load time =   293.43 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  2876.54 ms / 479.42 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3170.70 ms

A more recent commit, f00509d57cc8e208ad2153aff3fe0af924289abc, seems slower on my PC without any change to the code. Spinlock:

whisper_print_timings:     load time =   268.64 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  3209.34 ms / 534.89 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3478.29 ms

whisper_print_timings:     load time =   270.09 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  3391.52 ms / 565.25 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3661.90 ms

whisper_print_timings:     load time =   310.33 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  3375.87 ms / 562.64 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3686.47 ms

prsyahmi avatar Jan 04 '23 00:01 prsyahmi

Can you demonstrate the Event-based Windows implementation? I tried waiting on condition_variable instead of spin locks, but it wasn't more efficient. Maybe I missed something.
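
For context, the condition-variable version of the worker wait looks roughly like this (a simplified pthreads sketch, not the exact code that was benchmarked; the real ggml workers also carry per-node compute params):

#include <pthread.h>
#include <stdbool.h>

struct work_shared {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool            has_work;
    bool            stop;
};

static void * worker_thread(void * arg) {
    struct work_shared * s = arg;
    for (;;) {
        pthread_mutex_lock(&s->mutex);
        while (!s->has_work && !s->stop) {
            pthread_cond_wait(&s->cond, &s->mutex);  // sleep instead of spinning
        }
        if (s->stop) {
            pthread_mutex_unlock(&s->mutex);
            break;
        }
        s->has_work = false;
        pthread_mutex_unlock(&s->mutex);

        // ... ggml_compute_forward(...) would run here ...
    }
    return NULL;
}

The wake-up latency of pthread_cond_signal/SetEvent is what tends to eat the gains for very short tasks, which may explain why the spin-wait was competitive.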

ggerganov avatar Jan 05 '23 19:01 ggerganov

@luke-jr Now that #369 is merged, can you try bench with various arguments and post an updated table of results to #89? Then #300 can probably be closed.

fitzsim avatar Jan 05 '23 22:01 fitzsim

@fitzsim We just merged an FP16 lookup table (#368) that is used when F16C intrinsics are not available. I believe this will lead to a significant improvement on POWER9 platforms using the F16 models.
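
The idea is roughly the following (a sketch with illustrative names, assuming ggml's public ggml_fp16_to_fp32() scalar converter; the actual code in #368 may be organized differently):

#include <stdint.h>
#include "ggml.h"   // ggml_fp16_t and ggml_fp16_to_fp32()

// Precompute all 65536 half->float conversions once at startup ...
static float table_f16_f32[1 << 16];

static void init_f16_table(void) {
    for (uint32_t i = 0; i < (1u << 16); ++i) {
        table_f16_f32[i] = ggml_fp16_to_fp32((ggml_fp16_t) i);
    }
}

// ... so that each conversion in the hot loops becomes a single table read
// instead of per-element bit manipulation.
static inline float fp16_to_fp32_lut(ggml_fp16_t h) {
    return table_f16_f32[h];
}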

ggerganov avatar Jan 06 '23 16:01 ggerganov

@ggerganov I'm using WinAPI directly. My intention was to reduce CPU usage; maybe I'll try again with condition_variable and see if it makes any difference.

index c5780ed..7ad5be6 100644
--- "a/ggml.c"
+++ "b/ggml.c"
@@ -1118,7 +1118,44 @@ inline static void ggml_vec_mad_f16(const int n, ggml_fp16_t * restrict y, ggml_
 #endif
 }
 
-inline static void ggml_vec_scale_f32(const int n, float * y, const float   v) { for (int i = 0; i < n; ++i) y[i] *= v;          }
+//inline static void ggml_vec_scale_f32(const int n, float * y, const float   v) { for (int i = 0; i < n; ++i) y[i] *= v;          }
+inline static void ggml_vec_scale_f32(const int n, float * y, const float   v) {
+#if defined(__AVX__) || defined(__AVX2__)
+    // AVX 256-bit
+    const int n32 = (n & ~31);
+
+    const __m256 v4 = _mm256_set1_ps(v);
+
+    __m256 y0, y1, y2, y3;
+
+    for (int i = 0; i < n32; i += 32) {
+        y0 = _mm256_loadu_ps(y + i + 0);
+        y1 = _mm256_loadu_ps(y + i + 8);
+        y2 = _mm256_loadu_ps(y + i + 16);
+        y3 = _mm256_loadu_ps(y + i + 24);
+
+        y0 = _mm256_mul_ps(y0, v4);
+        y1 = _mm256_mul_ps(y1, v4);
+        y2 = _mm256_mul_ps(y2, v4);
+        y3 = _mm256_mul_ps(y3, v4);
+
+        _mm256_storeu_ps(y + i + 0, y0);
+        _mm256_storeu_ps(y + i + 8, y1);
+        _mm256_storeu_ps(y + i + 16, y2);
+        _mm256_storeu_ps(y + i + 24, y3);
+    }
+
+    // leftovers
+    for (int i = n32; i < n; ++i) {
+        y[i] *= v;
+    }
+#else
+    // scalar
+    for (int i = 0; i < n; ++i) {
+        y[i] *= v;
+    }
+#endif
+}
 inline static void ggml_vec_norm_f32 (const int n, float * s, const float * x) { ggml_vec_dot_f32(n, s, x, x); *s = sqrt(*s);   }
 inline static void ggml_vec_sqr_f32  (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = x[i]*x[i];   }
 inline static void ggml_vec_sqrt_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = sqrt(x[i]); }
@@ -1621,7 +1658,7 @@ struct ggml_tensor * ggml_new_tensor_impl(
     size_needed += sizeof(struct ggml_tensor);
 
     if (cur_end + size_needed + GGML_OBJECT_SIZE > ctx->mem_size) {
-        GGML_PRINT("%s: not enough space in the context's memory pool\n", __func__);
+        GGML_PRINT("%s: not enough space in the context's memory pool (%zu/%zu needed)\n", __func__, cur_end + size_needed + GGML_OBJECT_SIZE, ctx->mem_size);
         assert(false);
         return NULL;
     }
@@ -7010,7 +7047,7 @@ typedef int ggml_lock_t;
 
 #define ggml_lock_init(x)    UNUSED(x)
 #define ggml_lock_destroy(x) UNUSED(x)
-#define ggml_lock_lock(x)    UNUSED(x)
+#define ggml_lock_lock(x)    Sleep(1)
 #define ggml_lock_unlock(x)  UNUSED(x)
 
 #define GGML_LOCK_INITIALIZER 0
@@ -7035,6 +7072,9 @@ struct ggml_compute_state {
     struct ggml_tensor * node;
 
     struct ggml_compute_state_shared * shared;
+
+    HANDLE wait_handle;
+    HANDLE fin_handle;
 };
 
 // function used by each compute thread
@@ -7052,6 +7092,17 @@ thread_ret_t ggml_graph_compute_thread(void * data) {
     const int n_threads = state->shared->n_threads;
 
     while (true) {
+        WaitForSingleObject(state->wait_handle, INFINITE);
+        if (state->node) {
+            ggml_compute_forward(&state->params, state->node);
+            state->node = NULL;
+            SetEvent(state->fin_handle);
+        } else {
+            SetEvent(state->fin_handle);
+            break;
+        }
+
+        /*
         if (atomic_fetch_add(&state->shared->n_ready, 1) == n_threads - 1) {
             atomic_store(&state->shared->has_work, false);
         } else {
@@ -7086,6 +7137,7 @@ thread_ret_t ggml_graph_compute_thread(void * data) {
         } else {
             break;
         }
+        */
     }
 
     return 0;
@@ -7106,6 +7158,7 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
         /*.stop      =*/ false,
     };
     struct ggml_compute_state * workers = n_threads > 1 ? alloca(sizeof(struct ggml_compute_state)*(n_threads - 1)) : NULL;
+    HANDLE worker_handles[16];
 
     // create thread pool
     if (n_threads > 1) {
@@ -7125,7 +7178,12 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
                 },
                 .node   = NULL,
                 .shared = &state_shared,
+                .wait_handle = CreateEvent(NULL, FALSE, FALSE, NULL),
+                .fin_handle = CreateEvent(NULL, FALSE, FALSE, NULL),
             };
+
+            worker_handles[j] = workers[j].fin_handle;
+
             int rc = pthread_create(&workers[j].thrd, NULL, ggml_graph_compute_thread, &workers[j]);
             assert(rc == 0);
             UNUSED(rc);
@@ -7345,14 +7403,14 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
 
         // COMPUTE
         if (node->n_tasks > 1) {
-            if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
+            /*if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
                 atomic_store(&state_shared.has_work, false);
             }
 
             while (atomic_load(&state_shared.has_work)) {
                 ggml_lock_lock  (&state_shared.spin);
                 ggml_lock_unlock(&state_shared.spin);
-            }
+            }*/
 
             // launch thread pool
             for (int j = 0; j < n_threads - 1; j++) {
@@ -7364,16 +7422,17 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
                     .wdata = cgraph->work ? cgraph->work->data : NULL,
                 };
                 workers[j].node = node;
+                SetEvent(workers[j].wait_handle);
             }
 
-            atomic_fetch_sub(&state_shared.n_ready, 1);
+            /*atomic_fetch_sub(&state_shared.n_ready, 1);
 
             while (atomic_load(&state_shared.n_ready) > 0) {
                 ggml_lock_lock  (&state_shared.spin);
                 ggml_lock_unlock(&state_shared.spin);
             }
 
-            atomic_store(&state_shared.has_work, true);
+            atomic_store(&state_shared.has_work, true);*/
         }
 
         params.type = GGML_TASK_COMPUTE;
@@ -7381,7 +7440,8 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
 
         // wait for thread pool
         if (node->n_tasks > 1) {
-            if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
+            WaitForMultipleObjects(n_threads - 1, worker_handles, TRUE, INFINITE);
+            /*if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
                 atomic_store(&state_shared.has_work, false);
             }
 
@@ -7395,19 +7455,19 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
             while (atomic_load(&state_shared.n_ready) != 0) {
                 ggml_lock_lock  (&state_shared.spin);
                 ggml_lock_unlock(&state_shared.spin);
-            }
+            }*/
         }
 
         // FINALIZE
         if (node->n_tasks > 1) {
-            if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
+            /*if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
                 atomic_store(&state_shared.has_work, false);
             }
 
             while (atomic_load(&state_shared.has_work)) {
                 ggml_lock_lock  (&state_shared.spin);
                 ggml_lock_unlock(&state_shared.spin);
-            }
+            }*/
 
             // launch thread pool
             for (int j = 0; j < n_threads - 1; j++) {
@@ -7419,16 +7479,17 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
                     .wdata = cgraph->work ? cgraph->work->data : NULL,
                 };
                 workers[j].node = node;
+                SetEvent(workers[j].wait_handle);
             }
 
-            atomic_fetch_sub(&state_shared.n_ready, 1);
+            /*atomic_fetch_sub(&state_shared.n_ready, 1);
 
             while (atomic_load(&state_shared.n_ready) > 0) {
                 ggml_lock_lock  (&state_shared.spin);
                 ggml_lock_unlock(&state_shared.spin);
             }
 
-            atomic_store(&state_shared.has_work, true);
+            atomic_store(&state_shared.has_work, true);*/
         }
 
         params.type = GGML_TASK_FINALIZE;
@@ -7436,7 +7497,8 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
 
         // wait for thread pool
         if (node->n_tasks > 1) {
-            if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
+            WaitForMultipleObjects(n_threads - 1, worker_handles, TRUE, INFINITE);
+            /*if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
                 atomic_store(&state_shared.has_work, false);
             }
 
@@ -7450,7 +7512,7 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
             while (atomic_load(&state_shared.n_ready) != 0) {
                 ggml_lock_lock  (&state_shared.spin);
                 ggml_lock_unlock(&state_shared.spin);
-            }
+            }*/
         }
 
         // performance stats (node)
@@ -7470,6 +7532,7 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
         atomic_store(&state_shared.has_work, true);
 
         for (int j = 0; j < n_threads - 1; j++) {
+            SetEvent(workers[j].wait_handle);
             int rc = pthread_join(workers[j].thrd, NULL);
             assert(rc == 0);
             UNUSED(rc);

prsyahmi avatar Jan 07 '23 00:01 prsyahmi

@ggerganov Yes, 87dd4a30811ee07700ee6fee267508e8935b32fc is about half-a-second faster on the jfk example, I guess due to the FP16 lookup table.

fitzsim avatar Jan 08 '23 04:01 fitzsim

@fitzsim I won't be in any position to do anything any time soon, unfortunately. (link: https://www.fxstreet.com/cryptocurrencies/news/bitcoin-core-developer-loses-nearly-35-million-in-btc-changpeng-zhao-of-binance-offers-help-202301020843)

luke-jr avatar Jan 08 '23 04:01 luke-jr

Luke, I'm so sorry to hear the news. Good luck


jaybinks avatar Jan 08 '23 05:01 jaybinks