Very slow - any way to speed up?
Per #10, the processing time should be considerably shorter than the length of the speech.
Yet even using 64 threads, it's taking days to process minutes of audio on my POWER9.
Has something changed since #10, or is there something I am doing wrong?
What is your platform?
@luke-jr I'm not familiar with POWER9, but from a quick ChatGPT search, it seems this CPU has a RISC architecture:

Currently, whisper.cpp supports only the x86 and ARM architectures. By "support" I mean that it uses the available SIMD instruction set to make the computation efficient. On other architectures, it falls back to non-SIMD computation, which is many times slower.
Adding support for the Power ISA (or whatever the instruction set is called) should not be very difficult. The matrix multiplication routines in ggml.c need to be extended to use the respective instruction set, and the corresponding compile flags added to the Makefile.
I don't have experience with this architecture, so hopefully someone contributes. It will be very interesting to see what the performance is on these CPUs.
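For a rough idea of what that extension could look like, here is an untested sketch of a VSX (the Power ISA SIMD extension) dot-product routine; the function name and layout are only illustrative and are not taken from ggml.c:

```c
// Rough, untested sketch of a Power ISA (VSX) dot product in the style of the
// existing x86/ARM SIMD paths in ggml.c. The function name is hypothetical.
#include <altivec.h>

static void vec_dot_f32_vsx(const int n, float * restrict s,
                            const float * restrict x, const float * restrict y) {
    const int n4 = n & ~3;                   // 4 floats per 128-bit VSX register
    vector float sum = vec_splats(0.0f);

    for (int i = 0; i < n4; i += 4) {
        vector float vx = vec_xl(0, x + i);  // unaligned vector loads
        vector float vy = vec_xl(0, y + i);
        sum = vec_madd(vx, vy, sum);         // fused multiply-add
    }

    float sumf = sum[0] + sum[1] + sum[2] + sum[3];
    for (int i = n4; i < n; ++i) {           // scalar leftovers
        sumf += x[i]*y[i];
    }
    *s = sumf;
}
```

The Makefile would also need the matching compiler flag (for example -mvsx or -mcpu=power9) so that the compiler emits VSX instructions.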
Yeah, I'm not surprised it isn't optimised for PPC64, but I wouldn't expect it to be orders of magnitude slower either. Real-time to days is a huge difference. :/
For example, on my Ryzen 9 5950X, removing the -mavx -mavx2 -mfma -mf16c flags made the bench tool about 50x slower. Removing those flags is similar to what you have on PPC64 - no SIMD, no F16C support.
SIMD can make a huge difference.
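As a concrete illustration of what the -mf16c flag provides (this is not code from ggml.c), an FP16-to-FP32 conversion loop with the F16C intrinsics handles eight values per instruction, whereas without the flag every element goes through a much slower scalar conversion:

```c
// Illustrative sketch of an F16C conversion loop; without -mf16c these
// intrinsics are unavailable and a scalar bit-twiddling fallback is needed.
#include <immintrin.h>
#include <stdint.h>

static void fp16_to_fp32_row(const uint16_t * src, float * dst, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m128i h = _mm_loadu_si128((const __m128i *)(src + i)); // 8 FP16 values
        _mm256_storeu_ps(dst + i, _mm256_cvtph_ps(h));           // convert to 8 FP32
    }
    for (; i < n; ++i) {
        dst[i] = _cvtsh_ss(src[i]); // single-element F16C conversion for leftovers
    }
}
```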
ChatGPT is out-of-date regarding the Power ISA being proprietary. It is open source now, just like RISC-V. See https://openpowerfoundation.org/.
After #320, ./main -m models/ggml-base.en.bin -f samples/jfk.wav takes 15.7 seconds.
| Additional options | Time |
|---|---|
| -p 64 | 77s |
| -t 64 | 28s |
| -t 1 | 59.6s |
| -t 16 | 5.8s |
| -t 32 | 5.1s |
| -m models/ggml-large.bin -t 32 | 111.1s |
> ChatGPT is out-of-date regarding the Power ISA being proprietary.
In my experience, ChatGPT tends to be wrong quite often.
@fitzsim @luke-jr
I am planning to merge a refactored version of the SIMD routines in ggml which I think will make things easier to maintain in the future. The PR is pretty much ready in #324
All instruction sets fit quite nicely in the proposed pattern, but I'm having a little trouble with the ppc64le stuff since I don't have a way to test it. So for the moment, I've special-cased it, which is not great.
If you are interested and have some free time, you can take a look at the implementation and see if you can fit it in the new pattern. Or at the very least - run a test and see that it still works after the changes.
Regarding the new performance: 5s for jfk.wav with the base model still seems like quite a lot. Not sure why the performance is so bad.
@ggerganov, sure, I'll try to fit the POWER9 optimizations into the main SIMD structure, some time after #324 lands in the master branch.
Agreed regarding 5s likely not being optimal. @luke-jr, can you add the whisper_print_timings lines to your table? They may contain hints about further optimization efforts.
$ time ./main -m models/ggml-base.en.bin -f samples/jfk.wav -t 32
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: adding 1607 extra tokens
whisper_model_load: mem_required = 506.00 MB
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size = 22.83 MB
whisper_model_load: model size = 140.54 MB
system_info: n_threads = 32 / 64 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 32 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
whisper_print_timings: load time = 110.77 ms
whisper_print_timings: mel time = 49.83 ms
whisper_print_timings: sample time = 8.41 ms
whisper_print_timings: encode time = 3631.16 ms / 605.19 ms per layer
whisper_print_timings: decode time = 1374.97 ms / 229.16 ms per layer
whisper_print_timings: total time = 5175.76 ms
real 0m5.187s
user 2m31.675s
sys 0m1.078s
The remaining slowness seems to be in the short-to-fp32 conversion. Would it make sense to try a GGML_TYPE_F32 version of ggml-base.en.bin, to eliminate the conversion steps? Can someone outline steps for trying that?
The steps are like this:
# we need this for the f32 conversion
git clone https://github.com/openai/whisper
# create f32 ggml model (assumes you have ~/.cache/whisper/base.en.pt downloaded from original repo)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
python3 models/convert-pt-to-ggml.py ~/.cache/whisper/base.en.pt ../whisper . 1
# use the new f32 model
make -j
./main -m ./ggml-model-f32.bin samples/jfk.wav
You need the following patch/hack in whisper.cpp to increase the memory buffers:
diff --git a/whisper.cpp b/whisper.cpp
index 84c2490..8709723 100644
--- a/whisper.cpp
+++ b/whisper.cpp
@@ -131,7 +131,7 @@ static const std::map<std::string, std::pair<int, std::string>> g_lang = {
{ "su", { 98, "sundanese", } },
};
-static const size_t MB = 1024*1024;
+static const size_t MB = 3*1024*1024;
static const std::map<e_model, size_t> MEM_REQ_MODEL = {
{ MODEL_TINY, 74ull*MB },
I used the Visual Studio performance profiler to see where all the CPU time is spent. Half the time is in the SIMD code here:

I reviewed the code for any obvious opportunities for speedup. Nothing major, except that I believe ax[] and ay[] are not necessary. You can write the summation like so:
sum[j] = GGML_F16_VEC_FMA(sum[j], GGML_F16_VEC_LOAD(x + i + j*GGML_F16_EPR), GGML_F16_VEC_LOAD(y + i + j*GGML_F16_EPR));
This didn't help the times, though; I think the optimizing compiler figures this out on its own.
The other thing that stands out is this:
Not sure if this is an opportunity for improvement, but instead of spinning in a while loop, it might be better to wait on an event.
Thanks for the model instructions @ggerganov.
With the FP32 model and #366 I get:
$ time ./main -t 32 -m ../fp32-model/ggml-model-f32.bin samples/jfk.wav
whisper_model_load: loading model from '../fp32-model/ggml-model-f32.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 0
whisper_model_load: type = 2
whisper_model_load: adding 1607 extra tokens
whisper_model_load: mem_required = 1518.00 MB
whisper_model_load: ggml ctx size = 276.98 MB
whisper_model_load: memory size = 22.83 MB
whisper_model_load: model size = 276.92 MB
system_info: n_threads = 32 / 64 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 32 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
whisper_print_timings: load time = 236.47 ms
whisper_print_timings: mel time = 42.08 ms
whisper_print_timings: sample time = 4.75 ms
whisper_print_timings: encode time = 1945.92 ms / 324.32 ms per layer
whisper_print_timings: decode time = 933.50 ms / 155.58 ms per layer
whisper_print_timings: total time = 3163.23 ms
real 0m3.182s
user 1m17.748s
sys 0m0.607s
@fitzsim Great work! Will take a look at the PRs in the following days and merge after I make sure the other platforms work correctly.
Hi, I kind of agree with @RndyP.
I profiled this a few weeks ago and found that you are using spin locks. I changed them to events and used WaitForMultipleObjects (I'm on Windows). CPU usage did tame down, but I didn't bother to benchmark it at the time.
These are the bench results for commit afe2db0fe2950049a460aa02173aaeb8b4a78e02. The event-based version seems to perform better on my PC.
CPU: Intel(R) Core(TM) i7-8750H @ 2.20GHz
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: adding 1607 extra tokens
whisper_model_load: mem_required = 506.00 MB
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size = 22.83 MB
whisper_model_load: model size = 140.54 MB
system_info: n_threads = 4 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |
Spinlock: All cores 100% CPU usage
whisper_print_timings: load time = 312.28 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 2975.37 ms / 495.89 ms per layer
whisper_print_timings: decode time = 0.00 ms / 0.00 ms per layer
whisper_print_timings: total time = 3288.11 ms
whisper_print_timings: load time = 285.14 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 2932.89 ms / 488.81 ms per layer
whisper_print_timings: decode time = 0.00 ms / 0.00 ms per layer
whisper_print_timings: total time = 3218.43 ms
whisper_print_timings: load time = 267.65 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 2930.10 ms / 488.35 ms per layer
whisper_print_timings: decode time = 0.00 ms / 0.00 ms per layer
whisper_print_timings: total time = 3198.02 ms
whisper_print_timings: load time = 270.98 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 2821.18 ms / 470.20 ms per layer
whisper_print_timings: decode time = 0.00 ms / 0.00 ms per layer
whisper_print_timings: total time = 3092.38 ms
Event: CPU usage tamed
whisper_print_timings: load time = 308.21 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 2791.27 ms / 465.21 ms per layer
whisper_print_timings: decode time = 0.00 ms / 0.00 ms per layer
whisper_print_timings: total time = 3099.88 ms
whisper_print_timings: load time = 268.62 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 2687.68 ms / 447.95 ms per layer
whisper_print_timings: decode time = 0.00 ms / 0.00 ms per layer
whisper_print_timings: total time = 2956.58 ms
whisper_print_timings: load time = 267.01 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 2727.19 ms / 454.53 ms per layer
whisper_print_timings: decode time = 0.00 ms / 0.00 ms per layer
whisper_print_timings: total time = 2994.49 ms
whisper_print_timings: load time = 294.01 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 2803.29 ms / 467.22 ms per layer
whisper_print_timings: decode time = 0.00 ms / 0.00 ms per layer
whisper_print_timings: total time = 3097.75 ms
whisper_print_timings: load time = 293.43 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 2876.54 ms / 479.42 ms per layer
whisper_print_timings: decode time = 0.00 ms / 0.00 ms per layer
whisper_print_timings: total time = 3170.70 ms
The more recent commit f00509d57cc8e208ad2153aff3fe0af924289abc seems slower on my PC, without any change to the code. Spinlock:
whisper_print_timings: load time = 268.64 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 3209.34 ms / 534.89 ms per layer
whisper_print_timings: decode time = 0.00 ms / 0.00 ms per layer
whisper_print_timings: total time = 3478.29 ms
whisper_print_timings: load time = 270.09 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 3391.52 ms / 565.25 ms per layer
whisper_print_timings: decode time = 0.00 ms / 0.00 ms per layer
whisper_print_timings: total time = 3661.90 ms
whisper_print_timings: load time = 310.33 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 3375.87 ms / 562.64 ms per layer
whisper_print_timings: decode time = 0.00 ms / 0.00 ms per layer
whisper_print_timings: total time = 3686.47 ms
Can you demonstrate the Event-based Windows implementation?
I tried waiting on condition_variable instead of spin locks, but it wasn't more efficient. Maybe I missed something.
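For context, the condition-variable approach looks roughly like the following POSIX sketch (the struct and function names are made up for illustration; this is not the actual ggml.c code):

```c
// Minimal sketch of a condition-variable wait as an alternative to the
// spin-wait. std::condition_variable on the C++ side is the same idea.
#include <pthread.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool            has_work;
} work_signal;

#define WORK_SIGNAL_INIT { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false }

// Worker side: sleep until the main thread publishes work (no busy loop).
static void work_wait(work_signal * ws) {
    pthread_mutex_lock(&ws->mutex);
    while (!ws->has_work) {
        pthread_cond_wait(&ws->cond, &ws->mutex);
    }
    ws->has_work = false;
    pthread_mutex_unlock(&ws->mutex);
}

// Main-thread side: publish work and wake one sleeping worker.
static void work_post(work_signal * ws) {
    pthread_mutex_lock(&ws->mutex);
    ws->has_work = true;
    pthread_cond_signal(&ws->cond);
    pthread_mutex_unlock(&ws->mutex);
}
```

This trades the 100% busy-wait CPU usage for kernel wake-ups, which may or may not be faster depending on how short the per-node work is.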
@luke-jr
Now that #369 is merged, can you try bench with various arguments and post an updated table of results to #89? Then #300 can probably be closed.
@fitzsim We just merged a FP16 lookup-table (#368) that is used when F16C intrinsics are not available. I believe that this will lead to significant improvement on POWER9 platforms using the F16 models.
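The idea behind the lookup table, roughly (the names below are illustrative, not the exact ggml implementation): precompute the FP32 value for every possible 16-bit pattern once at startup, so each conversion becomes a single table load instead of a bit-manipulation routine:

```c
// Illustrative sketch of the FP16 lookup-table idea from #368; the table and
// function names are made up, the real implementation lives in ggml.c.
#include <stdint.h>

// Any scalar half->float converter works here; ggml exposes one as
// ggml_fp16_to_fp32(). Declared here only to keep the sketch self-contained.
extern float ggml_fp16_to_fp32(uint16_t h);

static float fp16_to_fp32_table[1 << 16];

// Fill the table once at startup: one FP32 value per possible FP16 bit pattern.
static void init_fp16_table(void) {
    for (uint32_t i = 0; i < (1u << 16); ++i) {
        fp16_to_fp32_table[i] = ggml_fp16_to_fp32((uint16_t) i);
    }
}

// After initialization, conversion is a single array load.
static inline float fp16_lookup(uint16_t h) {
    return fp16_to_fp32_table[h];
}
```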
@ggerganov I'm using the WinAPI directly. My intention was to reduce CPU usage; maybe I'll try again with condition_variable and see if it makes any difference.
index c5780ed..7ad5be6 100644
--- "a/ggml.c"
+++ "b/ggml.c"
@@ -1118,7 +1118,44 @@ inline static void ggml_vec_mad_f16(const int n, ggml_fp16_t * restrict y, ggml_
#endif
}
-inline static void ggml_vec_scale_f32(const int n, float * y, const float v) { for (int i = 0; i < n; ++i) y[i] *= v; }
+//inline static void ggml_vec_scale_f32(const int n, float * y, const float v) { for (int i = 0; i < n; ++i) y[i] *= v; }
+inline static void ggml_vec_scale_f32(const int n, float * y, const float v) {
+#if defined(__AVX__) || defined(__AVX2__)
+ // AVX 256-bit
+ const int n32 = (n & ~31);
+
+ const __m256 v4 = _mm256_set1_ps(v);
+
+ __m256 y0, y1, y2, y3;
+
+ for (int i = 0; i < n32; i += 32) {
+ y0 = _mm256_loadu_ps(y + i + 0);
+ y1 = _mm256_loadu_ps(y + i + 8);
+ y2 = _mm256_loadu_ps(y + i + 16);
+ y3 = _mm256_loadu_ps(y + i + 24);
+
+ y0 = _mm256_mul_ps(y0, v4);
+ y1 = _mm256_mul_ps(y1, v4);
+ y2 = _mm256_mul_ps(y2, v4);
+ y3 = _mm256_mul_ps(y3, v4);
+
+ _mm256_storeu_ps(y + i + 0, y0);
+ _mm256_storeu_ps(y + i + 8, y1);
+ _mm256_storeu_ps(y + i + 16, y2);
+ _mm256_storeu_ps(y + i + 24, y3);
+ }
+
+ // leftovers
+ for (int i = n32; i < n; ++i) {
+ y[i] *= v;
+ }
+#else
+ // scalar
+ for (int i = 0; i < n; ++i) {
+ y[i] *= v;
+ }
+#endif
+}
inline static void ggml_vec_norm_f32 (const int n, float * s, const float * x) { ggml_vec_dot_f32(n, s, x, x); *s = sqrt(*s); }
inline static void ggml_vec_sqr_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = x[i]*x[i]; }
inline static void ggml_vec_sqrt_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = sqrt(x[i]); }
@@ -1621,7 +1658,7 @@ struct ggml_tensor * ggml_new_tensor_impl(
size_needed += sizeof(struct ggml_tensor);
if (cur_end + size_needed + GGML_OBJECT_SIZE > ctx->mem_size) {
- GGML_PRINT("%s: not enough space in the context's memory pool\n", __func__);
+ GGML_PRINT("%s: not enough space in the context's memory pool (%zu/%zu needed)\n", __func__, cur_end + size_needed + GGML_OBJECT_SIZE, ctx->mem_size);
assert(false);
return NULL;
}
@@ -7010,7 +7047,7 @@ typedef int ggml_lock_t;
#define ggml_lock_init(x) UNUSED(x)
#define ggml_lock_destroy(x) UNUSED(x)
-#define ggml_lock_lock(x) UNUSED(x)
+#define ggml_lock_lock(x) Sleep(1)
#define ggml_lock_unlock(x) UNUSED(x)
#define GGML_LOCK_INITIALIZER 0
@@ -7035,6 +7072,9 @@ struct ggml_compute_state {
struct ggml_tensor * node;
struct ggml_compute_state_shared * shared;
+
+ HANDLE wait_handle;
+ HANDLE fin_handle;
};
// function used by each compute thread
@@ -7052,6 +7092,17 @@ thread_ret_t ggml_graph_compute_thread(void * data) {
const int n_threads = state->shared->n_threads;
while (true) {
+ WaitForSingleObject(state->wait_handle, INFINITE);
+ if (state->node) {
+ ggml_compute_forward(&state->params, state->node);
+ state->node = NULL;
+ SetEvent(state->fin_handle);
+ } else {
+ SetEvent(state->fin_handle);
+ break;
+ }
+
+ /*
if (atomic_fetch_add(&state->shared->n_ready, 1) == n_threads - 1) {
atomic_store(&state->shared->has_work, false);
} else {
@@ -7086,6 +7137,7 @@ thread_ret_t ggml_graph_compute_thread(void * data) {
} else {
break;
}
+ */
}
return 0;
@@ -7106,6 +7158,7 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
/*.stop =*/ false,
};
struct ggml_compute_state * workers = n_threads > 1 ? alloca(sizeof(struct ggml_compute_state)*(n_threads - 1)) : NULL;
+ HANDLE worker_handles[16];
// create thread pool
if (n_threads > 1) {
@@ -7125,7 +7178,12 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
},
.node = NULL,
.shared = &state_shared,
+ .wait_handle = CreateEvent(NULL, FALSE, FALSE, NULL),
+ .fin_handle = CreateEvent(NULL, FALSE, FALSE, NULL),
};
+
+ worker_handles[j] = workers[j].fin_handle;
+
int rc = pthread_create(&workers[j].thrd, NULL, ggml_graph_compute_thread, &workers[j]);
assert(rc == 0);
UNUSED(rc);
@@ -7345,14 +7403,14 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
// COMPUTE
if (node->n_tasks > 1) {
- if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
+ /*if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
atomic_store(&state_shared.has_work, false);
}
while (atomic_load(&state_shared.has_work)) {
ggml_lock_lock (&state_shared.spin);
ggml_lock_unlock(&state_shared.spin);
- }
+ }*/
// launch thread pool
for (int j = 0; j < n_threads - 1; j++) {
@@ -7364,16 +7422,17 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
.wdata = cgraph->work ? cgraph->work->data : NULL,
};
workers[j].node = node;
+ SetEvent(workers[j].wait_handle);
}
- atomic_fetch_sub(&state_shared.n_ready, 1);
+ /*atomic_fetch_sub(&state_shared.n_ready, 1);
while (atomic_load(&state_shared.n_ready) > 0) {
ggml_lock_lock (&state_shared.spin);
ggml_lock_unlock(&state_shared.spin);
}
- atomic_store(&state_shared.has_work, true);
+ atomic_store(&state_shared.has_work, true);*/
}
params.type = GGML_TASK_COMPUTE;
@@ -7381,7 +7440,8 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
// wait for thread pool
if (node->n_tasks > 1) {
- if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
+ WaitForMultipleObjects(n_threads - 1, worker_handles, TRUE, INFINITE);
+ /*if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
atomic_store(&state_shared.has_work, false);
}
@@ -7395,19 +7455,19 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
while (atomic_load(&state_shared.n_ready) != 0) {
ggml_lock_lock (&state_shared.spin);
ggml_lock_unlock(&state_shared.spin);
- }
+ }*/
}
// FINALIZE
if (node->n_tasks > 1) {
- if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
+ /*if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
atomic_store(&state_shared.has_work, false);
}
while (atomic_load(&state_shared.has_work)) {
ggml_lock_lock (&state_shared.spin);
ggml_lock_unlock(&state_shared.spin);
- }
+ }*/
// launch thread pool
for (int j = 0; j < n_threads - 1; j++) {
@@ -7419,16 +7479,17 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
.wdata = cgraph->work ? cgraph->work->data : NULL,
};
workers[j].node = node;
+ SetEvent(workers[j].wait_handle);
}
- atomic_fetch_sub(&state_shared.n_ready, 1);
+ /*atomic_fetch_sub(&state_shared.n_ready, 1);
while (atomic_load(&state_shared.n_ready) > 0) {
ggml_lock_lock (&state_shared.spin);
ggml_lock_unlock(&state_shared.spin);
}
- atomic_store(&state_shared.has_work, true);
+ atomic_store(&state_shared.has_work, true);*/
}
params.type = GGML_TASK_FINALIZE;
@@ -7436,7 +7497,8 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
// wait for thread pool
if (node->n_tasks > 1) {
- if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
+ WaitForMultipleObjects(n_threads - 1, worker_handles, TRUE, INFINITE);
+ /*if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
atomic_store(&state_shared.has_work, false);
}
@@ -7450,7 +7512,7 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
while (atomic_load(&state_shared.n_ready) != 0) {
ggml_lock_lock (&state_shared.spin);
ggml_lock_unlock(&state_shared.spin);
- }
+ }*/
}
// performance stats (node)
@@ -7470,6 +7532,7 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
atomic_store(&state_shared.has_work, true);
for (int j = 0; j < n_threads - 1; j++) {
+ SetEvent(workers[j].wait_handle);
int rc = pthread_join(workers[j].thrd, NULL);
assert(rc == 0);
UNUSED(rc);
@ggerganov Yes, 87dd4a30811ee07700ee6fee267508e8935b32fc is about half-a-second faster on the jfk example, I guess due to the FP16 lookup table.
@fitzsim I won't be in any position to do anything any time soon, unfortunately. (link)
Luke, I'm so sorry to hear the news. Good luck