Faisal Zaghloul

13 comments by Faisal Zaghloul

Here are some perf figures.

On W-2225 Xeon machine, CPU backend:

| CPU | Model | Test | t/s master | t/s threadpool | Speedup |
|:--------------------------------------|:--------------|:-------|-------------:|-----------------:|----------:|
| Intel(R) Xeon(R)...

@slaren Threadpool is back! Updated it a bit to be aligned with the latest graph-compute design. The current performance is largely on par with OpenMP. Please lmk if you have...

> I tried to test this on macOS, but it seems to deadlock.

Fixed!

On M2 Max (GGML_NO_METAL=1 GGML_NO_ACCELERATE=1):

| CPU | Model | Threads | Test | t/s master | t/s threadpool | Speedup |
|:------|:--------------|----------:|:-------|-------------:|-----------------:|----------:|
| | llama 7B Q4_0 | 4...

Same thing, but with llama-v3 8B Q4_0_4_4 (for some reason my compiler AppleClang15 doesn't support INT8 matmul?)

| CPU | Model | Threads | Test | t/s master | t/s...

@slaren lmk if it works for you this time

> I tested this again on the M3 Max, but it still seems to deadlock. These are the call stacks of the threads:
>
> ```
> (lldb) bt all...
> ```

> > I tested this again on the M3 Max, but it still seems to deadlock. These are the call stacks of the threads:
> >
> > ```
> > (lldb)...
> > ```

> Thanks, I was able to run it now. Unfortunately the results are still not very good on my system. Under WSL this threadpool is much slower than OpenMP. A...

Edit: I totally forgot that GGML_OPENMP is disabled only for CMake builds... So the numbers below are OpenMP only. (Interesting that there is any change at all...)

@slaren @max-krasnyansky latest...