Faisal Zaghloul
Here are some perf figures:

On W-2225 Xeon machine:

CPU backend:

| CPU | Model | Test | t/s master | t/s threadpool | Speedup |
|:--------------------------------------|:--------------|:-------|-------------:|-----------------:|----------:|
| Intel(R) Xeon(R)...
@slaren Threadpool is back! Updated it a bit to be aligned with the latest graph-compute design. The current performance is largely on par with OpenMP. Please lmk if you have...
> I tried to test this on macOS, but it seems to deadlock.

Fixed!
On M2 Max (`GGML_NO_METAL=1 GGML_NO_ACCELERATE=1`):

| CPU | Model | Threads | Test | t/s master | t/s threadpool | Speedup |
|:------|:--------------|----------:|:-------|-------------:|-----------------:|----------:|
| | llama 7B Q4_0 | 4...
Same thing, but with llama-v3 8B Q4_0_4_4 (for some reason my compiler, AppleClang15, doesn't support INT8 matmul?):

| CPU | Model | Threads | Test | t/s master | t/s...
@slaren lmk if it works for you this time
> I tested this again on the M3 Max, but it still seems to deadlock. These are the call stacks of the threads: > > ``` > (lldb) bt all...
> > I tested this again on the M3 Max, but it still seems to deadlock. These are the call stacks of the threads: > > ``` > > (lldb)...
> Thanks, I was able to run it now. Unfortunately the results are still not very good on my system. Under WSL this threadpool is much slower than OpenMP. A...
Edit: I totally forgot that GGML_OPENMP is disabled only for CMake builds, so the numbers below are OpenMP only. (Interesting that there is any change at all...)

@slaren @max-krasnyansky latest...