gemm
Slow parallelism on large number of threads
Hi,
While investigating the crate's performance, I found that enabling parallelism can be highly detrimental to performance. This only occurs on machines with a lot of cores (and therefore threads).
Here is the bench I added: https://github.com/Narsil/gemm/tree/bench_rayon
On a regular desktop (8 cores) I see:
parallelism-8-f32-nnn-gemm-6×2304×768
time: [176.79 µs 182.91 µs 186.61 µs]
change: [-9.5273% -4.5251% -0.1034%] (p = 0.10 > 0.05)
No change in performance detected.
parallelism-none-f32-nnn-gemm-6×2304×768
time: [685.08 µs 686.80 µs 687.87 µs]
change: [-1.0090% -0.5130% -0.1028%] (p = 0.04 < 0.05)
Change within noise threshold.
parallelism-8-f32-nnt-gemm-6×2304×768
time: [433.25 µs 444.28 µs 459.26 µs]
change: [+9.9070% +13.388% +16.764%] (p = 0.00 < 0.05)
Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high mild
parallelism-none-f32-nnt-gemm-6×2304×768
time: [1.3960 ms 1.4004 ms 1.4051 ms]
change: [+14.439% +15.374% +16.258%] (p = 0.00 < 0.05)
Performance has regressed.
Which is sort of OK: parallelism 8 is indeed ~3.5x faster, so there is some speedup.
However, on 48 cores:
parallelism-48-f32-nnn-gemm-6×2304×768
time: [2.2364 ms 2.2723 ms 2.3164 ms]
Found 2 outliers among 10 measurements (20.00%)
1 (10.00%) low mild
1 (10.00%) high severe
parallelism-none-f32-nnn-gemm-6×2304×768
time: [752.12 µs 752.97 µs 753.81 µs]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) low mild
parallelism-48-f32-nnt-gemm-6×2304×768
time: [2.3022 ms 2.3255 ms 2.3660 ms]
parallelism-none-f32-nnt-gemm-6×2304×768
time: [789.54 µs 789.93 µs 790.39 µs]
There is a big slowdown from over-parallelism: the 48-thread runs are roughly 3x slower than the non-parallel ones.
The flamegraph actually shows this pretty well.
Is there anything we can do to help here?
I'm under the impression that using a simple `par_chunks` instead of `par_iter`, with maybe some length heuristics, could help spawn fewer threads when the matmul is small enough.
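To make the idea concrete, here is a rough sketch of what I mean by a length heuristic plus chunking. It is not the crate's actual code: it uses only `std::thread::scope` instead of rayon so it stands alone, the names `suggested_threads`, `matmul_chunked` and `MIN_WORK_PER_THREAD` are hypothetical, and the threshold value is an untuned guess.

```rust
use std::thread;

/// Hypothetical threshold: minimum number of multiply-adds one thread
/// should get before another thread is worth spawning. Untuned guess.
const MIN_WORK_PER_THREAD: usize = 1 << 20;

/// Pick a thread count from the problem size, capped at `max_threads`.
fn suggested_threads(m: usize, n: usize, k: usize, max_threads: usize) -> usize {
    let work = m.saturating_mul(n).saturating_mul(k);
    (work / MIN_WORK_PER_THREAD).clamp(1, max_threads.max(1))
}

/// Naive column-major matmul, dst (m×n) = lhs (m×k) * rhs (k×n).
/// The output columns are split into one contiguous chunk per thread,
/// so at most `suggested_threads(..)` threads are spawned.
fn matmul_chunked(
    dst: &mut [f32],
    lhs: &[f32],
    rhs: &[f32],
    m: usize,
    n: usize,
    k: usize,
    max_threads: usize,
) {
    assert_eq!(dst.len(), m * n);
    assert_eq!(lhs.len(), m * k);
    assert_eq!(rhs.len(), k * n);
    if m == 0 || n == 0 {
        return;
    }

    let threads = suggested_threads(m, n, k, max_threads);
    // Ceiling division: number of output columns handled by each thread.
    let cols_per_chunk = (n + threads - 1) / threads;

    thread::scope(|s| {
        for (chunk_idx, dst_chunk) in dst.chunks_mut(cols_per_chunk * m).enumerate() {
            let first_col = chunk_idx * cols_per_chunk;
            s.spawn(move || {
                for (j_local, dst_col) in dst_chunk.chunks_mut(m).enumerate() {
                    let j = first_col + j_local;
                    for i in 0..m {
                        let mut acc = 0.0f32;
                        for p in 0..k {
                            acc += lhs[i + p * m] * rhs[p + j * k];
                        }
                        dst_col[i] = acc;
                    }
                }
            });
        }
    });
}
```

With the shape from the benchmarks above (6×2304×768, about 10.6M multiply-adds), this particular threshold would cap the 48-core machine at 10 threads and leave the 8-core desktop at 8. The right threshold would have to come from measurements.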