
Slow parallelism on large number of threads

Open Narsil opened this issue 1 year ago • 0 comments

Hi,

While investigating the crate's performance, I found that enabling parallelism can be highly detrimental to performance. This only occurs on machines with a lot of cores (and therefore threads).

Here is the bench I added: https://github.com/Narsil/gemm/tree/bench_rayon

On a regular desktop (8 cores) I see:

parallelism-8-f32-nnn-gemm-6×2304×768
                        time:   [176.79 µs 182.91 µs 186.61 µs]
                        change: [-9.5273% -4.5251% -0.1034%] (p = 0.10 > 0.05)
                        No change in performance detected.

parallelism-none-f32-nnn-gemm-6×2304×768
                        time:   [685.08 µs 686.80 µs 687.87 µs]
                        change: [-1.0090% -0.5130% -0.1028%] (p = 0.04 < 0.05)
                        Change within noise threshold.

parallelism-8-f32-nnt-gemm-6×2304×768
                        time:   [433.25 µs 444.28 µs 459.26 µs]
                        change: [+9.9070% +13.388% +16.764%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

parallelism-none-f32-nnt-gemm-6×2304×768
                        time:   [1.3960 ms 1.4004 ms 1.4051 ms]
                        change: [+14.439% +15.374% +16.258%] (p = 0.00 < 0.05)
                        Performance has regressed.

Which is sort of OK: with parallelism 8 the runs are indeed ~3.5x faster (about 3.8x for nnn and 3.2x for nnt, going by the median times), so there is some speedup.

However on 48 cores:

parallelism-48-f32-nnn-gemm-6×2304×768
                        time:   [2.2364 ms 2.2723 ms 2.3164 ms]
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low mild
  1 (10.00%) high severe

parallelism-none-f32-nnn-gemm-6×2304×768
                        time:   [752.12 µs 752.97 µs 753.81 µs]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low mild

parallelism-48-f32-nnt-gemm-6×2304×768
                        time:   [2.3022 ms 2.3255 ms 2.3660 ms]

parallelism-none-f32-nnt-gemm-6×2304×768
                        time:   [789.54 µs 789.93 µs 790.39 µs]

There is a big slowdown from over-parallelism: the 48-thread runs are roughly 3x slower than the single-threaded ones.
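For reference, the ratios can be recomputed from the median times quoted above. This is only a small arithmetic sketch; the helper name `ratio` is made up for illustration and is not part of the bench.

```rust
/// Hypothetical helper: how many times larger `slow_us` is than `fast_us`.
fn ratio(slow_us: f64, fast_us: f64) -> f64 {
    slow_us / fast_us
}

fn main() {
    // Median times in microseconds, copied from the criterion output above.
    // 8-core desktop: single-threaded time divided by parallel time (speedup).
    println!("desktop nnn speedup: {:.2}x", ratio(686.80, 182.91));
    println!("desktop nnt speedup: {:.2}x", ratio(1400.4, 444.28));
    // 48-core machine: parallel time divided by single-threaded time (slowdown).
    println!("48-core nnn slowdown: {:.2}x", ratio(2272.3, 752.97));
    println!("48-core nnt slowdown: {:.2}x", ratio(2325.5, 789.93));
}
```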

The flamegraph actually shows this pretty well. [flamegraph image]

Is there anything we can do to help here? My impression is that using a simple par_chunks instead of par_iter, maybe with some length heuristics, could help spawn fewer threads when the matmul is small enough.
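To make the "length heuristics" idea concrete, here is a minimal sketch of what I have in mind. Everything in it is an assumption for illustration: the function name `suggested_threads`, the `min_flops_per_thread` parameter, and the 4M threshold are made up and would need tuning; none of this is the crate's actual API.

```rust
/// Hypothetical heuristic: cap the thread count so that each thread gets
/// at least `min_flops_per_thread` floating-point operations of work.
fn suggested_threads(
    m: usize,
    n: usize,
    k: usize,
    max_threads: usize,
    min_flops_per_thread: usize,
) -> usize {
    // Multiplying an m×k matrix by a k×n matrix costs about 2*m*n*k flops.
    let flops = 2usize
        .saturating_mul(m)
        .saturating_mul(n)
        .saturating_mul(k);
    // Number of threads the amount of work can justify (guarding against a zero threshold).
    let by_work = flops / min_flops_per_thread.max(1);
    // Always use at least one thread and never more than the allowed maximum.
    by_work.clamp(1, max_threads.max(1))
}

fn main() {
    let available = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    // The 6×2304×768 problem from the bench, with an assumed 4M-flop threshold.
    let threads = suggested_threads(6, 2304, 768, available, 4_000_000);
    println!("available = {available}, suggested = {threads}");
}
```

With these assumed numbers, the 6×2304×768 case on a 48-thread machine would get 5 threads instead of 48. The result could then be used to pick a chunk size for something like rayon's par_chunks, so that a small matmul is split into only a few pieces.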

Narsil, Jul 16 '23 14:07