multi-threaded parallel calls to `LinearAlgebra.mul!` are slower than serial calls
from: https://discourse.julialang.org/t/multi-threading-of-julia-1-8-5-does-not-improve-speed-when-combined-with-blas/97568/12
This is the minimal example I arrived at. It consists of calling mul! within a multi-threaded loop:
import Pkg
Pkg.activate(".")
using LinearAlgebra
using BenchmarkTools
using Random
BLAS.set_num_threads(1)
function test()
    nMax = 1e2
    mMax = 1e3
    s = zeros(Threads.nthreads())
    Threads.@threads for i in 1:nMax
        tmp = 0.0
        a = rand(10,10)
        b = similar(a)
        for m = 1:mMax
            mul!(b, a, a)
            tmp = tmp + sum(b)
        end
        s[Threads.threadid()] = tmp
    end
    return sum(s)
end
@btime test()
using MKL
@btime test()
When running with a single thread, one gets:
% julia -t 1 code.jl
Activating project at `~/Downloads`
29.816 ms (208 allocations: 175.73 KiB)
28.485 ms (208 allocations: 175.73 KiB)
meaning that mul! backed by the default OpenBLAS and by MKL are similar in performance.
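For reference, one can check which BLAS library mul! is currently dispatching to:
using LinearAlgebra
BLAS.get_config()   # lists the libraries loaded via libblastrampoline (OpenBLAS by default, MKL after `using MKL`)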
With multi-threading, one gets:
% julia -t 4 code.jl
Activating project at `~/Downloads`
43.085 ms (226 allocations: 177.53 KiB)
10.046 ms (226 allocations: 177.53 KiB)
The MKL version is faster, as expected, but the default LinearAlgebra.mul! version is now slower than the serial run.
I've run this on 1.9.0-rc1 and 1.8.5 and the results are the same.
Try forcing BLAS to use only one thread. You might be oversubscribing the CPU.
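That is, something along these lines, so that each Julia thread drives a single-threaded BLAS call:
using LinearAlgebra
@show Threads.nthreads()       # number of Julia threads
@show BLAS.get_num_threads()   # BLAS threads used inside each mul! call
BLAS.set_num_threads(1)        # otherwise up to nthreads() × BLAS-threads cores get requested at once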
Isn't that what
BLAS.set_num_threads(1)
does?
By the way, the CPU usage reported by top is the expected one, that is, 100% for -t1 and 400% for -t4 (easy to check with a larger mMax).
Ok, I read too fast.
Note that there's one issue in this code: importing MKL resets the number of BLAS threads (a different environment variable controls it), so one needs to call BLAS.set_num_threads(1) again after loading MKL; see the short snippet after my timings below.
FWIW I obtain reasonable scaling using only OpenBLAS:
(base) jishnu:temp/ $ julia -t 1 code.jl
Activating project at `~/temp`
BLAS.get_num_threads() = 1
17.956 ms (208 allocations: 175.73 KiB)
(base) jishnu:temp/ $ julia -t 4 code.jl
Activating project at `~/temp`
BLAS.get_num_threads() = 1
4.520 ms (225 allocations: 177.50 KiB)
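To illustrate the point about MKL resetting the thread count, a minimal sketch using only the documented LinearAlgebra.BLAS calls:
using LinearAlgebra
BLAS.set_num_threads(1)
@show BLAS.get_num_threads()   # 1, applies to the OpenBLAS backend
using MKL                      # swaps the BLAS backend via libblastrampoline
@show BLAS.get_num_threads()   # MKL comes up with its own default, not the 1 set above
BLAS.set_num_threads(1)        # so it has to be set again after loading MKL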
Although, strangely, MKL seems slower for me. Using one Julia thread:
BLAS.get_num_threads() = 1
BLAS.get_config() = LBTConfig([ILP64] libopenblas64_.so)
17.950 ms (208 allocations: 175.73 KiB)
BLAS.get_num_threads() = 1
BLAS.get_config() = LBTConfig([ILP64] libmkl_rt.so, [LP64] libmkl_rt.so)
34.267 ms (208 allocations: 175.73 KiB)
and using 4 Julia threads:
BLAS.get_num_threads() = 1
BLAS.get_config() = LBTConfig([ILP64] libopenblas64_.so)
4.572 ms (226 allocations: 177.53 KiB)
BLAS.get_num_threads() = 1
BLAS.get_config() = LBTConfig([ILP64] libmkl_rt.so, [LP64] libmkl_rt.so)
8.315 ms (225 allocations: 177.50 KiB)
The code:
import Pkg
Pkg.activate(".")
using LinearAlgebra
using BenchmarkTools
using Random
BLAS.set_num_threads(1)
function test()
    nMax = 1e2
    mMax = 1e3
    s = zeros(Threads.nthreads())
    Threads.@threads for i in 1:nMax
        tmp = 0.0
        a = rand(10,10)
        b = similar(a)
        for m = 1:mMax
            mul!(b, a, a)
            tmp = tmp + sum(b)
        end
        s[Threads.threadid()] = tmp
    end
    return sum(s)
end
@show BLAS.get_num_threads()
@show BLAS.get_config()
@btime test()
using MKL
BLAS.set_num_threads(1)
@show BLAS.get_num_threads()
@show BLAS.get_config()
@btime test()
I'm benchmarking BLAS before loading MKL there.
Also, I added MKL to the test just for comparison; the issue is independent of it.
The issue may be dependent on the platform. I'm using Linux x86_64 here.
I'm using the same platform:
julia> versioninfo()
Julia Version 1.9.0-rc2
Commit 72aec423c2a (2023-04-01 10:41 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, tigerlake)
  Threads: 1 on 8 virtual cores
Environment:
  LD_LIBRARY_PATH = :/usr/lib/x86_64-linux-gnu/gtk-3.0/modules
  JULIA_EDITOR = subl
Some more detailed results, now without the MKL stuff mixed in. The code I ran is this one:
import Pkg
Pkg.activate(temp=true)
Pkg.add(["BenchmarkTools"])
using Random
using BenchmarkTools
using LinearAlgebra
BLAS.set_num_threads(1)
function test(local_mul!::F) where F<:Function
    nMax = 1e2
    mMax = 1e3
    s = zeros(Threads.nthreads())
    Threads.@threads for i in 1:nMax
        tmp = 0.0
        a = rand(10,10)
        b = similar(a)
        for m = 1:mMax
            local_mul!(b, a, a)
            tmp = tmp + sum(b)
        end
        s[Threads.threadid()] = tmp
    end
    return sum(s)
end
@btime test($(LinearAlgebra.mul!))
It was run here with 1 or 4 threads, with the following command lines:
julia --startup-file=no -t1 code.jl
# or
julia --startup-file=no -t4 code.jl
with:
julia> versioninfo()
Julia Version 1.9.0-rc1
Commit 3b2e0d8fbc1 (2023-03-07 07:51 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores
The problem is specifically with the stock LinearAlgebra (OpenBLAS) backend:
LinearAlgebra:
30.146 ms (208 allocations: 175.73 KiB) # -t1; top reports 100% CPU
52.320 ms (226 allocations: 177.53 KiB) # -t4; top reports 400% CPU
So running multithreaded makes the code run more slowly.
Note: commenting out the BLAS.set_num_threads(1) line, or setting it to 4, does not make any difference.
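If one wants to rule out the thread setting entirely, the BLAS thread count can also be pinned from the environment as a cross-check (OPENBLAS_NUM_THREADS is the standard OpenBLAS variable; MKL reads MKL_NUM_THREADS instead), e.g.:
OPENBLAS_NUM_THREADS=1 julia --startup-file=no -t4 code.jl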
Just for comparison: the same code, removing the BLAS.set_num_threads(1), and installing and loading either MKL or Octavian, with the appropriate changes to call the corresponding mul! function (a sketch of those changes is below the timings):
MKL:
26.823 ms (208 allocations: 175.73 KiB) # -t1
9.372 ms (226 allocations: 177.53 KiB) # -t4
Octavian:
8.109 ms (208 allocations: 175.73 KiB) # -t1
2.807 ms (225 allocations: 177.50 KiB) # -t4
(The Octavian performance is somewhat surprising here; I'm posting it just because it may be related to how each backend interacts with the threading, but it may well be unrelated.)
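For concreteness, the backend swap amounts to something like this (a sketch, assuming the parameterized test(local_mul!) from above; Octavian.matmul! takes the same (C, A, B) arguments as mul!):
import Pkg
Pkg.add(["Octavian", "MKL"])
using LinearAlgebra, BenchmarkTools
using Octavian
@btime test($(Octavian.matmul!))    # Octavian's pure-Julia matmul!, same (C, A, B) calling convention as mul!
using MKL                           # MKL has no separate function: it redirects mul! itself through libblastrampoline
@btime test($(LinearAlgebra.mul!))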