multi-threaded parallel calls to `LinearAlgebra.mul!` are slower than serial calls
from: https://discourse.julialang.org/t/multi-threading-of-julia-1-8-5-does-not-improve-speed-when-combined-with-blas/97568/12
This is the minimal example I arrived at. It consists of calling mul! within a multi-threaded loop:
import Pkg
Pkg.activate(".")
using LinearAlgebra
using BenchmarkTools
using Random
BLAS.set_num_threads(1)
function test()
    nMax = 1e2
    mMax = 1e3
    s = zeros(Threads.nthreads())
    Threads.@threads for i in 1:nMax
        tmp = 0.0
        a = rand(10,10)
        b = similar(a)
        for m = 1:mMax
            mul!(b, a, a)
            tmp = tmp + sum(b)
        end
        s[Threads.threadid()] = tmp
    end
    return sum(s)
end
@btime test()
using MKL
@btime test()
When running with a single thread, one gets:
% julia -t 1 code.jl
Activating project at `~/Downloads`
29.816 ms (208 allocations: 175.73 KiB)
28.485 ms (208 allocations: 175.73 KiB)
meaning that mul! backed by the default OpenBLAS and by MKL are similar in performance.
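For reference, one can check which BLAS library mul! is currently dispatching to:
using LinearAlgebra
BLAS.get_config()   # lists the libraries loaded via libblastrampoline (OpenBLAS by default, MKL after `using MKL`)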
With multi-threading, one gets:
% julia -t 4 code.jl
Activating project at `~/Downloads`
43.085 ms (226 allocations: 177.53 KiB)
10.046 ms (226 allocations: 177.53 KiB)
The MKL version is faster, as expected, but the default LinearAlgebra.mul! version is now slower than the serial run.
I've run this on 1.9.0-rc1 and 1.8.5 and the results are the same.
Try forcing BLAS to use only one thread. You might be oversubscribing the CPU.
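That is, something along these lines, so that each Julia thread drives a single-threaded BLAS call:
using LinearAlgebra
@show Threads.nthreads()       # number of Julia threads
@show BLAS.get_num_threads()   # BLAS threads used inside each mul! call
BLAS.set_num_threads(1)        # otherwise up to nthreads() × BLAS-threads cores get requested at once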
Isn't that what
BLAS.set_num_threads(1)
does?
By the way, the CPU usage reported by top is the expected one, that is, 100% for -t1 and 400% for -t4 (easy to check with a larger mMax).
Ok, I read too fast.
Note that there's one issue in this code: importing MKL resets the number of BLAS threads (a different environment variable controls it), so one needs to call BLAS.set_num_threads(1) again after loading MKL; see the short snippet after my timings below.
FWIW I obtain reasonable scaling using only OpenBLAS:
(base) jishnu:temp/ $ julia -t 1 code.jl
Activating project at `~/temp`
BLAS.get_num_threads() = 1
17.956 ms (208 allocations: 175.73 KiB)
(base) jishnu:temp/ $ julia -t 4 code.jl
Activating project at `~/temp`
BLAS.get_num_threads() = 1
4.520 ms (225 allocations: 177.50 KiB)
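To illustrate the point about MKL resetting the thread count, a minimal sketch using only the documented LinearAlgebra.BLAS calls:
using LinearAlgebra
BLAS.set_num_threads(1)
@show BLAS.get_num_threads()   # 1, applies to the OpenBLAS backend
using MKL                      # swaps the BLAS backend via libblastrampoline
@show BLAS.get_num_threads()   # MKL comes up with its own default, not the 1 set above
BLAS.set_num_threads(1)        # so it has to be set again after loading MKL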
Although, strangely, MKL seems slower for me. Using one Julia thread:
BLAS.get_num_threads() = 1
BLAS.get_config() = LBTConfig([ILP64] libopenblas64_.so)
17.950 ms (208 allocations: 175.73 KiB)
BLAS.get_num_threads() = 1
BLAS.get_config() = LBTConfig([ILP64] libmkl_rt.so, [LP64] libmkl_rt.so)
34.267 ms (208 allocations: 175.73 KiB)
and using 4 Julia threads:
BLAS.get_num_threads() = 1
BLAS.get_config() = LBTConfig([ILP64] libopenblas64_.so)
4.572 ms (226 allocations: 177.53 KiB)
BLAS.get_num_threads() = 1
BLAS.get_config() = LBTConfig([ILP64] libmkl_rt.so, [LP64] libmkl_rt.so)
8.315 ms (225 allocations: 177.50 KiB)
The code:
import Pkg
Pkg.activate(".")
using LinearAlgebra
using BenchmarkTools
using Random
BLAS.set_num_threads(1)
function test()
    nMax = 1e2
    mMax = 1e3
    s = zeros(Threads.nthreads())
    Threads.@threads for i in 1:nMax
        tmp = 0.0
        a = rand(10,10)
        b = similar(a)
        for m = 1:mMax
            mul!(b, a, a)
            tmp = tmp + sum(b)
        end
        s[Threads.threadid()] = tmp
    end
    return sum(s)
end
@show BLAS.get_num_threads()
@show BLAS.get_config()
@btime test()
using MKL
BLAS.set_num_threads(1)
@show BLAS.get_num_threads()
@show BLAS.get_config()
@btime test()
I'm benchmarking BLAS before loading MKL there.
Also, I added MKL to the test just for comparison; the issue is independent of it.
The issue may be dependent on the platform. I'm using Linux x86_64 here.
I'm using the same platform:
julia> versioninfo()
Julia Version 1.9.0-rc2
Commit 72aec423c2a (2023-04-01 10:41 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, tigerlake)
  Threads: 1 on 8 virtual cores
Environment:
  LD_LIBRARY_PATH = :/usr/lib/x86_64-linux-gnu/gtk-3.0/modules
  JULIA_EDITOR = subl
Some more detailed results, now without the MKL stuff mixed in. The code I ran is this one:
import Pkg
Pkg.activate(temp=true)
Pkg.add(["BenchmarkTools"])
using Random
using BenchmarkTools
using LinearAlgebra
BLAS.set_num_threads(1)
function test(local_mul!::F) where F<:Function
    nMax = 1e2
    mMax = 1e3
    s = zeros(Threads.nthreads())
    Threads.@threads for i in 1:nMax
        tmp = 0.0
        a = rand(10,10)
        b = similar(a)
        for m = 1:mMax
            local_mul!(b, a, a)
            tmp = tmp + sum(b)
        end
        s[Threads.threadid()] = tmp
    end
    return sum(s)
end
@btime test($(LinearAlgebra.mul!))
It was run here with 1 or 4 threads, with the following command lines:
julia --startup-file=no -t1 code.jl
# or
julia --startup-file=no -t4 code.jl
with:
julia> versioninfo()
Julia Version 1.9.0-rc1
Commit 3b2e0d8fbc1 (2023-03-07 07:51 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores
The problem is specifically with the stock LinearAlgebra (OpenBLAS) backend:
LinearAlgebra:
30.146 ms (208 allocations: 175.73 KiB) # -t1; top reports 100% CPU
52.320 ms (226 allocations: 177.53 KiB) # -t4; top reports 400% CPU
So running multithreaded makes the code run more slowly.
Note: commenting out the BLAS.set_num_threads(1) line, or setting it to 4, does not make any difference.
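If one wants to rule out the thread setting entirely, the BLAS thread count can also be pinned from the environment as a cross-check (OPENBLAS_NUM_THREADS is the standard OpenBLAS variable; MKL reads MKL_NUM_THREADS instead), e.g.:
OPENBLAS_NUM_THREADS=1 julia --startup-file=no -t4 code.jl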
Just for comparison: the same code, removing the BLAS.set_num_threads(1), and installing and loading either MKL or Octavian, with the appropriate changes to call the corresponding mul! function (a sketch of those changes is below the timings):
MKL:
26.823 ms (208 allocations: 175.73 KiB) # -t1
9.372 ms (226 allocations: 177.53 KiB) # -t4
Octavian:
8.109 ms (208 allocations: 175.73 KiB) # -t1
2.807 ms (225 allocations: 177.50 KiB) # -t4
(The Octavian performance is somewhat surprising here; I'm posting it just because it may be related to how each backend interacts with the threading, but it may well be unrelated.)
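For concreteness, the backend swap amounts to something like this (a sketch, assuming the parameterized test(local_mul!) from above; Octavian.matmul! takes the same (C, A, B) arguments as mul!):
import Pkg
Pkg.add(["Octavian", "MKL"])
using LinearAlgebra, BenchmarkTools
using Octavian
@btime test($(Octavian.matmul!))    # Octavian's pure-Julia matmul!, same (C, A, B) calling convention as mul!
using MKL                           # MKL has no separate function: it redirects mul! itself through libblastrampoline
@btime test($(LinearAlgebra.mul!))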