LinearAlgebra.jl

Suboptimal `dgemm` OpenBLAS performance and dual-socket scaling

Open carstenbauer opened this issue 3 years ago • 5 comments

From https://github.com/carstenbauer/julia-dgemm-noctua/issues/2. (cc @ViralBShah)

I benchmarked a simple dgemm call (i.e. mul!(C,A,B)) on Noctua 1 (single- and dual-socket Intel Xeon Gold 6148 "Skylake" 20-core CPUs) for multiple BLAS libraries, each called from Julia via LBT (a minimal measurement sketch follows the list):

  • 20 cores -> single-socket
  • 40 cores -> dual-socket (full node)
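For context, a minimal sketch of this kind of measurement (not the exact benchmark script from the linked repo; matrix size and thread counts are the ones used in the table below):

using LinearAlgebra

N = 10240
A = randn(N, N); B = randn(N, N); C = zeros(N, N)

BLAS.set_num_threads(20)        # 20 -> single socket, 40 -> full dual-socket node
t = @elapsed mul!(C, A, B)      # a plain dgemm call via mul!
gflops = 2e-9 * N^3 / t         # 2*N^3 floating-point operations for a square matmul
@show gflops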

Here are the benchmark results:

BLAS                                # cores   matrix size   GFLOPS
Intel MKL v2022.0.0 (JLL)           40        10240         2081
Intel MKL v2022.0.0 (JLL)           20        10240         1054
BLIS 0.9.0 (JLL)                    40        10240         1890
BLIS 0.9.0 (JLL)                    20        10240          990
Octavian 0.3.15                     40        10240         1053
Octavian 0.3.15                     20        10240         1016
OpenBLAS (shipped with Julia 1.8)   40        10240         1092
OpenBLAS (shipped with Julia 1.8)   20        10240         1063
---------------------------------   -------   -----------   ------
OpenBLAS 0.3.17 (custom)            40        10240         1908
OpenBLAS 0.3.17 (custom)            20        10240         1439
OpenBLAS 0.3.20 (custom)            40        10240         1897
OpenBLAS 0.3.20 (custom)            20        10240         1444
---------------------------------   -------   -----------   ------
OpenBLAS 0.3.17 (JLL)               40        10240         1437
OpenBLAS 0.3.17 (JLL)               20        10240         1124
OpenBLAS 0.3.20 (JLL)               40        10240         1535
OpenBLAS 0.3.20 (JLL)               20        10240         1185

The custom OpenBLAS has been compiled with

make INTERFACE64=1 USE_THREAD=1 NO_AFFINITY=1 GEMM_MULTITHREADING_THRESHOLD=50 NO_STATIC=1 BINARY=64
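and then pointed to from Julia through LBT, roughly along these lines (a sketch; the library path below is a placeholder for wherever the custom ILP64 build is installed):

using LinearAlgebra

# Placeholder path: forward LBT to the custom ILP64 ("64_"-suffixed) OpenBLAS build.
BLAS.lbt_forward("/path/to/custom/libopenblas64_.so"; clear=true)
BLAS.get_config()          # check that the custom library is now the active backend
BLAS.set_num_threads(40)   # 40 cores -> full dual-socket node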

Primary observations/conclusions:

  • MKL and BLIS (through MKL.jl and BLISBLAS.jl) scale reasonably well from single to dual-socket, but the OpenBLAS shipped with Julia 1.8 doesn't scale at all (going from 20 to 40 cores: ~2.0x for MKL vs ~1.03x for the shipped OpenBLAS). (Octavian also doesn't scale, see https://github.com/JuliaLinearAlgebra/Octavian.jl/issues/151)
  • A custom build of OpenBLAS shows the best single-socket performance overall and scales reasonably well, so it is not just that OpenBLAS is inferior to MKL/BLIS. Perhaps we use suboptimal build options?
  • What is particularly curious is that manually doing using OpenBLAS_jll (0.3.17 and 0.3.20) leads to strictly better performance than the default/shipped OpenBLAS, both in absolute numbers and in scaling (though still worse than a custom build of OpenBLAS). How is the default integration of OpenBLAS_jll different from just manually doing using OpenBLAS_jll and BLAS.lbt_forward(...; clear=true)? (A sketch of the manual route is shown right below this list.)
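For reference, a minimal sketch of that manual OpenBLAS_jll route (it assumes the JLL exposes the full library path as OpenBLAS_jll.libopenblas_path, as current JLL packages do):

using LinearAlgebra
import OpenBLAS_jll        # e.g. with the project pinned to 0.3.17 or 0.3.20

BLAS.lbt_forward(OpenBLAS_jll.libopenblas_path; clear=true)
BLAS.get_config()          # should now report the JLL's libopenblas as the LBT backend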

I hope we can improve the default OpenBLAS performance and scaling.

carstenbauer avatar Jul 21 '22 11:07 carstenbauer

@chriselrod Are you able to reproduce this on some of our machines?

ViralBShah avatar Jul 21 '22 21:07 ViralBShah

@chriselrod Are you able to reproduce this on some of our machines?

Not really on deapsea2, which I believe is 32 cores x 2 sockets:

julia> using LinearAlgebra

julia> M = K = N = 16_000;

julia> A = Matrix{Float64}(undef, M, K); B = Matrix{Float64}(undef, K, N); C = Matrix{Float64}(undef, M, N);

julia> let A=A; Threads.@threads for i = eachindex(A); @inbounds A[i] = randn(); end; end;

julia> let A=B; Threads.@threads for i = eachindex(A); @inbounds A[i] = randn(); end; end;

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
996.8121355318401

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1052.3059335030646

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1158.2279430700944

julia> BLAS.set_num_threads(64);

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1424.0752368319172

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1597.7102493574532

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1599.5668391744443

julia> using MKL

julia> BLAS.get_num_threads()
64

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1566.9680236220754

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1688.7183359779663

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1694.5279138377305

julia> BLAS.set_num_threads(32);

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1168.313201204457

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1151.4661029505437

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1112.8236207523348

MKL and OpenBLAS both show about the same speedup. I used Threads.@threads to initialize the arrays because of first-touch NUMA placement (each page lands in the memory of the socket whose thread first writes it).

On deapsea2, Octavian thinks we only have 32 cores, and thus refuses to use more than 32 threads. Switching back to hwloc instead of CpuId would fix this:

julia> using Hwloc

julia> Hwloc.num_numa_nodes()
8

julia> Hwloc.num_physical_cores()
64

julia> Hwloc.num_packages()
2

If I recall correctly, we no longer need to support WINE, so I think switching back would be okay.

CpuId seems to figure out we have 32 cores per socket

julia> using CpuId, Octavian

julia> CpuId.cpucores()
32

julia> Octavian.num_cores()
static(32)

but doesn't realize we have two of them.

julia> CpuId.cpunodes()
1

julia> CpuId.cpucores_total()
32
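For comparison, a minimal sketch of deriving the full core count from the Hwloc.jl helpers shown above instead:

using Hwloc

total_cores      = Hwloc.num_physical_cores()            # 64 on deapsea2
cores_per_socket = total_cores ÷ Hwloc.num_packages()    # 32 per socket, 2 sockets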

BTW, @carstenbauer also gave me access to the cluster he's running on, but I've been procrastinating on actually signing on and poking around.

chriselrod avatar Jul 22 '22 02:07 chriselrod

julia> using LinearAlgebra

julia> BLAS.get_num_threads()
64

julia> M = K = N = 16_000;

julia> A = Matrix{Float64}(undef, M, K); B = Matrix{Float64}(undef, K, N); C = Matrix{Float64}(undef, M, N);

julia> let A=A; Threads.@threads for i = eachindex(A); @inbounds A[i] = randn(); end; end;

julia> let A=B; Threads.@threads for i = eachindex(A); @inbounds A[i] = randn(); end; end;

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
982.1094539637073

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1016.5287429664286

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
993.1180919251972

julia> BLAS.set_num_threads(128);

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1253.0140262394689

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1224.9438668463697

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1216.3002770861356

julia> using MKL

julia> BLAS.get_num_threads()
64

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1299.188240131717

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1383.8168443837112

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1309.8513547840373

julia> BLAS.set_num_threads(128);

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1340.979151645549

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1308.8692178797796

julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1325.0394458234089

julia> BLAS.get_num_threads()
64

This time, CpuId gets it right, and Hwloc gets it wrong???

julia> using Hwloc

julia> Hwloc.num_physical_cores()
64

julia> Hwloc.num_packages()
1

julia> using CpuId

julia> CpuId.cpucores()
64

julia> CpuId.cpucores_total()
128

chriselrod avatar Jul 22 '22 03:07 chriselrod

FWIW, I ran (some of) the same benchmarks on a qualitatively different system (Noctua 2, AMD Milan Zen 3, 2x 64 cores) and also found that the default OpenBLAS doesn't scale. See https://github.com/carstenbauer/julia-dgemm-noctua#noctua-2. (Chris has access to both Noctua 1 and Noctua 2.)

(I might be able to add results from clusters at other HPC centers soon.)

carstenbauer avatar Jul 22 '22 05:07 carstenbauer

@chriselrod Which system was this run on? If it's a Noctua 2 login node, be aware that the login nodes of Noctua 2 are special in that they have only 1x64 physical cores but HT enabled (i.e. 128 virtual cores). OTOH, compute nodes have 2x64 physical cores and HT disabled (i.e. 128 physical == virtual cores).
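A quick way to tell the two node types apart from within Julia (a sketch using Base plus the Hwloc.jl helpers shown earlier in the thread):

using Hwloc

sockets        = Hwloc.num_packages()                # 1 on a Noctua 2 login node, 2 on a compute node
physical_cores = Hwloc.num_physical_cores()          # 64 in both cases
ht_enabled     = Sys.CPU_THREADS > physical_cores    # true on login nodes (128 > 64), false on compute nodes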

Also, for completeness, which Julia version and how many Julia threads are you using?

carstenbauer avatar Jul 22 '22 06:07 carstenbauer

Is there any news on this issue? I've also encountered somewhat disappointing scaling behavior on two different HPC clusters, using Julia 1.11.6 with both OpenBLAS and MKL.

m-schwendt avatar Aug 08 '25 14:08 m-schwendt