OrdinaryDiffEq.jl
Performance drop in Julia 1.6
Hi, recently I noticed that my simulation runs slower than before. Consider the following MWE:
using BenchmarkTools
using OrdinaryDiffEq
using Base.Threads
using Random
N = 512
Random.seed!(42)
u = rand(2N-1, N)
du = similar(u)
function foo(du, u, p, t)
@threads for i ∈ eachindex(u)
du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
end
end
prob = ODEProblem{true}(foo, u, (0.0, 0.01));
@btime foo(du, u, nothing, 1.0);
@btime solve(prob, Tsit5());
I tested on a workstation with two Intel(R) Xeon(R) Gold 6136 CPU @ 3.00GHz (2 * 12 CPUs).
Julia 1.4.2
12-threads
4.248 ms (63 allocations: 8.59 KiB)
457.320 ms (3781 allocations: 200.29 MiB)
24-threads
2.185 ms (123 allocations: 17.13 KiB)
324.000 ms (7273 allocations: 200.77 MiB)
Julia 1.5.4
12-threads
4.166 ms (61 allocations: 8.80 KiB)
444.515 ms (3650 allocations: 200.30 MiB)
24-threads
2.115 ms (121 allocations: 17.52 KiB)
334.617 ms (7122 allocations: 200.79 MiB)
Julia 1.6.1
12-threads
3.896 ms (61 allocations: 5.42 KiB)
727.287 ms (3721 allocations: 200.12 MiB)
24-threads
2.160 ms (121 allocations: 10.77 KiB)
515.827 ms (7194 allocations: 200.42 MiB)
Although the ODE function call itself benchmarks no worse on Julia 1.6, the total solve time is much longer (roughly 1.6x at both thread counts). I am not sure whether this is related to an upstream issue. There seems to be extra overhead beyond the ODE function evaluation.
Thank you!
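One way to check how much of the solve time is spent outside the ODE function is to count the right-hand-side calls during a solve and then time the same number of bare calls. The following is only a sketch under stated assumptions: `ncalls` and `foo_counted` are hypothetical names that are not part of the MWE, and `save_everystep = false` is used to keep saving costs out of the comparison.

```julia
using OrdinaryDiffEq
using Base.Threads

const ncalls = Ref(0)   # hypothetical call counter, not part of the original MWE

function foo_counted(du, u, p, t)
    ncalls[] += 1       # incremented outside the threaded loop, so there is no race
    @threads for i in eachindex(u)
        du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
    end
    return nothing
end

N = 512
u0 = rand(2N - 1, N)
du = similar(u0)
prob = ODEProblem{true}(foo_counted, u0, (0.0, 0.01))

solve(prob, Tsit5(), save_everystep = false)   # warm-up run, so compilation is not timed

ncalls[] = 0
t_solve = @elapsed solve(prob, Tsit5(), save_everystep = false)
n = ncalls[]                                   # number of RHS evaluations in one solve

# Time the same number of bare RHS calls, without any solver work around them.
t_rhs = @elapsed for _ in 1:n
    foo_counted(du, u0, nothing, 0.0)
end

println("RHS calls per solve: ", n)
println("solve: ", t_solve, " s, RHS only: ", t_rhs, " s")
```

If `t_solve - t_rhs` grows between Julia versions while `t_rhs` stays flat, the regression is in the solver's own array operations rather than in `foo`.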
Which version of OrdinaryDiffEq.jl were you using? Share your manifest.
Also, try solve(prob, Tsit5(), save_everystep=false)
@chriselrod did you see similar behavior? IIRC you were the one looking at the updated performance.
Hi @ChrisRackauckas,
- Julia 1.4.2 - OrdinaryDiffEq v5.55.1
(@v1.4) pkg> st --manifest OrdinaryDiffEq
Status `~/.julia/environments/v1.4/Manifest.toml`
[79e6a3ab] Adapt v3.3.0
[4fba245c] ArrayInterface v3.1.15
[864edb3b] DataStructures v0.18.9
[2b5f629d] DiffEqBase v6.62.2
[ffbed154] DocStringExtensions v0.8.4
[d4d017d3] ExponentialUtilities v1.8.4
[9aa1b823] FastClosures v0.3.2
[6a86dc24] FiniteDiff v2.8.0
[f6369f11] ForwardDiff v0.10.18
[1914dd2f] MacroTools v0.5.6
[46d2c3a1] MuladdMacro v0.2.2
[2774e3e8] NLsolve v4.5.1
[1dea7af3] OrdinaryDiffEq v5.55.1
[731186ca] RecursiveArrayTools v2.11.4
[189a3867] Reexport v1.0.0
[47a9eef4] SparseDiffTools v1.13.2
[90137ffa] StaticArrays v1.2.0
[3a884ed6] UnPack v1.0.2
[37e2e46d] LinearAlgebra
[56ddb016] Logging
[2f01184e] SparseArrays
- Julia 1.5.4 - OrdinaryDiffEq v5.56.0 (st --manifest does not work, I used dev and st)
(OrdinaryDiffEq) pkg> st
Project OrdinaryDiffEq v5.56.0
Status `~/.julia/dev/OrdinaryDiffEq/Project.toml`
[79e6a3ab] Adapt v3.3.0
[4fba245c] ArrayInterface v3.1.15
[864edb3b] DataStructures v0.18.9
[2b5f629d] DiffEqBase v6.62.2
[ffbed154] DocStringExtensions v0.8.4
[d4d017d3] ExponentialUtilities v1.8.4
[9aa1b823] FastClosures v0.3.2
[6a86dc24] FiniteDiff v2.8.0
[f6369f11] ForwardDiff v0.10.18
[1914dd2f] MacroTools v0.5.6
[46d2c3a1] MuladdMacro v0.2.2
[2774e3e8] NLsolve v4.5.1
[f517fe37] Polyester v0.3.1
[731186ca] RecursiveArrayTools v2.11.4
[189a3867] Reexport v1.0.0
[47a9eef4] SparseDiffTools v1.13.2
[90137ffa] StaticArrays v1.2.0
[3a884ed6] UnPack v1.0.2
[37e2e46d] LinearAlgebra
[56ddb016] Logging
[2f01184e] SparseArrays
- Julia 1.6.1 - OrdinaryDiffEq v5.56.0 (st --manifest does not work, I used dev and st)
(OrdinaryDiffEq) pkg> st
Project OrdinaryDiffEq v5.56.0
Status `~/.julia/dev/OrdinaryDiffEq/Project.toml`
[79e6a3ab] Adapt v3.3.0
[4fba245c] ArrayInterface v3.1.15
[864edb3b] DataStructures v0.18.9
[2b5f629d] DiffEqBase v6.62.2
[ffbed154] DocStringExtensions v0.8.4
[d4d017d3] ExponentialUtilities v1.8.4
[9aa1b823] FastClosures v0.3.2
[6a86dc24] FiniteDiff v2.8.0
[f6369f11] ForwardDiff v0.10.18
[1914dd2f] MacroTools v0.5.6
[46d2c3a1] MuladdMacro v0.2.2
[2774e3e8] NLsolve v4.5.1
[f517fe37] Polyester v0.3.1
[731186ca] RecursiveArrayTools v2.11.4
[189a3867] Reexport v1.0.0
[47a9eef4] SparseDiffTools v1.13.2
[90137ffa] StaticArrays v1.2.0
[3a884ed6] UnPack v1.0.2
[37e2e46d] LinearAlgebra
[56ddb016] Logging
[2f01184e] SparseArrays
Another test, following the suggestion to try solve(prob, Tsit5(), save_everystep=false):
Julia 1.4.2
12 threads
4.248 ms (63 allocations: 8.59 KiB)
394.610 ms (3692 allocations: 76.41 MiB)
24 threads
2.186 ms (123 allocations: 17.13 KiB)
272.353 ms (7127 allocations: 76.88 MiB)
Julia 1.5.4
12 threads
4.159 ms (61 allocations: 8.80 KiB)
383.675 ms (3558 allocations: 76.42 MiB)
24 threads
2.301 ms (121 allocations: 17.52 KiB)
264.515 ms (6991 allocations: 76.91 MiB)
Julia 1.6.1
12 threads
3.888 ms (61 allocations: 5.42 KiB)
612.022 ms (3666 allocations: 76.24 MiB)
24 threads
2.230 ms (121 allocations: 10.77 KiB)
475.833 ms (7156 allocations: 76.54 MiB)
The overhead still exists.
I'll take a look at this.
I can't reproduce this. 1.5:
julia> @time using OrdinaryDiffEq, Base.Threads, Random
0.000234 seconds (999 allocations: 65.375 KiB)
julia> N = 512;
julia> Random.seed!(42); u = rand(2N-1, N); du = similar(u);
julia> function foo(du, u, p, t)
@threads for i ∈ eachindex(u)
du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
end
end
foo (generic function with 1 method)
julia> prob = ODEProblem{true}(foo, u, (0.0, 0.01));
julia> @btime foo($du, $u, nothing, 1.0);
1.306 ms (181 allocations: 26.23 KiB)
julia> @btime solve($prob, Tsit5());
169.109 ms (10446 allocations: 201.27 MiB)
julia> versioninfo()
Julia Version 1.5.0
Commit 96786e22cc (2020-08-01 23:44 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
JULIA_NUM_THREADS = 36
1.6:
julia> @time using OrdinaryDiffEq, Base.Threads, Random
0.019373 seconds (32.44 k allocations: 1.874 MiB, 98.95% compilation time)
julia> N = 512;
julia> Random.seed!(42); u = rand(2N-1, N); du = similar(u);
julia> function foo(du, u, p, t)
@threads for i ∈ eachindex(u)
du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
end
end
foo (generic function with 1 method)
julia> prob = ODEProblem{true}(foo, u, (0.0, 0.01));
julia> @btime foo($du, $u, nothing, 1.0);
1.283 ms (181 allocations: 16.11 KiB)
julia> @btime solve($prob, Tsit5());
162.847 ms (10450 allocations: 200.71 MiB)
julia> versioninfo()
Julia Version 1.6.2-pre.2
Commit ff1827d117* (2021-05-02 02:37 UTC)
Platform Info:
OS: Linux (x86_64-generic-linux)
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
JULIA_NUM_THREADS = 36
1.7:
julia> @time using OrdinaryDiffEq, Base.Threads, Random
0.022232 seconds (32.76 k allocations: 1.872 MiB, 99.33% compilation time)
julia> N = 512;
julia> Random.seed!(42); u = rand(2N-1, N); du = similar(u);
julia> function foo(du, u, p, t)
@threads for i ∈ eachindex(u)
du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
end
end
foo (generic function with 1 method)
julia> prob = ODEProblem{true}(foo, u, (0.0, 0.01));
julia> @btime foo($du, $u, nothing, 1.0);
1.304 ms (181 allocations: 16.11 KiB)
julia> @btime solve($prob, Tsit5());
166.454 ms (10450 allocations: 200.71 MiB)
julia> versioninfo()
Julia Version 1.7.0-DEV.1124
Commit d18cf93bac* (2021-05-19 16:11 UTC)
Platform Info:
OS: Linux (x86_64-generic-linux)
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
JULIA_NUM_THREADS = 36
And I'm the complete opposite!
Julia v1.5:
7.704 ms (31 allocations: 5.09 KiB)
610.698 ms (1949 allocations: 200.10 MiB)
Julia v1.6:
7.164 ms (31 allocations: 3.12 KiB)
598.805 ms (1970 allocations: 199.99 MiB)
Julia's Base RNG changed between the versions. Try it with StableRNGs.jl and see if that "fixes" it, i.e. that it's just a difference in the dynamical system.
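A minimal sketch of that suggestion, assuming the StableRNGs.jl package is installed. The seed 42 and the array size come from the MWE; StableRNGs.jl documents that its streams stay the same across Julia releases, so the initial condition would be identical on every version being compared.

```julia
using StableRNGs

N = 512
rng = StableRNG(42)        # stream does not depend on the Julia version
u = rand(rng, 2N - 1, N)   # replaces Random.seed!(42); u = rand(2N-1, N)
du = similar(u)
```

The rest of the MWE (`foo`, `ODEProblem`, `solve`) can be used unchanged with this `u`.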
"And I'm the complete opposite!"
For me, 1.5 was 3.8% slower than 1.6, not just 2% slower ;)
I also have a very similar CPU to OP (Cascadelake is just a Skylake-AVX512 refresh), the biggest difference is single socket vs dual socket. So I don't think it's architectural.
As I mentioned on discourse, I do start Julia with -C"native,-prefer-256-bit", but that only made around a 5% difference, i.e. just enough to make 1.5 a wee bit faster than 1.6.
Not the 2x regression.
But, I realize I'm not on the official Julia binaries. I built from source with some minor options. I do not think these should matter much:
MARCH=native
JULIA_CPU_TARGET=native
USE_BINARYBUILDER=0
USE_BINARYBUILDER_LLVM=0
USE_BINARYBUILDER_MBEDTLS=1
USE_BINARYBUILDER_OPENBLAS=1
OPENBLAS_TARGET_ARCH=SKYLAKEX
OPENBLAS_DYNAMIC_ARCH=0
I also edited deps/blas.mk to make OpenBLAS build with support for 18 threads, instead of just 16.
Can you quickly grab a generic binary and see? This is just a really strange day...
I feel like "FastBroadcast.jl" is the new "thanks, Obama". My docs aren't building? Thanks, FastBroadcast.jl.
I'll try an official binary.
It's the only thing I can think of that can be CPU-dependent here! It's not even using BLAS!
Everything default, not even an -O3:
> ../julia-1.6.1/bin/julia --project=~/Documents/progwork/julia/env/diffeqold/ (base)
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.6.1 (2021-04-23)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |
(diffeqold) pkg> st
Status `~/Documents/progwork/julia/env/diffeqold/Project.toml`
[1dea7af3] OrdinaryDiffEq v5.56.0
(diffeqold) pkg> up
Updating registry at `~/.julia/registries/General`
Updating git-repo `https://github.com/JuliaRegistries/General.git`
Downloaded artifact: OpenSpecFun
No Changes to `~/Documents/progwork/julia/env/diffeqold/Project.toml`
Updating `~/Documents/progwork/julia/env/diffeqold/Manifest.toml`
[efe28fd5] ↑ OpenSpecFun_jll v0.5.3+4 ⇒ v0.5.4+0
[0dad84c5] + ArgTools
[56f22d72] ~ Artifacts v1.3.0 ⇒
[f43a241f] + Downloads
[b27032c2] + LibCURL
[ca575930] + NetworkOptions
[fa267f1f] ~ TOML v1.0.3 ⇒
[a4e569a6] + Tar
[e66e0078] ~ CompilerSupportLibraries_jll v0.3.4+0 ⇒
[deac9b47] + LibCURL_jll
[29816b5a] + LibSSH2_jll
[c8ffd9c3] + MbedTLS_jll
[14a3606d] + MozillaCACerts_jll
[83775a58] + Zlib_jll
[8e850ede] + nghttp2_jll
[3f19e933] + p7zip_jll
Precompiling project...
46 dependencies successfully precompiled in 56 seconds (28 already precompiled)
julia> @time using OrdinaryDiffEq, Base.Threads, Random
5.944378 seconds (13.48 M allocations: 880.403 MiB, 6.54% gc time)
julia> N = 512;
julia> Random.seed!(42); u = rand(2N-1, N); du = similar(u);
julia> function foo(du, u, p, t)
@threads for i ∈ eachindex(u)
du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
end
end
foo (generic function with 1 method)
julia> prob = ODEProblem{true}(foo, u, (0.0, 0.01));
julia> @btime foo($du, $u, nothing, 1.0);
1.328 ms (181 allocations: 16.11 KiB)
julia> @btime solve($prob, Tsit5());
167.846 ms (10450 allocations: 200.71 MiB)
julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
JULIA_NUM_THREADS = 36
@shipengcheng1230 Could you profile Julia 1.6 and 1.5, and find out where 1.6 is taking a lot more time for you? As none of us can reproduce your problem, there's not really anything we can do at the moment.
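For reference, a minimal sketch of such a profile using only the `Profile` standard library. The function `work` is a hypothetical stand-in so the snippet runs on its own; in the actual comparison the profiled call would be `solve(prob, Tsit5())`, run once on each Julia version.

```julia
using Profile

# Hypothetical stand-in workload; replace the calls below with solve(prob, Tsit5()).
function work(n)
    s = 0.0
    for i in 1:n
        s += sin(cos(i))
    end
    return s
end

work(10)            # warm up first so compilation does not show up in the profile
Profile.clear()     # discard samples from any earlier run
@profile work(10^7)

# Flat listing sorted by sample count; lines with fewer than 10 samples are hidden.
Profile.print(format = :flat, sortedby = :count, mincount = 10)
```

Comparing the top entries of the two flat listings side by side is usually enough to see which frames gained samples between versions.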
@chriselrod, I used @profview in VS Code; here is the result:
[Image: @profview flame graph]
There seem to be no particular "bad" spots. Unfortunately, the problem persists even when I use the official Julia binary with everything fresh. I saw that @chriselrod used twice as many threads as physical cores; I never knew that could give better performance! Again on two Intel(R) Xeon(R) Gold 6136 CPUs:
Julia 1.5.4
12 threads
4.137 ms (61 allocations: 8.80 KiB)
385.511 ms (3560 allocations: 76.42 MiB)
24 threads
2.096 ms (121 allocations: 17.52 KiB)
279.497 ms (7012 allocations: 76.91 MiB)
48 threads
1.217 ms (241 allocations: 34.98 KiB)
232.280 ms (13894 allocations: 77.88 MiB)
Julia 1.6.1
12 threads
3.896 ms (61 allocations: 5.42 KiB)
588.468 ms (3665 allocations: 76.24 MiB)
24 threads
2.212 ms (121 allocations: 10.77 KiB)
511.847 ms (7080 allocations: 76.53 MiB)
48 threads
1.322 ms (229 allocations: 21.11 KiB)
356.515 ms (14112 allocations: 77.14 MiB)
Julia 1.7.0 - nightly
12 threads
3.941 ms (61 allocations: 5.42 KiB)
598.486 ms (3654 allocations: 76.24 MiB)
24 threads
2.020 ms (121 allocations: 10.77 KiB)
426.947 ms (7119 allocations: 76.54 MiB)
48 threads
1.278 ms (241 allocations: 21.48 KiB)
359.284 ms (14068 allocations: 77.14 MiB)
The problem disappears on this computer (one Intel(R) Core(TM) i9-10920X, 12 CPUs):
Julia 1.5.4
12 threads
3.503 ms (61 allocations: 8.80 KiB)
349.265 ms (3611 allocations: 200.30 MiB)
24 threads
2.100 ms (121 allocations: 17.52 KiB)
275.177 ms (7033 allocations: 200.79 MiB)
Julia 1.6.1
12 threads
3.225 ms (61 allocations: 5.42 KiB)
352.217 ms (3624 allocations: 200.12 MiB)
24 threads
1.970 ms (121 allocations: 10.77 KiB)
281.144 ms (7047 allocations: 200.41 MiB)
There is no big issue on this machine either (two Intel® Xeon® Gold 5220, 2 * 18 CPUs), although using 72 threads does not improve performance the way doubling the thread count did above.
Julia 1.5.4
18 Threads
3.241 ms (91 allocations: 13.16 KiB)
353.726 ms (5344 allocations: 200.55 MiB)
36 Threads
1.876 ms (181 allocations: 26.23 KiB)
284.659 ms (10495 allocations: 201.27 MiB)
72 Threads
1.880 ms (181 allocations: 26.23 KiB)
281.621 ms (10494 allocations: 201.27 MiB)
Julia 1.6.1
18 Threads
3.016 ms (91 allocations: 8.09 KiB)
350.281 ms (5413 allocations: 200.27 MiB)
36 Threads
1.837 ms (181 allocations: 16.11 KiB)
294.028 ms (10673 allocations: 200.72 MiB)
72 Threads
1.988 ms (361 allocations: 32.16 KiB)
309.172 ms (20876 allocations: 201.61 MiB)
It was actually on this last machine (two Intel® Xeon® Gold 5220) that I first noticed my simulation running significantly slower on Julia 1.6 than on 1.5 (with everything else the same). The ODEs I solve mainly involve element-wise computation and FFTs for now. I tried to isolate the issue, but I guess it is much more convoluted than I thought.
Thanks again for the help!