
Performance regression up to 20% when updating from Julia v1.10.4 to v1.11.0-rc1

Open benegee opened this issue 1 year ago • 24 comments

We have recently started testing Julia 1.11 in our projects Trixi.jl and TrixiParticles.jl and observed performance regressions of up to 20%. This has been confirmed on laptops (Apple M2, M3), a workstation (AMD Ryzen Threadripper 3990X), and the JUWELS HPC system at the Jülich Supercomputing Centre (AMD EPYC Rome 7402 CPU).

@sloede has put together instructions and a MWE here: https://github.com/trixi-framework/performance-regression-julia-v1.10-vs-v1.11 It boils down to adding Trixi.jl and running BenchmarkTools for an isolated function in Trixi.jl.
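(For completeness, a throwaway environment for the snippet below can be set up roughly as follows; the linked repository contains the authoritative instructions.)

using Pkg
Pkg.activate(temp = true)               # throwaway environment for the benchmark
Pkg.add(["Trixi", "BenchmarkTools"])    # P4estMesh, DGSEM, etc. are provided by Trixi.jl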

using Trixi
using BenchmarkTools

equations = CompressibleEulerEquations3D(1.4)

volume_flux = flux_ranocha
solver = DGSEM(polydeg=3, surface_flux=flux_ranocha,
               volume_integral=VolumeIntegralFluxDifferencing(volume_flux))

coordinates_min = (0.0, 0.0, 0.0)
coordinates_max = (2.0, 2.0, 2.0)

trees_per_dimension = (4, 4, 4)

mesh = P4estMesh(trees_per_dimension, polydeg=1,
                 coordinates_min=coordinates_min, coordinates_max=coordinates_max,
                 periodicity=true, initial_refinement_level=1)

semi = SemidiscretizationHyperbolic(mesh, equations, initial_condition_convergence_test,
                                    solver)

u_ode = compute_coefficients(0.0, semi)
du_ode = similar(u_ode)

@benchmark Trixi.rhs!($du_ode, $u_ode, $semi, 0.0)

For this code, we get a median time of 6.079 ms on our workstation with Julia v1.10.4 vs. 7.200 ms with v1.11.0-rc1.

@sloede @ranocha @vchuravy @efaulhaber @LasNikas

benegee avatar Jul 03 '24 12:07 benegee

Would it be possible to profile this to identify the point of regression?

jishnub avatar Jul 03 '24 12:07 jishnub

Thanks for picking this up!

For a start, running @profile on 1000 iterations of rhs! and using PProf gives these flamegraphs (screenshots attached; top: 1.10.4, bottom: 1.11-rc1).

We can of course dig deeper.
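For reference, the setup was essentially the following (a minimal sketch; it assumes the MWE from the issue description has already been run, so du_ode, u_ode, and semi are defined, and that PProf.jl is installed):

using Profile, PProf

Profile.clear()
@profile for _ in 1:1000
    Trixi.rhs!(du_ode, u_ode, semi, 0.0)
end

# Serve the collected samples as an interactive flamegraph in the browser.
pprof()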

benegee avatar Jul 03 '24 14:07 benegee

The difference looks to be some getindex calls that now appear in the flamegraph.

gbaraldi avatar Jul 03 '24 14:07 gbaraldi

Yes! For 1.11 we have checkbounds beneath these calls, which accounts for a fair number of samples. For 1.10 they are not captured.

benegee avatar Jul 03 '24 15:07 benegee

Would be good to get a MWE that gets a bit closer to those suspected functions.

KristofferC avatar Jul 03 '24 15:07 KristofferC

The issue is not the checkbounds, btw. There's a whole getindex call that seems to be there. It might be Memory{T}. (Screenshots: bottom is 1.11.)

gbaraldi avatar Jul 03 '24 15:07 gbaraldi

Yeah, it looks indeed like the calls to get_contravariant_vector (get_contrav...) before get_node_vars and getindex after get_node_vars are much more expensive on Julia v1.11...

For reference: this code is called from

https://github.com/trixi-framework/Trixi.jl/blob/b4eef6d87d1e5a737ab31ad9c1e5bad293a19b4d/src/solvers/dgsem_structured/dg_3d.jl#L143-L156

get_contravariant_vector is defined in

https://github.com/trixi-framework/Trixi.jl/blob/b4eef6d87d1e5a737ab31ad9c1e5bad293a19b4d/src/solvers/dgsem_structured/dg.jl#L26-L29
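For readers without the repository open, the linked definition has roughly this shape (a paraphrase based on the lambda quoted later in this thread; the exact annotations in the linked lines may differ):

@inline function get_contravariant_vector(index, contravariant_vectors, indices...)
    # SVector is from StaticArrays.jl, which is available inside Trixi.
    SVector(ntuple(@inline(dim -> contravariant_vectors[dim, index, indices...]),
                   Val(ndims(contravariant_vectors) - 3)))
end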

ranocha avatar Jul 03 '24 18:07 ranocha

So there are two commits that might deserve testing here. https://github.com/JuliaLang/julia/commit/9aa7980358349ee7017fa614525f571ffa92c55d as reported in https://github.com/JuliaLang/julia/issues/53158#issuecomment-1924283666 and https://github.com/JuliaLang/julia/pull/51720

gbaraldi avatar Jul 03 '24 18:07 gbaraldi

I do think we might need to actually do the initialized-memory zeroing so we can revert that change. It has caused a lot of regressions.

oscardssmith avatar Jul 03 '24 18:07 oscardssmith

Would be good to get a MWE that gets a bit closer to those suspected functions.

After running the code posted by @benegee above, you can directly benchmark get_contravariant_vector by executing the following lines:

@unpack contravariant_vectors = semi.cache.elements

ii = j = k = element = 1

@benchmark Trixi.get_contravariant_vector(1, $contravariant_vectors, $ii, $j, $k, $element)

On our Rocinante (AMD Ryzen), this gives me the following:

v1.10.4

julia> @benchmark Trixi.get_contravariant_vector(1, $contravariant_vectors, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.930 ns … 32.081 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.050 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.061 ns ±  0.309 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

            █ ▂
  ▂▂▁▂▂▁▁▁▁▃█▄█▇▃▅▅▂▂▂▂▁▂▁▁▂▁▁▂▁▁▁▁▂▂▂▂▂▁▁▁▂▂▁▂▂▁▂▂▂▂▁▂▂▂▂▁▂ ▂
  3.93 ns        Histogram: frequency by time        4.47 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

v1.11.0-rc1

julia> @benchmark Trixi.get_contravariant_vector(1, $contravariant_vectors, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  7.147 ns … 15.917 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.167 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.186 ns ±  0.183 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █▃
  ▂██▂▂▅▃▆▂▂▂▁▁▁▂▂▁▂▂▁▁▁▂▁▁▁▁▁▂▁▂▁▁▁▁▁▁▂▁▂▁▂▁▂▁▁▂▂▁▂▂▁▁▁▁▂▁▂ ▂
  7.15 ns        Histogram: frequency by time        7.64 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

sloede avatar Jul 04 '24 04:07 sloede

@gbaraldi @oscardssmith thanks for your input! It would be great if you could continue to help with the analysis of the root cause and to understand what needs to be fixed on the Julia side, as we do not have the expertise for something this close to the metal.

I know everyone has a full plate, but with this already being v1.11 RC1, on our side multiple research groups are slowly getting into panic mode 😅. Losing the performance boost of the now-defunct --check-bounds=no was already hard to swallow, but a 20-30% drop in serial performance would be near catastrophic for us in the HPC world.

sloede avatar Jul 05 '24 03:07 sloede

Julia 1.10 will likely become an LTS release once 1.11 has been released. We will of course see if we can fix this in a patch release, but performance regressions are not release blockers.

vchuravy avatar Jul 05 '24 05:07 vchuravy

I've bisected this to 909bceae8a6, the big Memory{T} PR. There seems to be a problem with optimizing getindex with the new Memory{T}-based arrays. Some of the performance difference, but not all, seems to go away by rewriting get_contravariant_vector slightly.

@inline function get_contravariant_vector(index, contravariant_vectors, indices...)
    cv = @view(contravariant_vectors[:, index, indices...])
    SVector(ntuple(dim->cv[dim], Val(ndims(contravariant_vectors) - 3)))
end

sgaure avatar Jul 05 '24 06:07 sgaure

Seems mostly like a typo there too, since in the original code, @inline(dim->contravariant_vectors[dim, index, indices...]), the @inbounds does not apply to any call expression. Though it was good to see that the code previously worked well without needing to manually override the correctness of the bounds checking.
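To spell that out: an @inbounds wrapped around the definition of the anonymous function removes no bounds check, because no indexing happens lexically at that spot. A hypothetical variant where the annotation would actually cover the indexing expression looks like this (an illustration only, not a suggested fix; the name get_contravariant_vector_inbounds is made up for the example):

@inline function get_contravariant_vector_inbounds(index, contravariant_vectors, indices...)
    # The @inbounds now sits inside the closure body, directly around the
    # getindex call, so the bounds check can actually be elided when inlined.
    SVector(ntuple(dim -> @inbounds(contravariant_vectors[dim, index, indices...]),
                   Val(ndims(contravariant_vectors) - 3)))
end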

vtjnash avatar Jul 05 '24 20:07 vtjnash

I've bisected this to 909bcea, the big Memory{T} PR. There seems to be a problem with optimizing getindex with the new Memory{T}-based arrays. Some of the performance difference, but not all, seems to go away by rewriting get_contravariant_vector slightly.

@inline function get_contravariant_vector(index, contravariant_vectors, indices...)
    cv = @view(contravariant_vectors[:, index, indices...])
    SVector(ntuple(dim->cv[dim], Val(ndims(contravariant_vectors) - 3)))
end

@sgaure Thanks a lot for taking a look. Your implementation does indeed recover some of the performance in v1.11.0-rc1. Interestingly, with it the v1.10.4 and v1.11.0-rc1 timings are essentially identical, i.e., while it makes v1.11.0-rc1 faster, it actually slows down execution on v1.10.4.

On our Rocinante (AMD Ryzen Threadripper):

v1.10.4

julia> @benchmark Trixi.get_contravariant_vector(1, $contravariant_vectors, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.930 ns … 20.820 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.030 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.054 ns ±  0.217 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                        █ ▆ ▅ ▃    ▅ ▂                   ▁ ▁ ▁
  ▃▁▅▁▆▁▃▁▁▁▁▁▁▁▁▁▁▁▁█▄▁█▁█▁█▁█▇▁█▁█▁█▁▄▁▄▁▁▇▁▃▁▄▁▃▁▁▄▁▄▁█▁█ █
  3.93 ns      Histogram: log(frequency) by time     4.19 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sgaure_get_contravariant_vector(1, $contravariant_vectors, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  5.780 ns … 16.460 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.780 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.795 ns ±  0.183 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █
  █▄▁▁▁▁▁▁▁▁▁▁▁▁█▃▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  5.78 ns        Histogram: frequency by time        5.82 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

v1.11.0-rc1

julia> @benchmark Trixi.get_contravariant_vector(1, $contravariant_vectors, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  7.147 ns … 17.998 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.157 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.172 ns ±  0.195 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄█                                                         ▁
  ██▁█▁▃▁▃▁▁▁▁▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃▁▁▃▁▁▁▃▁▃▁▁▄▁▄▁▃▁▄▁▄▁▁▅ █
  7.15 ns      Histogram: log(frequency) by time     7.47 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sgaure_get_contravariant_vector(1, $contravariant_vectors, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  5.770 ns … 12.080 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.800 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.819 ns ±  0.158 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▁  █    ▁
  ▃█▁▄█▁▇▁▄█▁▅▂▂▁▁▂▂▁▁▁▁▂▁▂▂▁▂▁▂▁▁▁▂▁▁▁▂▁▁▁▂▁▂▁▂▁▁▂▁▁▂▁▂▂▁▁▂ ▂
  5.77 ns        Histogram: frequency by time        6.13 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

It thus seems that v1.10.4 is still able to optimize more of the code away in the original implementation, and v1.11.0-rc1 is not able to fully catch up. It's a start, but it does not explain why v1.11.0-rc1 is not able to fully optimize the code yet, and it would be nice to also get the missing ~50% of performance back.

Furthermore, it would be nice to also understand why getindex starts showing up in the profile for v1.11.0-rc1.
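(One way to dig into that would be to compare the generated code for the indexing path on both Julia versions and diff the output, e.g. along these lines; InteractiveUtils ships with Julia.)

using InteractiveUtils

# Run after the benchmark setup above, once per Julia version.
@code_typed Trixi.get_contravariant_vector(1, contravariant_vectors, ii, j, k, element)
@code_llvm debuginfo=:none Trixi.get_contravariant_vector(1, contravariant_vectors,
                                                          ii, j, k, element)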

sloede avatar Jul 07 '24 05:07 sloede

Another interesting observation:

@sgaure I took your implementation for get_contravariant_vector and applied it also to get_node_vars:

@inline function sgaure_get_node_vars(u, equations, solver::DG, indices...)
    uv = @view(u[:, indices...])
    SVector(ntuple(idx->uv[idx], Val(nvariables(equations))))
end

with

u = Trixi.wrap_array(u_ode, semi)

ii = j = k = element = 1

@benchmark Trixi.get_node_vars($u, $equations, $solver, $ii, $j, $k, $element)

@benchmark sgaure_get_node_vars($u, $equations, $solver, $ii, $j, $k, $element)

it turns out that with this version, v1.10.4 and v1.11.0-rc1 are again equally fast and, to my surprise, both are faster than our original implementation on v1.10.4.

On our Rocinante (AMD Ryzen Threadripper):

v1.10.4

julia> @benchmark Trixi.get_node_vars($u, $equations, $solver, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.930 ns … 14.730 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.930 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.943 ns ±  0.188 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █
  █▁▁▁▁▁▆▂▁▁▁▁▂▂▁▁▁▁▁▂▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  3.93 ns        Histogram: frequency by time        4.02 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sgaure_get_node_vars($u, $equations, $solver, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.240 ns … 22.801 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.240 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.251 ns ±  0.220 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █
  █▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  3.24 ns        Histogram: frequency by time        3.26 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

v1.11.0-rc1

julia> @benchmark Trixi.get_node_vars($u, $equations, $solver, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.930 ns … 48.041 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.930 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.946 ns ±  0.457 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                   ▂
  █▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▃ ▂
  3.93 ns        Histogram: frequency by time        3.94 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sgaure_get_node_vars($u, $equations, $solver, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.240 ns … 21.401 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.240 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.252 ns ±  0.214 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █
  █▁▄▁▂▁▂▁▁▁▁▁▁▂▁▂▁▂▁▁▂▁▂▁▂▁▂▁▁▁▁▂▁▁▁▂▁▁▁▁▁▁▂▁▂▁▂▁▁▂▁▂▁▂▁▂▁▂ ▂
  3.24 ns        Histogram: frequency by time         3.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now, what's up with that? Seeing this, maybe you could take another look at get_contravariant_vector and see whether it can be massaged into getting the original performance back, @sgaure?

sloede avatar Jul 07 '24 05:07 sloede

The proper fix for this is to revive https://reviews.llvm.org/D136218 and make it correct.

gbaraldi avatar Jul 07 '24 16:07 gbaraldi

Confirming @sloede's observations: with the alternative implementation of get_node_vars, we see an improvement both with 1.10.4 and 1.11.0-rc1. For get_contravariant_vector, we see an improvement with 1.11.0-rc1 but a slowdown with 1.10.4.

In the end, the alternative implementation recovers some of the performance in v1.11.0-rc1, but unfortunately the overall performance of the higher-level functions rhs! and flux_differencing_kernel! is still best with 1.10.4 and the original implementation.

benegee avatar Jul 08 '24 20:07 benegee

... seeing a similar regression in VoronoiFVM.jl nonlinear stationary solver examples, with KrylovJL_GMRES & ILUZero.ilu0 preconditioner.

j-fu avatar Jul 12 '24 21:07 j-fu

@gbaraldi @oscardssmith Thanks a lot for the very informative and cordial discussions about performance at JuliaCon ♥️.

Did you, by any chance, make progress regarding the proposed LLVM patch, and if so, is it already testable somewhere (preferably with pre-built binaries)?

sloede avatar Jul 16 '24 12:07 sloede

I'm working on it (it's not super trivial, since it's basically a general store-sinking pass). Given that, I would consider it not release-blocking (so that the release gets out and we may find more issues). I believe it's a very self-contained thing, so we can backport it quite easily.

gbaraldi avatar Jul 23 '24 15:07 gbaraldi

With v1.11.0-rc3 being released, I reran the benchmarks above.

TL;DR No overall performance improvements from v1.11.0-rc1 to v1.11.0-rc3 😢

Each result is the minimum time obtained from using @benchmark ... as posted above (in ms for rhs! and in ns for the rest):

Overall

                                      v1.10.4   v1.11.0-rc1   v1.11.0-rc3   1.10 → 1.11-rc1   1.10 → 1.11-rc3
Trixi.rhs! [ms]                         6.039         7.031         6.997              16%               16%
Trixi.get_contravariant_vector [ns]     3.940         7.147         7.147              81%               81%
sgaure_get_contravariant_vector [ns]    5.430         5.770         5.310               6%               -2%
Trixi.get_node_vars [ns]                3.420         3.320         3.320              -3%               -3%
sgaure_get_node_vars [ns]               3.010         2.550         2.780             -15%               -8%

As written above, the overall times have, unfortunately, not improved at all. However, the sgaure_get_contravariant_vector implementation is now faster than before (and now also faster than on v1.10), which is great. On the other hand, sgaure_get_node_vars shows a small regression compared to v1.11-rc1 (though it is still faster than on v1.10.4).

At the moment it still looks like we should go with the implementations of @sgaure, but an overall performance regression of 16% is still very sad.

@gbaraldi @oscardssmith Did you manage to fix the known LLVM-related (?) performance regression we talked about at JuliaCon during the subsequent hackathon? And if yes, should/would this have already been reflected in the numbers for rc3 above?

sloede avatar Aug 27 '24 07:08 sloede

I haven't yet, sorry. The LLVM thing is more involved than I had imagined earlier and basically requires writing a full pass.

gbaraldi avatar Aug 27 '24 14:08 gbaraldi

Thanks for keeping at it! I'll keep my fingers crossed 🤞

sloede avatar Aug 27 '24 15:08 sloede

Is there a chance that rc4 might have fixed some of the performance regressions reported here?

sloede avatar Sep 28 '24 11:09 sloede

@sloede We have tested it with RC4; it is still about 20% slower than 1.10.5. RC2, though, was almost 30% slower than 1.10.4.

svchb avatar Sep 28 '24 14:09 svchb

Since this is apparently an upstream issue (https://reviews.llvm.org/D136218) and we are releasing v1.12 soon, I'm cleaning up old v1.11 issues. It seems like the difference has shrunk but is still present on master vs. 1.10 (on x86_64-linux-gnu):

julia> @benchmark Trixi.get_contravariant_vector(1, $contravariant_vectors, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations per sample.
 Range (min … max):  4.709 ns … 11.910 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.720 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.772 ns ±  0.216 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆ █  ▁ ▁ ▂    ▃  ▄               ▁      ▁▅ ▆             ▁ ▂
  █▁█▁▅█▁█▁█▁▁▇▁█▁▇█▁▆▁▆▁▁▆▁▄▁▁▃▁▃▁█▁▁▇▁▄▁██▁█▁█▁▁▇▁▆▁▁▅▁█▁█ █
  4.71 ns      Histogram: log(frequency) by time     4.95 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.


julia> @benchmark Trixi.get_contravariant_vector(1, $contravariant_vectors, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations per sample.
 Range (min … max):  6.869 ns … 16.760 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.930 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.013 ns ±  0.293 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

             █                                 ▁              
  ▂▄▁▂▁▂▁▁▁█▃█▂▁▂▁▂▁▂▂▃▄▁▂▁▂▁▁▁▂▁▁▂▁▂▁▂▂▂▁▁▂▁▂▁█▂▆▁▁▂▁▁▁▂▁▃▃ ▂
  6.87 ns        Histogram: frequency by time        7.19 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

vtjnash avatar Sep 30 '25 17:09 vtjnash