StaticArrays.jl icon indicating copy to clipboard operation
StaticArrays.jl copied to clipboard

Allocations with StaticArrays in Julia v1.11

Open kaipartmann opened this issue 1 year ago • 4 comments

I am the main author of the package Peridynamics.jl and recently changed to Julia v1.11. With the new Julia version, a significant increase in the simulation time can be found when using a certain material.

Peridynamics.jl Benchmark

Please see the following benchmark, which is also described in more detail in the tutorials (Tutorial mode I fracture).

using Peridynamics

function mode_i(mat, npyz)
    l, Δx, δ, a = 1.0, 1/npyz, 3.015/npyz, 0.5
    pos, vol = uniform_box(l, l, 0.1l, Δx)
    ids = sortperm(pos[2,:])
    body = Body(mat, pos[:, ids], vol[ids])
    material!(body; horizon=3.015Δx, E=2.1e5, nu=0.25, rho=8e-6, Gc=2.7)
    point_set!(p -> p[1] ≤ -l/2+a && 0 ≤ p[2] ≤ 2δ, body, :set_a)
    point_set!(p -> p[1] ≤ -l/2+a && -2δ ≤ p[2] < 0, body, :set_b)
    precrack!(body, :set_a, :set_b)
    point_set!(p -> p[2] > l/2-Δx, body, :set_top)
    point_set!(p -> p[2] < -l/2+Δx, body, :set_bottom)
    velocity_bc!(t -> -30, body, :set_bottom, :y)
    velocity_bc!(t -> 30, body, :set_top, :y)
    vv = VelocityVerlet(steps=2000)
    job = Job(body, vv; path=joinpath(@__DIR__, "results", "mode_i"))
    @time submit(job)
    return nothing
end

mode_i(NOSBMaterial(), 25) # compilation
mode_i(NOSBMaterial(), 50) # work
❯ julia +1.10.5 --project -t 6 mode_i_benchmark.jl
  0.719477 seconds (1.39 M allocations: 450.082 MiB, 2.77% gc time, 74.16% compilation time)
  6.090774 seconds (1.28 M allocations: 1.118 GiB, 1.06% gc time, 2.17% compilation time)

❯ julia +1.11.1 --project -t 6 mode_i_benchmark.jl
  9.574071 seconds (2.15 G allocations: 82.415 GiB, 15.59% gc time, 6.09% compilation time)
177.261652 seconds (40.34 G allocations: 1.501 TiB, 23.15% gc time)

The same simulation takes 29x longer on v1.11.1 than v1.10.5.

When investigating the problem, I narrowed it down to the calculation of the deformation gradient: https://github.com/kaipartmann/Peridynamics.jl/blob/f9b8fe60e743d828cf96d2366610c6a69d7ac054/src/physics/correspondence.jl#L174-L194

MWE

The same behavior can be reproduced with a small MWE:

using LinearAlgebra, StaticArrays, BenchmarkTools

function mwe1(a, X, n)
    K = zeros(SMatrix{3,3})
    for i in 1:n
        k = a * i
        K += k * X * X'
    end
    return K
end

@btime mwe1(1e-5, SVector{3}(1.0, 1.0, 1.0), 1_000_000);
❯ julia +1.10.5 --project -t 6 mwe.jl 
  1.070 ms (0 allocations: 0 bytes)

❯ julia +1.11.1 --project -t 6 mwe.jl 
  129.890 ms (7000000 allocations: 289.92 MiB)

I think the problem is related to allocations when using += with a SMatrix or SVector.

As a quick fix, rewriting the calculation by hand leads to even slightly better performance on v1.11:

function mwe2(a, X, n)
    K11, K12, K13 = 0.0, 0.0, 0.0
    K21, K22, K23 = 0.0, 0.0, 0.0
    K31, K32, K33 = 0.0, 0.0, 0.0
    X1, X2, X3 = X[1], X[2], X[3]
    for i in 1:n
        k = a * i
        K11 += k * X1 * X1
        K12 += k * X1 * X2
        K13 += k * X1 * X3
        K21 += k * X2 * X1
        K22 += k * X2 * X2
        K23 += k * X2 * X3
        K31 += k * X3 * X1
        K32 += k * X3 * X2
        K33 += k * X3 * X3
    end
    K = SMatrix{3,3}(K11, K21, K31, K12, K22, K32, K13, K23, K33)
    return K
end

@btime mwe2(1e-5, SVector{3}(1.0, 1.0, 1.0), 1_000_000);
❯ julia +1.10.5 --project -t 6 mwe.jl 
  1.211 ms (0 allocations: 0 bytes)

❯ julia +1.11.1 --project -t 6 mwe.jl 
  1.072 ms (0 allocations: 0 bytes)

kaipartmann avatar Oct 22 '24 09:10 kaipartmann

Interesting. This looks like a regression in Julia compiler. I'd have expected call-site @inline to fix this but it does not.

mateuszbaran avatar Oct 22 '24 09:10 mateuszbaran

I think this should have a corresponding issue in the julia repo since it is probably a regression there.

KristofferC avatar Oct 22 '24 11:10 KristofferC

@KristofferC I opened an issue in the julia repo. Interestingly, the problem can be solved by changing k * X * X' to k * (X * X'):

function mwe3(a, X, n)
    K = zeros(SMatrix{3,3})
    for i in 1:n
        k = a * i
        K += k * (X * X')
    end
    return K
end
@btime mwe3(1e-5, SVector{3}(1.0, 1.0, 1.0), 1_000_000);
❯ julia +1.10.5 --project -t 6 mwe.jl 
  805.458 μs (0 allocations: 0 bytes)

❯ julia +1.11.1 --project -t 6 mwe.jl
  708.208 μs (0 allocations: 0 bytes)

Since the problem is likely not related to +=, I changed the title accordingly.

kaipartmann avatar Oct 22 '24 18:10 kaipartmann

Before Julia 1.7, I believe this would have done K += (k * X) * X', which also seems fast now.

After 3-arg *, this calls K += broadcast(*, k, X, X'), which is slow, like mwe1.

Strangely, using K += k .* X .* X' seems to be fast, like mwe3.

mcabbott avatar Oct 25 '24 18:10 mcabbott