
Regression in performance of FEM-code using AD in threaded loop in 1.12 vs 1.11.

Open KristofferC opened this issue 4 weeks ago • 5 comments

The code below runs the file at https://github.com/Ferrite-FEM/Ferrite.jl/blob/kc/landau_opt/docs/src/literate-gallery/landau.jl, which has the option to run the assembly routine with Threads.@threads or not. It uses a bad style of parallelism based on threadid (roughly the pattern sketched below), but that is not the point here.
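A minimal, self-contained sketch of the threadid pattern (illustrative only, not the script's actual code):

# Illustrative sketch only (not the script's code): each task indexes a buffer
# by Threads.threadid(). This is the discouraged pattern because tasks can
# migrate between threads, so the "per-thread" slot is not actually private.
function threadid_style_sum(cells::Vector{Int})
    buffers = zeros(Threads.nthreads())        # one accumulator per thread id
    Threads.@threads for c in cells
        buffers[Threads.threadid()] += sin(c)  # stand-in for per-cell assembly work
    end
    return sum(buffers)
end

threadid_style_sum(collect(1:10_000))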

If we run this code on 1.11:

git clone https://github.com/Ferrite-FEM/Ferrite.jl/
cd Ferrite.jl 
git checkout kc/landau_opt
#### 1.11 ####

julia +1.11 --project=docs -e 'using Pkg; Pkg.update()'

# non-threaded loop
julia +1.11 --project=docs --threads=8 docs/src/literate-gallery/landau.jl

# F: 0.016159 seconds
# ∇F!: 0.065011 seconds
# ∇²F!: 1.169886 seconds (180.00 k allocations: 49.439 MiB, 0.07% gc time)
#  9.461187 seconds (3.51 M allocations: 1.392 GiB, 1.08% gc time, 6.16% compilation time)

# threaded loop
RUN_THREADED=1 julia +1.11 --project=docs --threads=8 docs/src/literate-gallery/landau.jl

# F: 0.004370 seconds (1.81 k allocations: 188.125 KiB)
# ∇F!: 0.018842 seconds (1.81 k allocations: 188.125 KiB)
# ∇²F!: 0.262754 seconds (181.82 k allocations: 49.624 MiB, 0.21% gc time)
#  3.578371 seconds (3.54 M allocations: 1.395 GiB, 2.35% gc time, 15.19% compilation time)

We can make the following observations:

  • The amount allocated in the threaded and non-threaded loops is roughly the same (≈1.39 GiB in both).
  • The allocation overhead from F and ∇F! being called threaded is fixed and small.

Now, if we run this on 1.12:

#### 1.12 ####

julia +1.12 --project=docs -e 'using Pkg; Pkg.update()'

# non threaded loop
julia +1.12 --project=docs --threads=8 docs/src/literate-gallery/landau.jl

# F: 0.014978 seconds
# ∇F!: 0.064877 seconds
# ∇²F!: 1.213513 seconds (210.00 k allocations: 50.812 MiB, 0.07% gc time)
#  9.258766 seconds (2.01 M allocations: 1.317 GiB, 0.62% gc time, 3.71% compilation time)

# threaded loop
RUN_THREADED=1 julia +1.12 --project=docs --threads=8 docs/src/literate-gallery/landau.jl

# F: 0.018132 seconds (121.30 k allocations: 83.894 MiB, 24.68% gc time)
# ∇F!: 0.026937 seconds (91.30 k allocations: 83.436 MiB, 12.25% gc time)
# ∇²F!: 1.037706 seconds (11.70 M allocations: 6.379 GiB, 26.83% gc time)
#  7.667274 seconds (72.23 M allocations: 40.276 GiB, 22.67% gc time, 4.58% compilation time)

We can see the following:

  • The code allocates an abhorrent amount in the threaded case (40.3 GiB total vs 1.4 GiB on 1.11) and the GC has to do a lot of work.
  • The allocation overhead in the individual functions is no longer fixed (e.g. F goes from 1.81 k to 121.30 k allocations).

I'll see if I can bisect something

KristofferC avatar Nov 25 '25 13:11 KristofferC

index 3c95aa40..a2cef855 100644
--- a/docs/src/literate-gallery/landau.jl
+++ b/docs/src/literate-gallery/landau.jl
@@ -136,8 +136,7 @@ function assemble_cell!(f, dofvector, dofhandler, cache, i)
     f(cache, eldofs)
 end
 
-if haskey(ENV, "RUN_THREADED")
-function assemble_model!(f::F, dofvector, model) where {F}
+function assemble_model_threads!(f::F, dofvector, model) where {F}
     dofhandler = model.dofhandler
     for indices in model.threadindices
         Threads.@threads for i in indices
@@ -146,8 +145,8 @@ function assemble_model!(f::F, dofvector, model) where {F}
         end
     end
 end
-else
-function assemble_model!(f::F, dofvector, model) where {F}
+
+function assemble_model_nothreads!(f::F, dofvector, model) where {F}
     dofhandler = model.dofhandler
     for indices in model.threadindices
         for i in indices
@@ -156,7 +155,16 @@ function assemble_model!(f::F, dofvector, model) where {F}
         end
     end
 end
+
+const RUN_THREADED = Ref{Bool}(false)
+function assemble_model!(args...)
+    if RUN_THREADED[]
+        return assemble_model_threads!(args...)
+    else
+        return assemble_model_nothreads!(args...)
+    end
 end
+
 # This calculates the total energy calculation of the grid
 function F(dofvector::Vector{T}, model) where {T}
     out = Threads.Atomic{T}(zero(T))
@@ -257,11 +265,15 @@ model_small = LandauModel(α, G, (2, 2, 2), left, right, element_potential)
 minimize!(model_small; show_trace=false)
 
 model = LandauModel(α, G, (50, 50, 2), left, right, element_potential)
+
 # save_landau("landauorig", model)
-@time minimize!(model)
-# save_landau("landaufinal", model)
 
+RUN_THREADED[] = false
+@time minimize!(model)
+RUN_THREADED[] = true
+@time minimize!(model)
 
+# save_landau("landaufinal", model)

If you apply this diff to the script you get proper performance out of the threaded case on 1.12. It looks like there is an issue with functions that are first run/compiled inside a threaded region. If you run the example first without threads and then enable threads, it picks up the proper specializations of the functions, presumably because it is reusing code compiled during the single-threaded pass.
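For anyone who just wants the workaround in generic form, a sketch of the same idea with hypothetical names (not the issue's code): run the kernel once before the threaded loop so the threads start from already-compiled code.

# Sketch of the warm-up idea (hypothetical names, not the issue's code):
# compile the kernel on the current task first, then let the threads reuse it.
function warmed_up_map!(out::Vector, kernel, items)
    @assert length(out) == length(items)
    isempty(items) && return out
    out[1] = kernel(items[1])              # first call compiles on this task
    Threads.@threads for i in 2:length(items)
        out[i] = kernel(items[i])          # threads reuse the compiled code
    end
    return out
end

# Usage:
out = zeros(10_000)
warmed_up_map!(out, i -> sin(i)^2, collect(1:10_000))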

kwdye avatar Nov 25 '25 14:11 kwdye

Is it a captured variable causing a mess?
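By "captured variable" I mean the classic closure-boxing problem; a minimal illustration, unrelated to the issue's actual code:

# Minimal illustration of the captured-variable problem (not the issue's code).
# `total` is reassigned after the closure captures it, so it is stored in a
# Core.Box and every access inside the closure is dynamic and allocates.
function boxed_sum(xs)
    total = 0.0
    f = x -> (total += x)          # closure captures and reassigns `total`
    foreach(f, xs)
    return total
end

# `@code_warntype boxed_sum(rand(100))` shows `total::Core.Box`.
# The usual fix is to hold the accumulator in a Ref (or use a local function):
function unboxed_sum(xs)
    total = Ref(0.0)
    foreach(x -> (total[] += x), xs)
    return total[]
end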

gbaraldi avatar Nov 25 '25 15:11 gbaraldi

According to @vtjnash this is somewhat known: when compiling, we do not block the other threads from trying to run the code, which means they might run a kind of bad version of it (if I understand things correctly). @xal-0 said he had some interest in maybe working on it.
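If that is the mechanism, one mitigation in the meantime might be to force compilation of the hot method before entering the threaded region, e.g. with an explicit precompile call (a sketch with a hypothetical kernel, not the issue's code):

# Sketch with a hypothetical kernel (not the issue's code): compile the hot
# method signature on the main task before any thread tries to run it.
kernel(x::Float64) = sin(x)^2 + cos(x)^2

precompile(kernel, (Float64,))        # force compilation ahead of the loop

xs = rand(1_000_000)
out = similar(xs)
Threads.@threads for i in eachindex(xs)
    out[i] = kernel(xs[i])            # threads start from compiled code
end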

KristofferC avatar Nov 25 '25 17:11 KristofferC

I tried generating (--trace-compile) the code for my sysimage with --threads=1 (and a single task).
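Roughly the workflow I mean, with hypothetical file and package names:

# 1. Record precompile statements while exercising the app single-threaded:
#      julia --threads=1 --trace-compile=precompile.jl runapp.jl
# 2. Bake them into a sysimage with PackageCompiler:
using PackageCompiler
create_sysimage(["MyApp"];
    sysimage_path = "sys_myapp.so",
    precompile_statements_file = "precompile.jl")
# 3. Run against the new image:
#      julia --sysimage=sys_myapp.so --threads=6 runapp.jl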

The new sysimage compared to my old one generated with --threads=6 has quite different behaviour:

  • GC time increases by around 34%, sometimes 100%.
  • It makes 4 times more allocations (GC_Num.poolalloc), sometimes 14 times more.

Despite this, the application's total elapsed time does not change much compared to the old sysimage.

Any suggestion?

dpinol avatar Dec 05 '25 18:12 dpinol

Leaving a comment here since I said I was indirectly working on solving this problem in #60031. The big problem with the current version of concurrent compilation is that we need to resolve the target of every invoke before we compile the LLVM module for a CodeInstance. In local-names-linking, out of necessity, we defer that until after compilation.

xal-0 avatar Dec 08 '25 18:12 xal-0