
Latency

Open antoine-levitt opened this issue 4 years ago • 11 comments

Could be good to run https://discourse.julialang.org/t/new-tools-for-reducing-compiler-latency/52882, although I would imagine that our latency cost is because of our deps, not because of us.

antoine-levitt avatar Jan 05 '21 15:01 antoine-levitt

Actually I think SnoopCompile will help quite a bit. Could also be super helpful for my new design in asedftk. I'm definitely going to take a look at that at some point.

mfherbst avatar Jan 05 '21 18:01 mfherbst

At least it'll tell us which packages slow us down and then we can bug them.

antoine-levitt avatar Jan 05 '21 19:01 antoine-levitt

I just tried the new 1.6 beta; a lot of things feel faster in general, but time julia silicon.jl is actually slower, so there might be some scope for latency optimization in DFTK.

antoine-levitt avatar Jan 09 '21 20:01 antoine-levitt

Something is seriously wrong somewhere. Just taking out lobpcg_hyper_impl.jl with a simple test script

A = randn(10, 10); A = A + A'   # random symmetric matrix
X = randn(10, 2)                # block of two initial vectors
ret = LOBPCG(A, X)
println(ret.n_matvec)

has a second @time include of 4 s, due to inference. This is on Julia 1.6.1.
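
A way to see where those 4 s of inference go (a sketch, assuming SnoopCompile.jl is installed; the file name lobpcg_test.jl is hypothetical and stands for the four-line script above):

```julia
using SnoopCompile

# Record every fresh type-inference entry, with timings, made while
# the script runs for the first time in this session.
tinf = @snoopi_deep include("lobpcg_test.jl")

# Flatten the inference tree into per-method timings; the most
# expensive entries point at the methods worth tightening or precompiling.
flat = flatten(tinf)
foreach(println, flat[end-9:end])  # ten most expensive inference entries
```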

antoine-levitt avatar Jun 06 '21 13:06 antoine-levitt

Playing with https://github.com/JuliaLang/julia/pull/41612:

julia> using DFTK
    125.7 ms  FillArrays
    965.6 ms  ArrayLayouts
    853.7 ms  StaticArrays
    859.2 ms  DiffResults
      4.0 ms  Compat
      6.0 ms  Ratios
      1.0 ms  Requires
      0.9 ms  Adapt
      7.9 ms  WoodburyMatrices
     11.6 ms  AxisAlgorithms
     57.2 ms  ChainRulesCore
     55.1 ms  OffsetArrays
    198.2 ms  Interpolations
     86.4 ms  DataStructures
      0.7 ms  NaNMath
      0.3 ms  UnPack
      1.4 ms  DataAPI
      0.4 ms  StatsAPI
      0.9 ms  SortingAlgorithms
     11.3 ms  Missings
     46.0 ms  StatsBase
      1.8 ms  PositiveFactorizations
      1.8 ms  Parameters
      4.3 ms  DocStringExtensions
      0.6 ms  CompilerSupportLibraries_jll
      3.1 ms  LogExpFunctions
      0.6 ms  Preferences
      2.8 ms  JLLWrappers
     12.6 ms  OpenSpecFun_jll
     33.3 ms  SpecialFunctions
     54.1 ms  MacroTools
     57.5 ms  CommonSubexpressions
      1.1 ms  DiffRules
    193.2 ms  ForwardDiff
      0.3 ms  IfElse
     48.9 ms  Static
    198.5 ms  ArrayInterface
     14.6 ms  FiniteDiff
      7.8 ms  NLSolversBase
    478.1 ms  LineSearches
    564.8 ms  Optim
      0.5 ms  ExprTools
     26.1 ms  RecipesBase
      2.2 ms  Mocking
     67.1 ms  TimeZones
    102.5 ms  Intervals
    311.7 ms  MutableArithmetics
   1202.4 ms  Polynomials
    461.6 ms  Libxc_jll
    466.3 ms  Libxc
      6.7 ms  Distances
      0.3 ms  VersionParsing
     13.2 ms  Conda
   1476.3 ms  PyCall
     10.2 ms  AbstractFFTs
      4.3 ms  MPICH_jll
     52.1 ms  MPI
      0.6 ms  Reexport
    297.0 ms  LinearMaps
      0.5 ms  CommonSolve
      1.8 ms  ConstructionBase
     47.6 ms  TimerOutputs
    195.7 ms  Unitful
   1121.0 ms  PeriodicTable
     36.0 ms  IterTools
      3.1 ms  ProgressMeter
     20.5 ms  IterativeSolvers
      4.9 ms  NLsolve
    333.1 ms  UnitfulAtomic
      1.3 ms  FFTW_jll
    274.1 ms  FFTW
    193.0 ms  BlockArrays
      1.4 ms  spglib_jll
      9.5 ms  Roots
   8915.1 ms  DFTK
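
(That PR landed as @time_imports in Julia 1.8, so the same breakdown can be reproduced on a release build with:)

```julia
julia> using InteractiveUtils  # exports @time_imports on Julia >= 1.8

julia> @time_imports using DFTK
```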

antoine-levitt avatar Jul 16 '21 12:07 antoine-levitt

Hmm, once #483 is in we can probably make PyCall optional; that alone is quite a large chunk. PeriodicTable is also there mostly for convenience, but I wonder why it takes so long to load; it is a pretty small package.

mfherbst avatar Jul 17 '21 12:07 mfherbst

But 9 seconds is really not great. We should definitely work on that.

mfherbst avatar Jul 17 '21 12:07 mfherbst

The load time is not even that much of an issue; the time to first SCF feels longer in my experience. The LOBPCG stuff is apparently responsible for some of it (though I have no idea why).

antoine-levitt avatar Jul 17 '21 13:07 antoine-levitt

https://discourse.julialang.org/t/ann-new-package-snoopprecompile/84778/4 sounds like exactly what we need

antoine-levitt avatar Jul 30 '22 11:07 antoine-levitt

No, it does not help. I tried it yesterday. We have too many invalidations; we need to fix those first.

mfherbst avatar Jul 30 '22 11:07 mfherbst

Pretty sure that's not our issue. They always talk about that and it's one source of latency, but I don't think that's the one we suffer from. At least I couldn't find anything problematic when I took a look last time.

antoine-levitt avatar Jul 30 '22 12:07 antoine-levitt

List of things to try:

  • see if removing NLsolve helped (thanks @epolack )
  • once Julia PR 47184 is in, add precompilation (i.e. run an example computation during precompilation)
  • transition to package extensions (https://pkgdocs.julialang.org/dev/creating-packages/#Conditional-loading-of-code-in-packages-(Extensions)) instead of Requires for 1.9
  • fix the LOBPCG slowness described above
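
For the extensions item, the mechanism is declared in Project.toml; a sketch (the extension name DFTKPyCallExt and the choice of PyCall as the weak dependency are illustrative):

```toml
# PyCall moves from [deps] to [weakdeps]; the glue code lives in
# ext/DFTKPyCallExt.jl and is loaded only when the user also loads
# PyCall (Julia >= 1.9). On 1.8 and earlier, Requires can stay as
# a fallback.
[weakdeps]
PyCall = "438e738f-606a-5dbb-bf0a-cddfbfd45ab0"

[extensions]
DFTKPyCallExt = "PyCall"
```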

antoine-levitt avatar Dec 14 '22 15:12 antoine-levitt

The Julia beta released today has the latency improvements, which look pretty spectacular. Not at a computer for a while, but very curious to see what it looks like for us.

antoine-levitt avatar Dec 29 '22 20:12 antoine-levitt

I just tested this expecting some improvements. But it seems there is a regression. I tested the first example in the README and measured the time to first SCF:

using DFTK, Unitful, UnitfulAtomic

# 1. Define lattice and atomic positions
a = 5.431u"angstrom"          # Silicon lattice constant
lattice = a / 2 * [[0 1 1.];  # Silicon lattice vectors
                   [1 0 1.];  # specified column by column
                   [1 1 0.]];

# Load HGH pseudopotential for Silicon
Si = ElementPsp(:Si, psp=load_psp("hgh/lda/Si-q4"))

# Specify type and positions of atoms
atoms     = [Si, Si]
positions = [ones(3)/8, -ones(3)/8]

# 2. Select model and basis
model = model_LDA(lattice, atoms, positions)
kgrid = [4, 4, 4]     # k-point grid (Regular Monkhorst-Pack grid)
Ecut = 7              # kinetic energy cutoff
# Ecut = 190.5u"eV"  # Could also use eV or other energy-compatible units
basis = PlaneWaveBasis(model; Ecut, kgrid)
# Note the implicit passing of keyword arguments here:
# this is equivalent to PlaneWaveBasis(model; Ecut=Ecut, kgrid=kgrid)

# 3. Run the SCF procedure to obtain the ground state
@time scfres = self_consistent_field(basis, tol=1e-5);

Following are the results for two runs each in 1.8.4 and 1.9.0-beta2.

1.8.4:

julia> Pkg.status()
Status `~/.julia/environments/dftk/Project.toml`
  [acf6eb54] DFTK v0.6.0 `~/GitProjects/DFTK`

julia> versioninfo()
Julia Version 1.8.4
Commit 00177ebc4fc (2022-12-23 21:32 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, nehalem)
  Threads: 4 on 8 virtual cores
Environment:
  JULIA_NUM_THREADS = 4
  JULIA_PKG_DEVDIR = /home/rashid/GitProjects
  JULIA_NUM_PRECOMPILE_TASKS = 6
  JULIA_CONDAPKG_VERBOSITY = 0
  JULIA_CONDAPKG_EXE = conda

Run 1: 70.294600 seconds (128.09 M allocations: 6.876 GiB, 5.40% gc time, 98.92% compilation time)
Run 2: 72.600813 seconds (128.07 M allocations: 6.875 GiB, 4.58% gc time, 98.76% compilation time)

1.9.0-beta2:

julia> Pkg.status()
Status `~/.julia/environments/dftk/Project.toml`
  [acf6eb54] DFTK v0.6.0 `~/GitProjects/DFTK`

julia> versioninfo()
Julia Version 1.9.0-beta2
Commit 7daffeecb8c (2022-12-29 07:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, nehalem)
  Threads: 4 on 8 virtual cores
Environment:
  JULIA_NUM_THREADS = 4
  JULIA_PKG_DEVDIR = /home/rashid/GitProjects
  JULIA_NUM_PRECOMPILE_TASKS = 6
  JULIA_CONDAPKG_VERBOSITY = 0
  JULIA_CONDAPKG_EXE = conda

Run 1: 81.235901 seconds (80.53 M allocations: 5.735 GiB, 7.46% gc time, 168.66% compilation time)
Run 2: 81.807179 seconds (80.53 M allocations: 5.735 GiB, 6.91% gc time, 167.89% compilation time)

rashidrafeek avatar Dec 30 '22 10:12 rashidrafeek

Oh :( thanks for testing though. Next step is to put this snippet in DFTK.jl so it gets precompiled... Otherwise something must be blocking precompilation, maybe the TimerOutputs macro...

antoine-levitt avatar Dec 30 '22 11:12 antoine-levitt

From what I read the new changes introduce a load-time regression, the trade-off being more efficient precompilation, so these numbers are not that crazy; they just mean that precompilation is not working for us.

antoine-levitt avatar Dec 30 '22 11:12 antoine-levitt

The following diff adds precompilation to DFTK:

diff --git a/Project.toml b/Project.toml
index 284eacdc..94b15148 100644
--- a/Project.toml
+++ b/Project.toml
@@ -35,6 +35,7 @@ PseudoPotentialIO = "cb339c56-07fa-4cb2-923a-142469552264"
 Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
 Requires = "ae029012-a4dd-5104-9daa-d747884805df"
 Roots = "f2b01f46-fcfa-551c-844a-d8ac1e96c665"
+SnoopPrecompile = "66db9d55-30c0-4569-8b51-7e840670fc0c"
 SparseArrays = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"
 SpecialFunctions = "276daf66-3868-5448-9aa4-cd146d93841b"
 Spglib = "f761d5c5-86db-4880-b97f-9680a7cccfb5"
diff --git a/src/DFTK.jl b/src/DFTK.jl
index 09416da9..7985d9a1 100644
--- a/src/DFTK.jl
+++ b/src/DFTK.jl
@@ -232,4 +232,19 @@ function __init__()
     end
 end
 
+using SnoopPrecompile
+
+begin
+a = 10.26  # Silicon lattice constant in Bohr
+lattice = a / 2 * [[0 1 1.];
+                   [1 0 1.];
+                   [1 1 0.]]
+Si = ElementPsp(:Si, psp=load_psp("hgh/lda/Si-q4"))
+atoms     = [Si, Si]
+positions = [ones(3)/8, -ones(3)/8]
+
+model = model_LDA(lattice, atoms, positions)
+basis = PlaneWaveBasis(model; Ecut=15, kgrid=[4, 4, 4])
+scfres = self_consistent_field(basis, tol=1e-8)
+end
 end # module DFTK

Some numbers on 1.9-beta.

Without precompilation code:

  1 dependency successfully precompiled in 5 seconds. 111 already precompiled.
 42.652828 seconds (122.06 M allocations: 8.922 GiB, 7.15% gc time, 141.72% compilation time: 1% of which was recompilation)

With precompilation code:

  1 dependency successfully precompiled in 59 seconds. 103 already precompiled.
  8.373334 seconds (18.88 M allocations: 1.944 GiB, 4.29% gc time, 57.40% compilation time: 59% of which was recompilation)

So all in all a very nice win for users, not so much for developers though. I asked on the Julia slack, and people recommend turning off precompilation of dev'ed packages. Should we just merge this diff, at least when 1.9 is released? @rashidrafeek, can you post your numbers with this diff, since they're much slower than mine?
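
One way to avoid repeatedly paying that cost while developing (an aside on my part: this disables Pkg's automatic precompilation globally via an environment variable, not per package):

```shell
# Stop Pkg from auto-precompiling after every change to a dev'ed
# package; run `using Pkg; Pkg.precompile()` manually when wanted.
export JULIA_PKG_PRECOMPILE_AUTO=0
```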

antoine-levitt avatar Jan 04 '23 13:01 antoine-levitt

It's a lot better after applying this patch! Numbers from my system with 1.9.0-beta2:

Without precompilation code:

  1 dependency successfully precompiled in 13 seconds. 103 already precompiled.
 78.601886 seconds (80.51 M allocations: 5.735 GiB, 6.91% gc time, 169.27% compilation time)

With precompilation code:

  1 dependency successfully precompiled in 167 seconds. 103 already precompiled.
 10.451499 seconds (10.32 M allocations: 875.533 MiB, 3.96% gc time, 149.20% compilation time)

rashidrafeek avatar Jan 04 '23 14:01 rashidrafeek

Can you try the same patch but with @precompile_all_calls before the begin?
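
That is, the begin block from the diff above wrapped so that SnoopPrecompile records and caches everything the workload calls (same illustrative parameters as before):

```julia
using SnoopPrecompile

# The whole example SCF run becomes the precompilation workload;
# every method it hits gets compiled into DFTK's cache file.
@precompile_all_calls begin
    a = 10.26  # Silicon lattice constant in Bohr
    lattice = a / 2 * [[0 1 1.];
                       [1 0 1.];
                       [1 1 0.]]
    Si = ElementPsp(:Si, psp=load_psp("hgh/lda/Si-q4"))
    atoms     = [Si, Si]
    positions = [ones(3)/8, -ones(3)/8]

    model = model_LDA(lattice, atoms, positions)
    basis = PlaneWaveBasis(model; Ecut=15, kgrid=[4, 4, 4])
    scfres = self_consistent_field(basis, tol=1e-8)
end
```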

antoine-levitt avatar Jan 04 '23 14:01 antoine-levitt

Hmm that's still pretty impressive overall. Nice step in the right direction!

mfherbst avatar Jan 04 '23 14:01 mfherbst

Can you try the same patch but with @precompile_all_calls before the begin?

  1 dependency successfully precompiled in 170 seconds. 103 already precompiled.
  1.412220 seconds (467.34 k allocations: 216.655 MiB, 4.57% gc time, 41.23% compilation time)

:astonished:

rashidrafeek avatar Jan 04 '23 14:01 rashidrafeek

Huh, I had more than this. Did you do both steps in the same session?

antoine-levitt avatar Jan 04 '23 14:01 antoine-levitt

Meaning, both precompilation and running the code? I tried in the same session as well as after restarting Julia, as I was not sure of the result. Both gave similar results. A result after restarting Julia:

  1.186028 seconds (466.23 k allocations: 216.232 MiB, 3.85% gc time, 38.92% compilation time)

rashidrafeek avatar Jan 04 '23 14:01 rashidrafeek

Oh wow. OK fine, let's just do it and eat the precompilation cost. I'll add a flag so devs can disable it.
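
Such a flag could look like this with Preferences.jl (a sketch; the key name "precompile_workload" is made up):

```julia
# Inside the DFTK module. Users (or CI) opt out per-project through
# LocalPreferences.toml without touching the package source:
#   [DFTK]
#   precompile_workload = false
using Preferences, SnoopPrecompile

if @load_preference("precompile_workload", true)
    @precompile_all_calls begin
        # ... the example SCF workload from the diff above ...
    end
end
```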

antoine-levitt avatar Jan 04 '23 14:01 antoine-levitt

Ah, I see where the discrepancy between your numbers and mine comes from: you're measuring only the last @time in the above, whereas I was measuring the whole thing (using, model and basis construction, self_consistent_field). So we get similar relative timings but your machine is slower than mine. (Thanks @giordano on slack!)

antoine-levitt avatar Jan 04 '23 19:01 antoine-levitt