
Very slow "using Flux"

Open mattiadg opened this issue 2 years ago • 17 comments

Hi, I'm new here. I would like to use Flux, but every time I run using Flux it takes 23 seconds on my i5 CPU. The following is the full output:

@time using Flux  
23.719183 seconds (22.00 M allocations: 1.437 GiB, 5.05% gc time, 43.13% compilation time)

Then it becomes considerably faster the second time:

@time using Flux  
0.467162 seconds (888.83 k allocations: 47.272 MiB, 99.61% compilation time)

I found #1283, but it is two years old and I didn't really find a solution there. Is such a slow startup expected?

mattiadg avatar May 08 '22 00:05 mattiadg

That is indeed slower than expected. We'll want to see your versioninfo and what version of Flux you're trying to load at a minimum, but assuming your CPU is rather recent I would suspect an out-of-date Julia version to be the culprit.

ToucheSir avatar May 08 '22 01:05 ToucheSir

My CPU is some years old now, but I hope that's not the problem.

> versioninfo()

Julia Version 1.7.1
Commit ac5cc99908 (2021-12-22 19:35 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)

and Flux v0.12.10

mattiadg avatar May 08 '22 07:05 mattiadg

I don't think so, but your operating system might. Can you try timing without any AV or related programs running?

ToucheSir avatar May 08 '22 23:05 ToucheSir

If you use a Julia 1.9 dev version, then you can use the fancy @time_imports macro to get a per-package profile of loading time, e.g.,

julia> versioninfo()
Julia Version 1.9.0-DEV.351
Commit 385762b444 (2022-04-08 21:50 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 24 × 12th Gen Intel(R) Core(TM) i9-12900K
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, goldmont)
  Threads: 1 on 24 virtual cores

julia> @time_imports using Flux
      8.3 ms    ┌ MacroTools
     15.9 ms  ┌ ZygoteRules
      0.2 ms  ┌ DefineSingletons
      0.1 ms    ┌ IteratorInterfaceExtensions
      0.3 ms  ┌ TableTraits
      0.6 ms    ┌ Compat
      4.5 ms    ┌ OrderedCollections
      0.3 ms    ┌ Requires
      0.1 ms    ┌ DataValueInterfaces
      0.7 ms      ┌ DataAPI
      8.1 ms    ┌ Tables
      0.5 ms    ┌ ConstructionBase
     18.7 ms    ┌ Setfield
     10.5 ms    ┌ InitialValues
     75.0 ms  ┌ BangBang
    329.3 ms    ┌ StaticArrays
    807.4 ms  ┌ DiffResults
      9.0 ms  ┌ FunctionWrappers
     41.8 ms  ┌ DataStructures
      0.4 ms  ┌ NaNMath
      0.1 ms    ┌ CompositionsBase
     34.3 ms  ┌ Accessors
      1.4 ms  ┌ ContextVariablesX
      0.2 ms    ┌ RealDot
      3.2 ms    ┌ IrrationalConstants
     48.0 ms    ┌ ChainRulesCore
     63.0 ms  ┌ ChainRules
      5.5 ms    ┌ AbstractFFTs
     35.3 ms        ┌ Preferences
     35.9 ms      ┌ JLLWrappers
    171.4 ms    ┌ LLVMExtra_jll
      0.3 ms    ┌ Reexport
     16.4 ms      ┌ RandomNumbers
     19.7 ms    ┌ Random123
      2.8 ms        ┌ CEnum
    133.3 ms      ┌ LLVM
      0.5 ms      ┌ Adapt
    600.2 ms    ┌ GPUArrays
      0.5 ms    ┌ ExprTools
     42.8 ms    ┌ TimerOutputs
      3.9 ms    ┌ BFloat16s
    278.0 ms    ┌ GPUCompiler
     20.3 ms    ┌ NNlib
    726.3 ms    ┌ CUDA
   2080.8 ms  ┌ NNlibCUDA
    115.0 ms  ┌ FillArrays
      0.6 ms  ┌ ChangesOfVariables
      0.8 ms    ┌ InverseFunctions
      4.1 ms    ┌ DocStringExtensions
      0.4 ms    ┌ StatsAPI
      0.4 ms    ┌ SortingAlgorithms
      0.7 ms    ┌ LogExpFunctions
      6.2 ms    ┌ Missings
     31.2 ms  ┌ StatsBase
      9.4 ms    ┌ MicroCollections
      0.4 ms    ┌ ArgCheck
      6.2 ms    ┌ SplittablesBase
     27.5 ms    ┌ Baselet
     93.7 ms  ┌ Transducers
      0.4 ms  ┌ OpenLibm_jll
    193.0 ms  ┌ MLStyle
      0.6 ms      ┌ OpenSpecFun_jll
    115.8 ms    ┌ SpecialFunctions
      0.5 ms    ┌ CommonSubexpressions
      0.7 ms    ┌ DiffRules
    417.8 ms  ┌ ForwardDiff
      0.2 ms  ┌ IfElse
      1.6 ms  ┌ PrettyPrint
     29.5 ms    ┌ Static
    132.2 ms  ┌ ArrayInterface
      2.8 ms  ┌ ProgressLogging
      6.5 ms  ┌ ShowCases
    349.2 ms  ┌ FoldsThreads
      0.3 ms  ┌ FLoopsBase
      0.5 ms  ┌ NameResolution
      3.5 ms  ┌ JuliaVariables
      6.8 ms  ┌ FLoops
      1.5 ms  ┌ Functors
     10.4 ms  ┌ IRTools
   1714.0 ms  ┌ Zygote
      7.8 ms  ┌ Optimisers
      5.6 ms  ┌ MLUtils
   6324.0 ms  Flux

johnnychen94 avatar May 09 '22 05:05 johnnychen94

I had WSL running, but now that I've closed it, it's even worse:

> @time using Flux
 44.074991 seconds (20.31 M allocations: 1.349 GiB, 4.12% gc time, 28.61% compilation time)

Should I use a pre-release version of Julia?

mattiadg avatar May 09 '22 07:05 mattiadg

The presence/absence of WSL seems like a red herring to me. Just to confirm, these times are sans AV or any other program that might try to sniff actively running executables?

ToucheSir avatar May 09 '22 14:05 ToucheSir

I tried without the AV, but the time is again around 40 seconds. I don't notice strange performance problems in any other running software.

mattiadg avatar May 09 '22 20:05 mattiadg

As a reference point, here's what I get on an older Xeon (from 2012, while it seems the i7-6500U is from 2015) with the latest Flux.

tl;dr: I don't think 20s implies something is wrong on your end, sadly.

julia> @time using Flux  # first time, after updating some packages
[ Info: Precompiling Flux [587475ba-b771-5e3f-ad9e-33799f191a9c]
134.147325 seconds (20.18 M allocations: 1.405 GiB, 0.61% gc time, 7.73% compilation time: 49% of which was recompilation)

# restart Julia, no changes

julia> @time using Flux
 22.720761 seconds (20.17 M allocations: 1.404 GiB, 3.77% gc time, 45.13% compilation time: 49% of which was recompilation)

julia> @time using CUDA  # after re-starting again
  8.594542 seconds (9.48 M allocations: 705.886 MiB, 3.71% gc time, 26.78% compilation time: 85% of which was recompilation)

julia> versioninfo()  # cyclops, an elderly xeon
Julia Version 1.9.0-DEV.454
Commit 6026b5ff2a (2022-04-29 12:16 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 12 × Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, broadwell)
  Threads: 4 on 12 virtual cores

Julia 1.7.2 is a little quicker:

julia> @time using Flux
 18.600435 seconds (23.27 M allocations: 1.445 GiB, 6.86% gc time, 47.83% compilation time)

Running the fancy new macro on 1.9, and trimming its output a bit, here's who the culprits are:

julia> @time_imports using Flux
     25.7 ms    ┌ MacroTools
     44.8 ms  ┌ ZygoteRules 40.40% compilation time
    227.7 ms  ┌ BangBang 29.65% compilation time (6% recompilation)
   1470.1 ms    ┌ StaticArrays
   3226.1 ms  ┌ DiffResults 53.16% compilation time (85% recompilation)
    174.0 ms  ┌ DataStructures
    211.8 ms    ┌ ChainRulesCore
    271.2 ms  ┌ ChainRules
    545.6 ms    ┌ LLVMExtra_jll 85.64% compilation time (99% recompilation)
    429.4 ms      ┌ LLVM 32.98% compilation time
   2608.8 ms    ┌ GPUArrays 5.43% compilation time
    840.0 ms    ┌ GPUCompiler 33.01% compilation time (94% recompilation)
   2965.1 ms    ┌ CUDA 0.69% compilation time
   8070.0 ms  ┌ NNlibCUDA 20.35% compilation time (84% recompilation)
    479.0 ms  ┌ FillArrays
      2.8 ms  ┌ ChangesOfVariables
    111.6 ms  ┌ StatsBase 5.19% compilation time
    320.1 ms  ┌ Transducers 12.56% compilation time
      0.9 ms  ┌ OpenLibm_jll
   1085.4 ms  ┌ MLStyle
   1211.3 ms  ┌ ForwardDiff 61.29% compilation time (85% recompilation)
     97.5 ms    ┌ Static
    422.4 ms  ┌ ArrayInterface 49.69% compilation time
      8.4 ms  ┌ ProgressLogging
     22.6 ms  ┌ ShowCases
   1281.1 ms  ┌ FoldsThreads 92.11% compilation time
      0.8 ms  ┌ FLoopsBase
      1.2 ms  ┌ NameResolution
      9.3 ms  ┌ JuliaVariables
     20.2 ms  ┌ FLoops
      3.7 ms  ┌ Functors
     30.8 ms  ┌ IRTools
   5583.3 ms  ┌ Zygote 92.38% compilation time (37% recompilation)
     29.0 ms  ┌ Optimisers
     18.0 ms  ┌ MLUtils
  23181.4 ms  Flux 46.75% compilation time (50% recompilation)

NNlibCUDA and DiffResults jump out as surprises; is there something wrong with these? Xref https://github.com/JuliaDiff/ForwardDiff.jl/issues/518 and https://github.com/JuliaArrays/StaticArrays.jl/issues/1023 for recent discussion.

Surely we could also trim a few of these; do we really need FoldsThreads and MLStyle?

mcabbott avatar May 09 '22 20:05 mcabbott

Is there a recommended set of hardware for running Flux, then?


mattiadg avatar May 09 '22 20:05 mattiadg

Those are slow, but I would expect a Skylake chip (even a mobile one) to run faster. For example, here is my Haswell machine:

julia> @time using Flux
  7.409032 seconds (22.81 M allocations: 1.422 GiB, 7.05% gc time, 48.99% compilation time)

julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, haswell)

That's on spinning rust, not an SSD like I would expect to see in a laptop with the kind of CPU @mattiadg has. Since there's no parallelism here, I would not expect a 1.6x clock speed increase and a microarchitectural disadvantage to lead to a 3x (let alone 6x) difference. Hence I suspect some sort of OS or other software interference. Have you checked whether your machine is on spec in other single-threaded benchmarks? The only other factors I can think of are generic troubleshooting ones, like making sure RAM/disk aren't at their limits and power settings are cranked all the way up (which IIRC makes a huge difference on U-series chips).
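
If you want a quick, Flux-independent sanity check, something along these lines gives a rough single-core number to compare against other machines (a minimal sketch using the LinearAlgebra stdlib; 4096 is just a convenient matrix size):

julia> using LinearAlgebra

julia> BLAS.set_num_threads(1)             # force single-threaded BLAS

julia> LinearAlgebra.peakflops(4096)       # rough single-core FLOP/s estimate from a matmul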

ToucheSir avatar May 09 '22 21:05 ToucheSir

When I closed all the open apps, including VS Code and the AV, and set the battery to max performance, it got down to 11s. Without max performance it was at 19s. Maybe the rest comes from the old CPU. Isn't Julia able to compile on multiple cores?


mattiadg avatar May 09 '22 21:05 mattiadg

Great! 11s looks more reasonable and is hopefully more tolerable. The AV may have been a red herring, which is also great.

Maybe the rest comes from the old CPU. Isn't Julia able to compile on multiple cores?

I do not believe the import process in Julia can run multi-threaded, so probably not.
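
What does run in parallel (since Julia 1.6) is precompilation itself, so one thing you can do is pay that cost ahead of time rather than inside the first using Flux (a minimal sketch; the parallelism is across packages in the dependency graph):

julia> import Pkg

julia> Pkg.precompile()   # precompiles the active environment in parallel worker processes

The number of workers can be tuned via the JULIA_NUM_PRECOMPILE_TASKS environment variable.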

ToucheSir avatar May 09 '22 21:05 ToucheSir

NNlibCUDA and DiffResults jump out as surprises; is there something wrong with these? Xref JuliaDiff/ForwardDiff.jl#518 and JuliaArrays/StaticArrays.jl#1023 for recent discussion.

I asked about this on Slack and the conclusion was rather unsatisfying. In short (@IanButterworth can correct me if I'm wrong), just because @time_imports assigns a particular (re)compilation time to a dep doesn't mean that dep is personally responsible. Ref. https://github.com/JuliaLang/julia/pull/45064#issuecomment-1107615994: tracking provenance turns out to be non-trivial. For this specific case, SnoopCompile reports no invalidations caused by NNlibCUDA, so most (if not all) of the recompilation reported for it is coming from deps or transitive deps.
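
For anyone who wants to reproduce that check, the usual SnoopCompile workflow looks roughly like this (a sketch; run it in a fresh session so the invalidations haven't already happened):

julia> using SnoopCompileCore

julia> invalidations = @snoopr using NNlibCUDA;

julia> using SnoopCompile

julia> invalidation_trees(invalidations)   # near-empty here, i.e. NNlibCUDA itself invalidates little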

Surely we could also trim a few of these, do we really need FoldsThreads, MLStyle?

Those are deps of MLUtils. I think the former is pretty fundamental, but the latter is only used in one place. @lorenzoh, would it be easy to desugar that FLoops macro so that the dep can be dropped?

ToucheSir avatar May 10 '22 17:05 ToucheSir

Re NNlibCUDA: with more coffee, its time is closer to the sum of the unavoidable packages above it than I first thought. So maybe it's innocent!

Would be great if all GPU stuff could be skipped when not necessary, but maybe not easy.

mcabbott avatar May 10 '22 17:05 mcabbott

Those are deps of MLUtils. I think the former is pretty fundamental, but the latter is only used in one place. @lorenzoh, would it be easy to desugar that FLoops macro so that the dep can be dropped?

Surely possible to desugar it and use plain Transducers.jl + FoldsThreads.jl, though I wonder whether those two are really the main compilation problem.
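
For concreteness, the desugaring could look something like this (a sketch with hypothetical f and xs; foldxt and Map are the actual Transducers.jl API):

using Transducers

f(x) = x^2        # stand-in for the per-element work
xs = 1:1000

# roughly what `@floop for x in xs; @reduce(s += f(x)); end` lowers to:
s = foldxt(+, Map(f), xs)   # threaded fold, no FLoops/MLStyle macro layer needed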

lorenzoh avatar May 12 '22 12:05 lorenzoh

Would be great if all GPU stuff could be skipped when not necessary, but maybe not easy.

Not sure if this is the right application for this, but for a new approach to loading a module only when it is actually used, see https://github.com/johnnychen94/LazyModules.jl

lorenzoh avatar May 12 '22 12:05 lorenzoh

For CUDA.jl it will almost certainly hit the world-age issue, because CuArray is an alien type to the caller's world age. See the discussion in [ANN] Announcing LazyModules.jl: delay the heavy dependency loading to its first use.

IIUC, CUDA.jl must be eagerly loaded. NNlibCUDA.jl, on the other hand, might be okay to load lazily. But this trick introduces a ~80ns overhead for each function call, so it's not ideal for trivial computations.

The ideal use case for LazyModules is a standalone dependency that doesn't overlap with, or couple deeply to, the main codebase. ImageIO is a perfect example of it: https://github.com/JuliaIO/ImageIO.jl/pull/49.
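
To make the world-age point concrete, here is a minimal hand-rolled version of the pattern (this is not LazyModules' actual API, just the underlying idea):

# Defer loading NNlibCUDA until first use; every forwarded call must then go
# through invokelatest, which is where the per-call overhead comes from.
module LazyNNlibCUDA

const _mod = Ref{Module}()

function _load()
    isassigned(_mod) || (_mod[] = Base.require(Main, :NNlibCUDA))
    return _mod[]
end

# invokelatest is needed because the loaded methods are newer than the
# caller's fixed world age.
forward(f::Symbol, args...; kw...) = Base.invokelatest(getfield(_load(), f), args...; kw...)

end # module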

johnnychen94 avatar May 12 '22 13:05 johnnychen94

The future of this is probably https://github.com/JuliaLang/julia/pull/47695. That PR has nice instructions... anyone want to make a branch using Requires.jl and see how much slower this actually is? (Since that's how pre-47695 versions of Julia would then have to use Flux.)
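
For reference, the Requires.jl flavor of such a branch would look roughly like this inside Flux's top-level module (a sketch; the included file path is hypothetical, and the UUID is CUDA.jl's):

using Requires

function __init__()
    # only pull in the CUDA integration once CUDA.jl is loaded in the session
    @require CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba" include("cuda/cuda.jl")
end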

mcabbott avatar Dec 08 '22 01:12 mcabbott

Backwards compatibility doesn't mean you have to use Requires. You can keep the package as-is and use extensions; that's explained in the Pkg docs.

It would be great to make CUDA weak!
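
Concretely, the mechanism from the Pkg docs would look like this in Flux's Project.toml (a sketch; the extension name is hypothetical):

[weakdeps]
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"

[extensions]
FluxCUDAExt = "CUDA"

Pkg then loads ext/FluxCUDAExt.jl automatically once both Flux and CUDA are in the session, and CUDA stops being a hard dependency.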

IanButterworth avatar Dec 08 '22 02:12 IanButterworth

Ah, just load it unconditionally on old Julia, and via extensions on new. That sounds even better; it won't get worse for anyone.
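
The usual feature check for that dual-mode setup (a sketch, inside Flux's module; the include path is hypothetical):

@static if !isdefined(Base, :get_extension)   # Julia < 1.9: no package extensions
    using CUDA
    include("../ext/FluxCUDAExt.jl")   # eagerly run the same code the extension would
end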

mcabbott avatar Dec 08 '22 02:12 mcabbott

No longer relevant on Julia 1.9 + package extensions.

CarloLucibello avatar Jul 13 '23 14:07 CarloLucibello