need precompile statements re-enabled for `addprocs` (with PR)
As discovered in https://discourse.julialang.org/t/help-with-binary-trees-benchmark-games-example/37307/13
❯ hyperfine -w1 "julia -p4 -E 'using Distributed; nprocs()'" "julia -E 'using Distributed; addprocs(); nprocs()'"
Benchmark #1: julia -p4 -E 'using Distributed; nprocs()'
Time (mean ± σ): 2.040 s ± 0.010 s [User: 5.563 s, System: 0.773 s]
Range (min … max): 2.024 s … 2.054 s 10 runs
Benchmark #2: julia -E 'using Distributed; addprocs(); nprocs()'
Time (mean ± σ): 1.785 s ± 0.014 s [User: 5.337 s, System: 0.756 s]
Range (min … max): 1.765 s … 1.816 s 10 runs
Summary
'julia -E 'using Distributed; addprocs(); nprocs()'' ran
1.14 ± 0.01 times faster than 'julia -p4 -E 'using Distributed; nprocs()''
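For reference, the summary ratio above can be reproduced from the reported means and standard deviations. This is a minimal sketch, assuming first-order error propagation over independent runs; that assumption is ours and is not confirmed in this thread as the method hyperfine uses. The helper name `speedup` is hypothetical.

```python
import math


def speedup(mean_slow, sd_slow, mean_fast, sd_fast):
    """Return (ratio, uncertainty) for mean_slow / mean_fast.

    The uncertainty assumes the two timings are independent and applies
    first-order propagation of their relative standard deviations.
    """
    ratio = mean_slow / mean_fast
    rel = math.sqrt((sd_slow / mean_slow) ** 2 + (sd_fast / mean_fast) ** 2)
    return ratio, ratio * rel


# Means and standard deviations copied from the two benchmarks above:
# `julia -p4` (slower) versus `addprocs()` (faster).
ratio, err = speedup(2.040, 0.010, 1.785, 0.014)
print(f"{ratio:.2f} ± {err:.2f}")  # prints "1.14 ± 0.01"
```

The printed value matches the `1.14 ± 0.01` in the summary, so the gap is well outside the run-to-run noise.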
Is there a reason spawning the extra processes with `addprocs()` is necessarily faster than spawning them with the `-p` command-line argument?
Probably because the `addprocs` version is already compiled: https://github.com/JuliaLang/julia/blob/0c284839fef6c8c153edc01fddfa37a9f5ac6752/contrib/generate_precompile.jl#L44-L45.
@fredrikekre did you close this because there's no way to get similar speed for `-p4`?
It doesn't seem like this should have been closed. `-p` should be just as fast, and it is needed so that the number of processes is in the hands of the user, not the programmer. See also: https://github.com/JuliaLang/julia/issues/35830#issuecomment-626825539
Should that issue be closed and this one opened then?
No, keep both open. Mine is not a dup (it is about scalability); the two issues are slightly different, and the cause may or may not be the same.
First, for this issue I saw no difference on Julia 1.0 using defaults, nor on the most recent build, ASSUMING these settings only:
$ hyperfine -w1 "~/julia-1.6.0-DEV-8f512f3f6d/bin/julia --compile=min -O0 --startup-file=no -E 'using Distributed; addprocs(4);'"
Benchmark #1: ~/julia-1.6.0-DEV-8f512f3f6d/bin/julia --compile=min -O0 --startup-file=no -E 'using Distributed; addprocs(4);'
Time (mean ± σ): 1.320 s ± 0.011 s [User: 3.226 s, System: 2.114 s]
Range (min … max): 1.304 s … 1.333 s 10 runs
$ hyperfine -w1 "~/julia-1.6.0-DEV-8f512f3f6d/bin/julia -p4 --compile=min --startup-file=no -O0 -E ''"
Benchmark #1: ~/julia-1.6.0-DEV-8f512f3f6d/bin/julia -p4 --compile=min --startup-file=no -O0 -E ''
Time (mean ± σ): 1.323 s ± 0.008 s [User: 3.259 s, System: 2.020 s]
Range (min … max): 1.309 s … 1.335 s 10 runs
With default settings there is a difference, and even with `-O0` the min … max ranges do not overlap. Since I've seen that setting eliminate invalidations, I would say invalidations are implicated?
Now performance is switched, so problem solved!
vtjnash@deepsea4:~/julia$ hyperfine -w1 "./julia -p4 -E 'using Distributed; nprocs()'" "./julia -E 'using Distributed; addprocs(); nprocs()'"
Benchmark 1: ./julia -p4 -E 'using Distributed; nprocs()'
Time (mean ± σ): 8.952 s ± 1.129 s [User: 26.344 s, System: 0.740 s]
Range (min … max): 8.058 s … 10.398 s 10 runs
Warning: The first benchmarking run for this command was significantly slower than the rest (10.222 s). This could be caused by (filesystem) caches that were not filled until after the first run. You should consider using the '--warmup' option to fill those caches before the actual benchmark. Alternatively, use the '--prepare' option to clear the caches before each timing run.
Benchmark 2: ./julia -E 'using Distributed; addprocs(); nprocs()'
Time (mean ± σ): 14.585 s ± 0.315 s [User: 62.846 s, System: 2.424 s]
Range (min … max): 14.057 s … 14.948 s 10 runs
Summary
'./julia -p4 -E 'using Distributed; nprocs()'' ran
1.63 ± 0.21 times faster than './julia -E 'using Distributed; addprocs(); nprocs()''
Clearly this needs more precompile statements. Now that Distributed is a separate stdlib, that is much more reasonable than when it was included in the default image.
Code at https://github.com/JuliaLang/julia/pull/42156
@KristofferC Should we go ahead and enable precompile?