Limit initial OpenBLAS thread count
We set OpenBLAS's initial thread count to 1 to prevent runaway allocation during OpenBLAS's initial thread startup. LinearAlgebra will later call BLAS.set_num_threads() to raise it to the value we actually require.
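In other words, the intended sequence is roughly the following (a sketch only: the OPENBLAS_DEFAULT_NUM_THREADS variable comes from the OpenBLAS patch discussed later in this thread, and the real work happens in C before dlopen() and inside LinearAlgebra's __init__(), not in user code):

```julia
# 1. Before libopenblas is dlopen()'ed, its default is pinned to a single thread:
ENV["OPENBLAS_DEFAULT_NUM_THREADS"] = "1"

# 2. OpenBLAS starts up with one worker thread (and one workspace allocation).

# 3. Later, LinearAlgebra.__init__() raises the count to the value we actually want:
using LinearAlgebra
BLAS.set_num_threads(Sys.CPU_THREADS ÷ 2)
```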
Also GOTO_NUM_THREADS? And I think you can recover the test for this from https://github.com/JuliaLang/julia/pull/42442/files#diff-e299fbf3f1935b1bb9e7ef61e4644dda078ae51fab7559caba8da03817943c9b
I think you can recover the test for this from https://github.com/JuliaLang/julia/pull/42442/files#diff-e299fbf3f1935b1bb9e7ef61e4644dda078ae51fab7559caba8da03817943c9b
I don't see how that is testing this functionality; that's testing that we limit the number of threads to a maximum of 8, but that's not what we do anymore; and in fact, our logic is much closer to what OpenBLAS itself natively does, so it's difficult to test that this is working.
Ah, true. Now that we are later also explicitly changing it, we would need to do that check in __init__, before we set it.
LinearAlgebra.__init__ only calls set_num_threads if !haskey(ENV, "OPENBLAS_NUM_THREADS"); does this interfere with that?
No; so what happens is, if a thread count is already set, neither this logic nor the LinearAlgebra __init__() logic messes with it.
I think I agree with Jeff that this will break LinearAlgebra.__init__, since it will be set now
Oh, of course, my bad. I totally misunderstood the question, and confused myself.
One possibility is to unset the environment variable after dlopen() has returned, but I'm not crazy about that idea. What do you think, Jameson?
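For readers following along, the interference being discussed is roughly the guard below, paraphrased from LinearAlgebra.__init__() (the exact condition may differ):

```julia
using LinearAlgebra

# Julia only applies its own default when the user has not pinned the count already:
if !haskey(ENV, "OPENBLAS_NUM_THREADS")
    BLAS.set_num_threads(Sys.CPU_THREADS ÷ 2)
end
# So if julia itself exported OPENBLAS_NUM_THREADS=1 before dlopen()'ing OpenBLAS,
# this guard would see the variable and never raise the count again — which is the
# breakage discussed above.
```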
I would perhaps rather just patch openblas to default to 1 thread at startup, though this might harm other apps that use it
Okay, I'll just add a short-circuit to OpenBLAS's default thread calculation then. X-ref: https://github.com/JuliaPackaging/Yggdrasil/pull/5555
I am attempting to upstream this patch here: https://github.com/xianyi/OpenBLAS/pull/3773
I'm excited to see this merged, so I can test/use it on master. Its only point is faster Julia startup, right? How much faster is it expected to be?
So, is the "1 failing" check a false alarm? Does the test need to be rerun?
At the very start for freebsd64:
fcntl(): Bad file descriptor
Executing precompile statements... 124/1927
fcntl(): Bad file descriptor
[..]
Executing precompile statements... 682/1927
Killed
gmake[1]: *** [sysimage.mk:89: /usr/home/julia/buildbot/w1_builder/package_freebsd64/build/usr/lib/julia/sys-o.a] Error 1
Executing precompile statements... 683/1927
[..]
Sadly, I can confirm that while this does help a little, it's not very significant. Here are the "commit charge" graphs for the start of the testset for both master and this branch:
[commit-charge graph: master]
[commit-charge graph: this branch]

Isolation tests
With gdb, I have verified that the changes cause OpenBLAS to see only one thread at startup (which is subsequently increased when LinearAlgebra.__init__() sets the number of threads). Looking at the startup commit charge for master versus this branch, I do see a difference:
[screenshot: commit charge at the REPL, PID 5820 (this branch) vs. PID 2052 (master)]
In the above screenshot, PID 5820 is this branch, whereas PID 2052 is master, and both are simply sitting at the REPL after startup with no other options applied. So we should be saving some amount of memory, but apparently it is insignificant compared to the memory that we are using during testing.
I'm confused, that's half of (Commit) memory, and that's useful? Or is "Working Set" or some other column a more useful indicator? [Are both good indicators, and why is the latter (and all the others shown) slightly higher?] Should this be tested on e.g. Linux too?
I'm confused, that's half of (Commit) memory, and that's useful?
Yes, that is useful, but we were hoping that this would reduce the wildly high commit memory during tests, which does not seem to be the case.
Could you spell out why this would help with memory during the tests?
The thought is that OpenBLAS has its own function that tries to auto-guess how many threads to initialize with, and it doesn't agree with ours: we generally divide by two, to avoid over-saturating processors with hyperthreading. While adjusting thread counts is not a problem in itself, we hypothesized that the high commit memory usage during tests might be due to OpenBLAS starting up with (for example) 16 threads, each pre-allocating a bunch of memory for a workspace, and then, when we set OpenBLAS back to 8 threads later, that memory may not actually get freed back to the OS. So we short-circuit that initial allocation and only ever allocate up to the number of threads we're actually interested in.
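To make the mismatch concrete, here is the arithmetic on a hypothetical machine with 16 hardware threads (illustrative numbers, not a measurement):

```julia
Sys.CPU_THREADS         # 16 — roughly what OpenBLAS's own auto-guess starts with
Sys.CPU_THREADS ÷ 2     # 8  — what we later request via BLAS.set_num_threads()
# Under the hypothesis above, the per-thread workspaces for the extra 8 threads are
# allocated during OpenBLAS startup and may never be handed back to the OS.
```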
Just to test: what happens if you run the test suite with OPENBLAS_NUM_THREADS=1? That should confirm whether it is indeed OpenBLAS or something else, I think.
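One way to try that from Julia (a hypothetical invocation; the test selection is arbitrary, the environment variable is the point):

```julia
# Child processes spawned by Base.runtests() inherit the modified environment,
# so OpenBLAS in the test workers should start with a single thread.
withenv("OPENBLAS_NUM_THREADS" => "1") do
    Base.runtests(["LinearAlgebra"])
end
```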
I ran into the issue with too many threads on a server with 32 cores / 64 threads and stack ulimit at 8M and user proc limit at 512.
However it seems to me that this patch does nothing at startup, at least when applied to 1.8.2, as the environment variables are set after libopenblas64_.so has already been loaded. In the output below (unpatched official 1.8.2 binary) you'd expect OPENBLAS_MAIN_FREE to be set, as it is in Julia 1.7, but it's not.
$ bin/julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.8.2 (2022-09-29)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |
julia> ENV["OPENBLAS_MAIN_FREE"]
ERROR: KeyError: key "OPENBLAS_MAIN_FREE" not found
Stacktrace:
[1] (::Base.var"#657#658")(k::String)
@ Base ./env.jl:79
[2] access_env
@ ./env.jl:43 [inlined]
[3] getindex(#unused#::Base.EnvDict, k::String)
@ Base ./env.jl:79
[4] top-level scope
@ REPL[1]:1
If I also put the ENV logic in LinearAlgebra.jl's __init__ like below, it works fine. I don't know enough about Julia internals and haven't tested whether master is different (the __init__ there picks up the OpenBLAS JLL), but for the backports this is important to know for sure.
@@ -566,6 +566,17 @@
     try
         libblas_path = find_library_path(Base.libblas_name)
         liblapack_path = find_library_path(Base.liblapack_name)
+
+        # make sure OpenBLAS does not set CPU affinity (#1070, #9639)
+        if !haskey(ENV, "OPENBLAS_MAIN_FREE")
+            ENV["OPENBLAS_MAIN_FREE"] = "1"
+        end
+
+        # Ensure that OpenBLAS does not automatically start up a bunch of threads,
+        # it will only do so if requested by another environment variable, or when
+        # LinearAlgebra initializes in its `__init__()` method:
+        ENV["OPENBLAS_DEFAULT_NUM_THREADS"] = "1"
+
         # We manually `dlopen()` these libraries here, so that we search with `libjulia-internal`'s
         # `RPATH` and not `libblastrampoline's`. Once it's been opened, when LBT tries to open it,
         # it will find the library already loaded.
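With a patch like the above applied, the check from the unpatched binary earlier in the thread should now succeed in a fresh session (illustrative output, assuming the two ENV assignments above ran in __init__):

```julia
julia> ENV["OPENBLAS_MAIN_FREE"]
"1"

julia> ENV["OPENBLAS_DEFAULT_NUM_THREADS"]
"1"
```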
Can this be merged? Win32 CI is "borken": I do see an error for Download on tester_win32:
nested task error: RequestError: getaddrinfo() thread failed to start while requesting https://httpbingo.julialang.org/delay/2?id=1
[It's annoying how often CI fails (in general) on seemingly unrelated things, so it would be good to know whether I'm right about that, and whether it holds things up needlessly, or whether there are other issues, e.g. this needs review or something.]
Can/should we merge this even if it is not 100% successful (for the original goal)?
that's half of (Commit) memory, and that's useful?
Yes, that is useful, but we were hoping that this would reduce the wildly high commit memory during tests
However it seems to me that this patch does nothing at startup, at least when applied to 1.8.2
I think the "1 failing check" is a false alarm, so I'm wondering: is this at least never worse? I'm interested in this less for reducing memory or the number of threads than for whether it makes Julia start up faster.
Backporting to 1.8.3 could then be done later once it's 100% fixed, or now and then again later?
Bump. Can this be merged? It helps "a little" (and never hurts?), though a 45% reduction in commit memory seems like a lot... I'm not sure whether it is holding up 1.8.3.
"Merge me"? There's no failing check, Win32 CI is known to be "borken".
Win32 CI is known to be "borken".
Is it?
I was quoting (so it's highly unlikely that the "Download" failure is anything but a false alarm, and it only happened on Win32): https://github.com/JuliaLang/www.julialang.org/pull/1771#issuecomment-1298486917
"merge me"? Tier 1 platforms pass.
It is strange to me, though, that musl and PowerPC have test failures while it still says "All checks have passed" (so presumably they are intentionally ignored):
Test Summary: |     Pass  Error  Broken     Total      Time
Overall       | 29824326      5  352698  30177029  69m37.5s
Can this trivial passing PR be merged? I want to test it (also in combination with other PRs, e.g. #46372 for JuliaSyntax, which I'm waiting on to be merged). It's also one of two PRs yet to be merged for 1.8.3 (or maybe the only one left; the other seems problematic), and I don't want 1.10 or 1.8.3 held up.
Alright, we finally managed to do some more testing on this on some large core-count Windows machines, and this helps significantly, so I'm going to merge.
Just for my understanding, for a large core-count machine this would at most save half of the memory compared to before? With this change, we set OpenBLAS to 1 thread, then LinearAlgebra.__init__ runs:
https://github.com/JuliaLang/julia/blob/58b559f4a238faeeac03fbdec181ededd27053bc/stdlib/LinearAlgebra/src/LinearAlgebra.jl#L646
and sets the number of OpenBLAS threads to `Sys.CPU_THREADS ÷ 2` (which is half the threads OpenBLAS itself would have started with before).
If so, even with this, it is probably worth doing something like https://github.com/JuliaLang/julia/pull/46844#issuecomment-1264261084 (setting the number of OpenBLAS threads to 1 for most of the test suite).
Yeah, ref https://github.com/JuliaCI/julia-buildkite/pull/247 for changes specific to CI.
Actually, something seemed fishy to me about how much this helped the internal workload, and I just tried the following on Julia v1.8.2:
julia> using Distributed
addprocs(1)
@everywhere begin
    using LinearAlgebra
    @show LinearAlgebra.BLAS.get_num_threads()
end
LinearAlgebra.BLAS.get_num_threads() = 16
      From worker 2:    LinearAlgebra.BLAS.get_num_threads() = 1
So it appears that Distributed already attempts to set the BLAS threads to 1. Looking into the source, I found:
- https://github.com/JuliaLang/julia/blob/626f3b20ef11fe4aa3ac4576a869cf7dd4e34558/stdlib/Distributed/src/process_messages.jl#L336
- https://github.com/JuliaLang/julia/blob/626f3b20ef11fe4aa3ac4576a869cf7dd4e34558/base/initdefs.jl#L431
- https://github.com/JuliaLang/julia/blob/626f3b20ef11fe4aa3ac4576a869cf7dd4e34558/stdlib/LinearAlgebra/src/LinearAlgebra.jl#L639
So what this means is that we actually already try to restrict OpenBLAS to 1 thread on CI, but because OpenBLAS used to start up and immediately set a number of threads, we would have the problem of initially starting up and consuming a bunch of memory, then never letting go of it.
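Put differently, the sequence of events described above was roughly as follows (a sketch of the reasoning, not measured code):

```julia
using LinearAlgebra

# 1. libopenblas loads; its own heuristic immediately starts roughly Sys.CPU_THREADS
#    worker threads, each pre-allocating its workspace.
# 2. Distributed (on worker processes) or LinearAlgebra later reduces the count:
BLAS.set_num_threads(1)
# 3. ...but the workspaces allocated in step 1 are never returned to the OS, so the
#    commit charge stays high for the life of the process.
# With this PR, step 1 starts a single thread, so there is nothing extra to leak.
```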
So all that being said, I believe that with this PR merged, we've actually solved the root problem. We should watch it closely on CI, but I think we've only ever tried to exercise single-threaded BLAS on CI so far, so we should just continue to do that. :)