julia icon indicating copy to clipboard operation
julia copied to clipboard

Limit initial OpenBLAS thread count

Open staticfloat opened this issue 3 years ago • 3 comments

We set OpenBLAS's initial thread count to 1 to prevent runaway allocation within OpenBLAS's initial thread startup. LinearAlgebra will later call BLAS.set_num_threads() to the actual value we require.

staticfloat avatar Sep 20 '22 16:09 staticfloat

Also GOTO_NUM_THREADS? And I think you can recover the test for this from https://github.com/JuliaLang/julia/pull/42442/files#diff-e299fbf3f1935b1bb9e7ef61e4644dda078ae51fab7559caba8da03817943c9b

vtjnash avatar Sep 20 '22 17:09 vtjnash

I think you can recover the test for this from https://github.com/JuliaLang/julia/pull/42442/files#diff-e299fbf3f1935b1bb9e7ef61e4644dda078ae51fab7559caba8da03817943c9b

I don't see how that is testing this functionality; that's testing that we limit the number of threads to a maximum of 8, but that's not what we do anymore; and in fact, our logic is much closer to what OpenBLAS itself natively does, so it's difficult to test that this is working.

staticfloat avatar Sep 20 '22 17:09 staticfloat

Ah, true. Now that we are later also explicitly changing it, we would need to do that check in __init__, before we set it.

vtjnash avatar Sep 20 '22 18:09 vtjnash

LinearAlgebra.__init__ only calls set_num_threads if !haskey(ENV, "OPENBLAS_NUM_THREADS"); does this interfere with that?

JeffBezanson avatar Sep 21 '22 23:09 JeffBezanson

No; so what happens is, if a thread count is already set, neither this logic nor the LinearAlgebra __init__() logic messes with it.

staticfloat avatar Sep 21 '22 23:09 staticfloat

I think I agree with Jeff that this will break LinearAlgebra.__init__, since it will be set now

vtjnash avatar Sep 22 '22 06:09 vtjnash

Oh, of course, my bad. I totally misunderstood the question, and confused myself.

One possibility is to unset the environment variable after dlopen() has returned, but I'm not crazy about that idea. What do you think, Jameson?

staticfloat avatar Sep 22 '22 17:09 staticfloat

I would perhaps rather just patch openblas to default to 1 thread at startup, though this might harm other apps that use it

vtjnash avatar Sep 22 '22 17:09 vtjnash

Okay, I'll just add a short-circuit to OpenBLAS's default thread calculation then. X-ref: https://github.com/JuliaPackaging/Yggdrasil/pull/5555

staticfloat avatar Sep 22 '22 17:09 staticfloat

I am attempting to upstream this patch here: https://github.com/xianyi/OpenBLAS/pull/3773

staticfloat avatar Sep 22 '22 19:09 staticfloat

I'm exited to see this merged, so I can test/use on master, since its only point is faster Julia startup, right? How faster expected?

So, is the "1 failing" check a false alarm? Does the test need to be rerun?

At the very start for freebsd64:

fcntl(): Bad file descriptor
Executing precompile statements... 124/1927
fcntl(): Bad file descriptor
[..]
Executing precompile statements... 682/1927
Killed
gmake[1]: *** [sysimage.mk:89: /usr/home/julia/buildbot/w1_builder/package_freebsd64/build/usr/lib/julia/sys-o.a] Error 1
Executing precompile statements... 683/1927
[..]

PallHaraldsson avatar Sep 27 '22 08:09 PallHaraldsson

Sadly, I can confirm that while this does help a little, it's not very significant. Here are the "commit charge" graphs for the start of the testset for both master and this branch:

master

master

this branch

branch

Isolation tests

With gdb, I have verified that the changes cause OpenBLAS to see only one thread at startup (which is subsequently increased when LinearAlgebra.__init__() sets the number of threads). Looking at the startup commit charge for master versus this branch, I do see a difference:

image

In the above screenshot, PID 5820 is this branch, whereas PID 2052 is master, and both are simply sitting at the REPL after startup with no other options applied. So we should be saving some amount of memory, but apparently it is insignificant compared to the memory that we are using during testing.

staticfloat avatar Sep 30 '22 16:09 staticfloat

I'm confused, that's half of (Commit) memory, and that's useful? Or is "Working Set" or some other, a more useful indicator? [Both good indicators, and why the latter (and all other showing) slightly higher?] Test on e.g. Linux too?

PallHaraldsson avatar Sep 30 '22 16:09 PallHaraldsson

I'm confused, that's half of (Commit) memory, and that's useful?

Yes, that is useful, but we were hoping that this would reduce the wildly high commit memory during tests, which does not seem to be the case.

staticfloat avatar Sep 30 '22 17:09 staticfloat

Could you spell out why this would this help with memory during the tests?

KristofferC avatar Sep 30 '22 17:09 KristofferC

The thought is that OpenBLAS has its own function that tries to auto-guess how many threads to initialize with, which doesn't agree with ours, where we generally divide by two to avoid over-saturating processors with hyperthreading. While adjusting thread counts is not a problem, we hypothesized that the high commit memory usage during tests might be due to OpenBLAS starting up with (for example) 16 threads, each pre-allocating a bunch of memory for a workspace, and then when we set OpenBLAS back to 8 threads later, the memory may not actually get freed back to the OS. So we short-circuit that initial allocation and only allocate up to the number of threads we're actually interested in.

staticfloat avatar Sep 30 '22 18:09 staticfloat

Just to test, what happens if you run the test suite with OPENBLAS_NUM_THREADS=1? That should confirm if it is indeed OpenBLAS or something else I think.

KristofferC avatar Oct 01 '22 06:10 KristofferC

I ran into the issue with too many threads on a server with 32 cores / 64 threads and stack ulimit at 8M and user proc limit at 512.

However it seems to me that this patch does nothing at startup, at least when applied to 1.8.2, as the environment variables are set after libopenblas64_.so is loaded already. In the below (unpatched 1.8.2 official binary) you'd expect OPENBLAS_MAIN_FREE to be set, as it is in Julia 1.7, but it's not.

$ bin/julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.8.2 (2022-09-29)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> ENV["OPENBLAS_MAIN_FREE"]
ERROR: KeyError: key "OPENBLAS_MAIN_FREE" not found
Stacktrace:
 [1] (::Base.var"#657#658")(k::String)
   @ Base ./env.jl:79
 [2] access_env
   @ ./env.jl:43 [inlined]
 [3] getindex(#unused#::Base.EnvDict, k::String)
   @ Base ./env.jl:79
 [4] top-level scope
   @ REPL[1]:1

If I also put the ENV logic in LinearAlgebra.jl's init like below it works fine. I don't know enough about Julia internals and haven't tested if master is different, and the init there goes picks up the OpenBLAS jll, but for the backports this is important to know for sure.

@@ -566,6 +566,17 @@
     try
         libblas_path = find_library_path(Base.libblas_name)
         liblapack_path = find_library_path(Base.liblapack_name)
+
+        # make sure OpenBLAS does not set CPU affinity (#1070, #9639)
+        if !haskey(ENV, "OPENBLAS_MAIN_FREE")
+            ENV["OPENBLAS_MAIN_FREE"] = "1"
+        end
+
+        # Ensure that OpenBLAS does not automatically start up a bunch of threads,
+        # it will only do so if requested by another environment variable, or when
+        # LinearAlgebra initializes in its `__init__()` method:
+        ENV["OPENBLAS_DEFAULT_NUM_THREADS"] = "1"
+
         # We manually `dlopen()` these libraries here, so that we search with `libjulia-internal`'s
         # `RPATH` and not `libblastrampoline's`.  Once it's been opened, when LBT tries to open it,
         # it will find the library already loaded.

bartoldeman avatar Oct 04 '22 17:10 bartoldeman

Can this be merged? Win32 CI is "borken": I do see an error for Download, on tester_win32

nested task error: RequestError: getaddrinfo() thread failed to start while requesting https://httpbingo.julialang.org/delay/2?id=1

[It's annoying how often (in general) CI fails, seemingly unrelated, so would be good to know if I'm right on that, and if it holds stuff up needlessly, or if there are other issue, needs review or something.]

PallHaraldsson avatar Oct 13 '22 17:10 PallHaraldsson

Can/should we merged this even if not 100% successful (for original goal)?

that's half of (Commit) memory, and that's useful?

Yes, that is useful, but we were hoping that this would reduce the wildly high commit memory during tests

However it seems to me that this patch does nothing at startup, at least when applied to 1.8.2

I think the "1 failing check" is a false alarm, so I'm thinking, is this at least never worse? I' interested in this, less to reduce memory/nr. of threads, but if it starts up Julia faster.

Back-porting to 1.8.3 could then be done later when 100% fixed, or now, then again later?

PallHaraldsson avatar Oct 18 '22 13:10 PallHaraldsson

Bump. Can this be merged, it helps "a little" (never hurting?), though 45% reduction in commit mem., seems a lot... I'm not sure it holds up 1.8.3.

PallHaraldsson avatar Oct 27 '22 13:10 PallHaraldsson

"Merge me"? There's no failing check, Win32 CI is known to be "borken".

PallHaraldsson avatar Nov 02 '22 16:11 PallHaraldsson

Win32 CI is known to be "borken".

Is it?

giordano avatar Nov 02 '22 16:11 giordano

I was quoting (so highly unlikely "Download" failure isn't a a false alarm, also only on Win32): https://github.com/JuliaLang/www.julialang.org/pull/1771#issuecomment-1298486917

PallHaraldsson avatar Nov 02 '22 16:11 PallHaraldsson

"merge me"? Tier 1 platforms pass.

It's though strange (to me) that musl and PowerPC have test failures, still "All checks have passed" (so likely intentional to ignore them):

Test Summary:                                     |     Pass  Error  Broken     Total      Time
  Overall                                         | 29824326   5 352698 30177029 69m37.5s

PallHaraldsson avatar Nov 03 '22 15:11 PallHaraldsson

Can this trivial passing PR be merged? I want to test it (also in combination with other PRs, i.e. for JuliaSyntax #46372, I'm waiting to be merged). It's also 1 of 2 PRs yet to be merged for 1.8.3 (or maybe the only one left, the other seems problematic), and I don't want 1.10 or 1.8.3 held up.

PallHaraldsson avatar Nov 04 '22 12:11 PallHaraldsson

Alright, we finally managed to do some more testing on this on some large core-count Windows machines, and this helps significantly, so I'm going to merge.

staticfloat avatar Nov 15 '22 17:11 staticfloat

Just for my understanding, for a large core-count machine this would at most save half of the memory compared to before? With this change, we set OpenBLAS to 1 thread, then LinearAlgebra.__init__ runs:

https://github.com/JuliaLang/julia/blob/58b559f4a238faeeac03fbdec181ededd27053bc/stdlib/LinearAlgebra/src/LinearAlgebra.jl#L646

and set the number of OpenBLAS threads to Sys.CPU_THREADS ÷ 2 (which is half the threads OpenBLAS would set earlier`)

If so, even with this, it is probably worth doing something like https://github.com/JuliaLang/julia/pull/46844?notification_referrer_id=NT_kwDOABOSg7I0NDM5Nzc3MjQ1OjEyODI2OTE&notifications_query=is%3Aunread#issuecomment-1264261084 (setting the number of openblas threads to 1 for most of the test suite).

KristofferC avatar Nov 15 '22 17:11 KristofferC

Yeah, ref https://github.com/JuliaCI/julia-buildkite/pull/247 for changes specific to CI.

DilumAluthge avatar Nov 15 '22 17:11 DilumAluthge

Actually, something seemed fishy to me about how much this helped the internal workload, and I just tried the following on Julia v1.8.2:

julia> using Distributed
       addprocs(1)
       @everywhere begin
           using LinearAlgebra
           @show LinearAlgebra.BLAS.get_num_threads()
       end

LinearAlgebra.BLAS.get_num_threads() = 16
      From worker 2:    LinearAlgebra.BLAS.get_num_threads() = 1

So it appears that Distributed already attempts to set the BLAS threads to 1. Looking into the source, I found:

  • https://github.com/JuliaLang/julia/blob/626f3b20ef11fe4aa3ac4576a869cf7dd4e34558/stdlib/Distributed/src/process_messages.jl#L336
  • https://github.com/JuliaLang/julia/blob/626f3b20ef11fe4aa3ac4576a869cf7dd4e34558/base/initdefs.jl#L431
  • https://github.com/JuliaLang/julia/blob/626f3b20ef11fe4aa3ac4576a869cf7dd4e34558/stdlib/LinearAlgebra/src/LinearAlgebra.jl#L639

So what this means is that we actually already try to restrict OpenBLAS to 1 thread on CI, but because OpenBLAS used to start up and immediately set a number of threads, we would have the problem of initially starting up and consuming a bunch of memory, then never letting go of it.

So all that being said, I believe that with this PR merged, we've actually solved the root problem. We should watch it closely on CI, but I think we've only ever tried to exercise single-threaded BLAS on CI so far, so we should just continue to do that. :)

staticfloat avatar Nov 15 '22 18:11 staticfloat