Stack overflow when compiled with 4096 NUM_THREADS

Open ViralBShah opened this issue 2 years ago • 7 comments

We decided to build OpenBLAS with a very high NUM_THREADS: 4096. That leads to a stack overflow in getrf_parallel. I believe this setting should have led to ALLOC_HEAP being used, but perhaps 4096 is still too large for all the other buffers that get stack allocated.

https://github.com/JuliaLang/julia/issues/42591

ViralBShah avatar Oct 11 '21 20:10 ViralBShah

4096?? Why would you even want to do that? Do you have (or know of) any hardware with that many cores? Presumably you would need to raise the default ulimit your operating system puts on stack size as well.
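
For reference, the stack size limit mentioned here can be inspected from the shell. This is a minimal sketch, assuming a POSIX-like shell where ulimit -s reports the soft stack limit; the variable name stack_limit is only illustrative, and the raise is left commented out because whether it succeeds depends on the hard limit configured on the system.

```shell
# Read the current soft stack size limit (in KiB, or "unlimited").
stack_limit=$(ulimit -s)
echo "current stack limit: ${stack_limit}"

# Raising the limit affects only this shell and its children,
# and may fail if the hard limit is lower than the requested value.
# ulimit -s unlimited
```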

martin-frbg avatar Oct 12 '21 07:10 martin-frbg

We used to keep it conservative and had to keep bumping it up. So we decided to just make it really big and not have to worry about it.

A really nice feature would be the ability to always heap-allocate based on the number of threads OpenBLAS was started with, so that this is not a compile-time decision (for the default Julia binaries we make available for download).

ViralBShah avatar Oct 12 '21 13:10 ViralBShah

I can look into allocating the queue array on the heap as well when ALLOC_HEAP is set, but just choosing an unrealistically high number of threads and hoping to get away with it without increasing the stack limits imposed by the shell is asking for trouble IMHO. BTW, 0.3.17 already added a fallback function that allocates additional space for another 512 threads on the heap when it runs out of the compile-time NUM_THREADS, so hopefully it should no longer be necessary to build with a scary NUM_THREADS for distribution.

martin-frbg avatar Oct 12 '21 14:10 martin-frbg

Oh, that's interesting, and I suppose it would solve our problem. So what value of NUM_THREADS would you recommend building with? It would be great to allocate more on the heap, especially when ALLOC_HEAP is set. Also, is it possible to use arg->nthreads and allocate for the number of threads OpenBLAS is actually using, instead of for MAX_CPU_NUMBER (which IIUC will be NUM_THREADS)?

@staticfloat Please note this point.

ViralBShah avatar Oct 12 '21 14:10 ViralBShah

I have no idea of the range of hardware your software encounters, but I'd be surprised if you needed more than 512. Allocating based on the current arg->nthreads is a bit tricky, as some things probably have to be in place already at the point where that number is known. (Remember that at the core of this is mostly 15-20 year old code with little original documentation and a poorly documented history for its early life as GotoBLAS.) Also, I do not currently have access to hardware with really large numbers of cores.

martin-frbg avatar Oct 12 '21 15:10 martin-frbg

I came to the same conclusion - that for now and the foreseeable future 512 is more than sufficient.

ViralBShah avatar Oct 12 '21 15:10 ViralBShah

We have now come down to 512 threads by default, and that is working reliably in most cases. However, there are still some challenges here: when starting a distributed job (multiple Julia processes) on the same node, each process tries to initialize OpenBLAS with the maximum number of threads, and the OS hits some kind of thread limit. For example, on aarch64, we get this error:

OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max

On distributed jobs, we usually set the number of OpenBLAS threads to 1, but that happens after all the OpenBLAS library initialization and buffer allocation, which is perhaps a bit too late in the process.
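
One way to apply the single-thread setting earlier is through the environment, before any process loads the library. This is a minimal sketch, assuming OpenBLAS picks up the OPENBLAS_NUM_THREADS environment variable during initialization; the launch command is a hypothetical placeholder, and nothing in this thread confirms whether this also avoids the up-front buffer allocation.

```shell
# Cap the OpenBLAS thread count before any worker process loads the library.
export OPENBLAS_NUM_THREADS=1

# Hypothetical launch command; replace with the actual distributed job.
# julia -p 4 my_script.jl

echo "OPENBLAS_NUM_THREADS=${OPENBLAS_NUM_THREADS}"
```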

ViralBShah avatar Nov 01 '21 13:11 ViralBShah

It should be possible to use a smaller default now and have OpenBLAS allocate auxiliary buffer structures in case of overflow. (Maybe the number of buffers to add could be made into a build-time parameter; right now it is fixed at 512.)

martin-frbg avatar Jan 14 '24 22:01 martin-frbg

Does OpenBLAS allocate these auxiliary buffers automatically? If so, can we move down from something like 512 to 16 threads? Or even fewer?

ViralBShah avatar Jan 24 '24 19:01 ViralBShah

Yes, it does, but the number of auxiliary buffers is currently fixed at 512, though it should be trivial to make that configurable at build time. Also, there is just one round of expansion.

martin-frbg avatar Jan 24 '24 20:01 martin-frbg

@martin-frbg Thanks for your comment. Is that different from setting NUM_THREADS, or does NUM_THREADS still represent the maximum number of threads OpenBLAS can use?

It would be nice to reduce the default number of auxiliary buffers if they allocate significant memory, or at least to make it configurable.

ViralBShah avatar Jan 24 '24 20:01 ViralBShah

@martin-frbg based on quick experiments with this configuration:

make DYNAMIC_ARCH=1 LIBPREFIX=libopenblas64_ INTERFACE64=1 SYMBOLSUFFIX=64_ NUM_THREADS=16 -j36

the Makefile variable NUM_THREADS appears to still set the maximum number of threads one can possibly have:

julia> using OpenBLAS_jll, LinearAlgebra

julia> strip(unsafe_string(ccall((BLAS.@blasfunc(openblas_get_config), libopenblas), Ptr{UInt8}, () )))
"OpenBLAS 0.3.26.dev  USE64BITINT DYNAMIC_ARCH NO_AFFINITY neoversev1 MAX_THREADS=16"

julia> BLAS.get_num_threads()
16

julia> BLAS.set_num_threads(72)

julia> BLAS.get_num_threads()
16

julia> BLAS.set_num_threads(8)

julia> BLAS.get_num_threads()
8

Even though this system has 72 threads, set_num_threads refuses to set a number of threads larger than NUM_THREADS, which was 16 at compile time. Is that accurate? @ViralBShah If so, I don't think we want to reduce NUM_THREADS in our builds?

giordano avatar Jan 26 '24 16:01 giordano