OpenBLAS
Stack overflow segfault in getrf_parallel using Breeze / netlib
Hi all,
This issue is very similar to https://github.com/xianyi/OpenBLAS/issues/1082.
We are still running into this issue as seen in https://github.com/luhenry/netlib/issues/2. This is with libopenblas-dev 0.3.5+ds-3 on Debian 10.
From looking at the code, wouldn't it be possible to adopt the same approach as potrf_parallel, which, instead of declaring the job[] array on the stack of the function that is called recursively, declares it in a thread_driver function that isn't called recursively?
The overall problem isn't so much that it allocates 500 KB of memory on the stack (even if, arguably, 500 KB is already a lot for the stack), but that it does so at every level of the recursion, quickly reaching 1 MB and more, even though the first few levels of recursion don't need the allocation at all.
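To make the shape of the problem concrete, here is a rough, self-contained sketch of the pattern I mean (all names and sizes are made up for illustration; this is not the actual OpenBLAS code): the large job array lives on the stack of the function that calls itself, so every level of recursion pays the full cost again.

```c
#include <stdio.h>

#define MAX_CPUS 64

/* Hypothetical per-thread bookkeeping, sized so the whole array is on the
 * order of 500 KB, roughly like the job[] in getrf_parallel. */
typedef struct { char working[8192]; } job_t;

static void recursive_factor(int depth) {
  job_t job[MAX_CPUS];               /* ~512 KB on the stack at EVERY level */
  job[0].working[0] = (char)depth;   /* touch it so it is not optimized away */

  if (depth < 4)
    recursive_factor(depth + 1);     /* total stack use grows with depth: 1 MB, 1.5 MB, ... */

  printf("returning from depth %d\n", depth);
}

int main(void) {
  recursive_factor(0);
  return 0;
}
```

With the default 8 MB stack of a main Linux thread this runs fine, but on a thread with a small stack (like the JVM's default) it segfaults exactly the way we are seeing.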
Thank you very much, and very happy to provide any other information.
Seems you are calling OpenBLAS from Java, so one option would be to increase the very low default for its thread stack size (-Xss). Or you could build your OpenBLAS with a custom GETRF_MEM_ALLOC_THRESHOLD (probably 32, as used on Windows, which IIRC has similarly restrictive defaults).
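For context, the threshold is a compile-time switch; very roughly (a simplified sketch with placeholder types and sizes, not the literal source), building with a lower threshold makes the heap path kick in whenever the library is configured for more threads than that:

```c
#include <stdlib.h>

#define MAX_CPU_NUMBER 16                  /* set at build time */
#ifndef GETRF_MEM_ALLOC_THRESHOLD
#define GETRF_MEM_ALLOC_THRESHOLD 80       /* could be overridden at build time, e.g. -DGETRF_MEM_ALLOC_THRESHOLD=32 */
#endif

typedef struct { char working[8192]; } job_t;   /* stand-in for the real job_t */

/* When built for more threads than the threshold allows, the job array
 * is taken from the heap instead of the stack of the recursive routine. */
#if MAX_CPU_NUMBER > GETRF_MEM_ALLOC_THRESHOLD
#define USE_ALLOC_HEAP
#endif

static int factor_step(void) {
#ifdef USE_ALLOC_HEAP
  job_t *job = malloc(MAX_CPU_NUMBER * sizeof(job_t));
  if (job == NULL) return -1;
#else
  job_t job[MAX_CPU_NUMBER];               /* stack allocation: cheap, but risky on small stacks */
#endif

  job[0].working[0] = 0;                   /* ... do the actual work ... */

#ifdef USE_ALLOC_HEAP
  free(job);
#endif
  return 0;
}

int main(void) { return factor_step(); }
```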
You were concerned about the legalese here: https://github.com/luhenry/netlib/issues/2#issuecomment-840856190
If you put the LICENSE file alongside the binary (or in the documentation directory, or embed it in the Windows file properties; your aesthetic standard prevails), you are well set; you can even sell it in that form if you like.
Since your users will at times come here, please keep a visible build recipe for the redistributed binary; we always try to help them (this is not required, unlike the above).
@martin-frbg what you're proposing are very valid workarounds. However, do you think it could also be possible to change OpenBLAS so that it doesn't allocate unnecessarily on the stack? Here, I'm not talking about changing the conditions for when to allocate on the stack or on the heap, but about allocating the job[] only when necessary and not at every level of the recursive call to getrf_parallel. Something similar is already done in potrf_parallel, and from a cursory glance it seems a similar approach could be applied to getrf_parallel. What do you think?
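To make that concrete, here is a minimal sketch of the shape I have in mind (hypothetical names, not a patch against the real code): the non-recursive driver owns the job array, and the recursive worker only borrows a pointer to it, so the cost is paid exactly once regardless of recursion depth.

```c
#include <stdlib.h>

#define MAX_CPUS 64

typedef struct { char working[8192]; } job_t;   /* placeholder for the real job_t */

/* Recursive worker: borrows the job array instead of owning it, so its
 * stack frame stays small no matter how deep the recursion goes. */
static void factor_recursive(job_t *job, int depth) {
  job[0].working[0] = (char)depth;   /* ... partition, factor, update ... */
  if (depth < 4)
    factor_recursive(job, depth + 1);
}

/* Non-recursive driver: allocate (or declare) the job array once,
 * then hand it down to the recursion. */
static int factor_driver(void) {
  job_t *job = malloc(MAX_CPUS * sizeof(job_t));
  if (job == NULL) return -1;
  factor_recursive(job, 0);
  free(job);
  return 0;
}

int main(void) { return factor_driver(); }
```

As far as I can tell, that is essentially the relationship between the thread_driver function and the recursive part in potrf_parallel.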
@brada4 that's great information, thank you! If I understand correctly, it is something that Numpy is already doing, correct? Do you know where I could find a list of the run-time dependencies of OpenBLAS?
I have not had time to look at this in detail, but in general I am a bit wary of the old GotoBLAS2 codes for which no meaningful version history exists. The structure of getrf_parallel is a bit different from potrf_parallel, despite both having been devised by the same person
Runtime dependencies
Linux: libc (pthreads, plus OpenMP if built that way) + gfortran (if LAPACK is used); see ldd and lddtree. All of these are at most LGPL and omnipresent.
Windows: msvcrt (Windows threads are built in); LAPACK pulls in gfortran + GOMP. So the dependencies are either none or LGPL; if someone managed to use MS OpenMP, that would be one pain less. "Dependency Walker" is your friend, and x64dbg or procexp shows the "light" dependencies actually loaded. Watch out for DLL hell: the same library name loaded from distinct places.
CLANG/FLANG: their respective runtimes, with license terms lighter than LGPL.
Moved the topic of a potential rewrite of getrf_parallel to GitHub Projects.