OpenBLAS icon indicating copy to clipboard operation
OpenBLAS copied to clipboard

Stack overflow segfault in getrf_parallel using Breeze / netlib

Open luhenry opened this issue 4 years ago • 5 comments

Hi all,

This issue is very similar to https://github.com/xianyi/OpenBLAS/issues/1082.

We are still running into this issue as seen in https://github.com/luhenry/netlib/issues/2. This is with libopenblas-dev 0.3.5+ds-3 on Debian 10.

From looking at the code, wouldn't it be possible to adopt the same approach as potrf_parallel which, instead of declaring this job[] on the stack in the function that is called recursively, declares the job[] in a thread_driver function that isn't called recursively?

The overall problem isn't so much that it allocates 500 Kbytes of memory on the stack (even if arguably, 500 Kbytes is a lot for the stack), but more that it does so recursively, quickly reaching sizes of 1 Mbytes+ while it doesn't even need it in the first few levels of recursion.

Thank you very much, and very happy to provide any other information.

luhenry avatar May 13 '21 21:05 luhenry

Seems you are calling OpenBLAS from Java, so one option would be to increase the very low default for its thread stack size (-Xss). Or you could build your OpenBLAS with a custom GETRF_MEM_ALLOC_THRESHOLD (probably 32 as used on Windows which IIRC has similarly restrictive defaults)

martin-frbg avatar May 14 '21 06:05 martin-frbg

You were concerned regarding legalese here: https://github.com/luhenry/netlib/issues/2#issuecomment-840856190

If you put LICENSE file along the binary (or in documentation directory, or embed in windows properies, your aesthetic standard prevails) you are well set, you can even sell it in that form if you like.

Since your users will at times come here, please keep visible build recipe for redist binary, we always try to help them (this is not required as above)

brada4 avatar May 14 '21 09:05 brada4

@martin-frbg what you're proposing are very valid workarounds. However, do you think it could be also possible to change OpenBLAS so that it doesn't allocate unnecessarily on the stack? Here, I'm not talking about changing the conditions on when to allocate on the stack or on the heap, but instead to allocate the jobs[] only when necessary and not on every level of the recursive call to getrf_parallel. Something similar is already done in potrf_parallel, and from a cursory glance it seems a similar approach could be applied to getrf_parallel. What do you think?

@brada4 that's great information, thank you! If I understand correctly, it is something that Numpy is already doing, correct? Do you know where I could find a list of the run-time dependencies of OpenBLAS?

luhenry avatar May 14 '21 20:05 luhenry

I have not had time to look at this in detail, but in general I am a bit wary of the old GotoBLAS2 codes for which no meaningful version history exists. The structure of getrf_parallel is a bit different from potrf_parallel, despite both having been devised by the same person

martin-frbg avatar May 14 '21 21:05 martin-frbg

Runtime dependencies Linux: libc (pthreads, OpenMP if compiled so) + gfortran (if lapack is used) - see ldd and lddtree, all lgpl at most, omnipresent. Windows: msvcrt, windows threads are included, LAPACK pulls gfortran + GOMP - i.e none, then LGPL, if someone could manage MS OpenMP, then one pain less, "Dependency Walker" is your friend, x64dbg or procexp shows "light" dependencies actually loaded, watch for DLL HELL - same lib name from distince places. CLANG/FLANG - their respective runtimes, lighter than LGPL

brada4 avatar May 15 '21 06:05 brada4

moved the topic of a potential rewrite of getrf_parallel to github projects

martin-frbg avatar Aug 14 '22 15:08 martin-frbg