OpenBLAS
Want fallback on thread creation failure
It seems that OpenBLAS just dies when pthread_create fails. It should instead continue execution with the threads already created, or at least fall back to single-threaded mode. pthread_create often fails on a many-core machine if many instances of an application are launched in parallel.
Thanks for the suggestion. I will implement this feature.
Any news on this feature?
@groutr, I haven't implemented it yet.
I merged the patch, which raises a signal when pthread_create fails.
70642fe4ed4ffd74d305cd5c76cd6425dba4bbd1
Is it enough for this feature request?
I'm afraid the patch is making the situation worse.
When raise(SIGINT) is called, a signal handler is called back. The signal handler returns, and then raise(SIGINT) returns 0. The for-loop in which raise(SIGINT) was called continues creating threads, assuming, without any check, that the signal handler has somehow dealt with the stillborn thread.
Since this is a BLAS library, it is often embedded in an application that has no particular preference for OpenBLAS, and the application may have installed a signal handler that is unrelated to OpenBLAS, perhaps for its own purposes. What if such a handler is invoked by raise(SIGINT)? OpenBLAS will resume execution with some worker threads left dead.
Application authors therefore must make sure that they have installed an appropriate signal handler, or that no signal handler has been installed, before calling BLAS functions.
Furthermore, even if application authors are aware that they must write a signal handler, there is nothing useful they can do in it. For instance, a signal handler cannot access blas_threads[i], in which a handle to the stillborn thread has been stored, since 'blas_threads' is a static global variable and 'i' is an auto variable. The signal handler can do nothing but call exit() because, if the handler returns, OpenBLAS will misbehave. (The handler cannot perform longjmp() either; the behavior is undefined.)
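To make the control flow described in this comment concrete, here is a minimal sketch. It is not the OpenBLAS code: `app_sigint_handler` and `simulate_creation_loop` are hypothetical names, and every thread creation is simply assumed to have failed. It only shows that once an unrelated handler returns, raise(SIGINT) returns 0 and the loop carries on with nothing repaired.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Counts handler invocations; sig_atomic_t is safe to touch in a handler. */
static volatile sig_atomic_t handler_calls = 0;

/* Hypothetical handler installed by the host application for its own
   purposes. It knows nothing about the BLAS library's worker threads. */
static void app_sigint_handler(int signo)
{
    (void)signo;
    handler_calls++;
}

/* Hypothetical model of the creation loop: every "creation" is assumed
   to fail and is reported with raise(SIGINT). Returns the number of
   slots that were left dead while the loop kept going, or -1 if the
   handler could not be installed. */
static int simulate_creation_loop(int wanted)
{
    struct sigaction sa;
    sa.sa_handler = app_sigint_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGINT, &sa, NULL) != 0)
        return -1;

    int dead_slots = 0;
    for (int i = 0; i < wanted; i++) {
        /* Pretend pthread_create() failed for slot i. */
        if (raise(SIGINT) == 0) {
            /* The handler ran and returned; slot i is still unusable,
               yet nothing stops the loop from moving on. */
            dead_slots++;
        }
    }
    return dead_slots;
}
```

Calling `simulate_creation_loop(4)` runs all four iterations: the handler is invoked each time, and the loop has no way of learning whether anything was fixed.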
Revisiting this (and associated PR #668), wouldn't a better error behaviour be to
- exit the thread creation loop
- set blas_num_threads to the number successfully created up to that point
- write a message to stderr
- continue running with what we have?
A caller could probably still query the number of threads actually created and raise a signal if desired, or call goto_set_num_threads() to retry the creation of any "missing" threads. Additionally, it seems #668 made no attempt to handle the other potential source of pthread_create() failures, in goto_set_num_threads().
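The behaviour proposed above could look roughly like the following sketch. It is not the OpenBLAS implementation: `workers`, `num_workers_created`, `start_workers`, `stop_workers` and `flaky_create` are hypothetical names, and the creation function is passed in as a parameter only so that a failure can be simulated.

```c
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define MAX_WORKERS 64

/* Hypothetical stand-ins for the library's thread table and count. */
static pthread_t workers[MAX_WORKERS];
static int num_workers_created = 0;

typedef int (*create_fn)(pthread_t *, const pthread_attr_t *,
                         void *(*)(void *), void *);

static void *worker_main(void *arg) { return arg; }

/* Try to create `wanted` workers. On the first failure: leave the loop,
   report on stderr, and keep the workers created so far. */
static int start_workers(int wanted, create_fn create)
{
    if (wanted > MAX_WORKERS)
        wanted = MAX_WORKERS;

    int created = 0;
    for (int i = 0; i < wanted; i++) {
        int ret = create(&workers[i], NULL, worker_main, NULL);
        if (ret != 0) {
            fprintf(stderr,
                    "thread %d of %d could not be created (%s); "
                    "continuing with %d\n",
                    i + 1, wanted, strerror(ret), created);
            break;
        }
        created++;
    }
    num_workers_created = created;
    return created;
}

/* Join only the workers that were actually created. */
static void stop_workers(void)
{
    for (int i = 0; i < num_workers_created; i++)
        pthread_join(workers[i], NULL);
    num_workers_created = 0;
}

/* Illustration only: succeeds twice, then fails with EAGAIN, which is
   the error pthread_create() reports when resources run out. */
static int flaky_budget = 2;
static int flaky_create(pthread_t *t, const pthread_attr_t *a,
                        void *(*fn)(void *), void *arg)
{
    if (flaky_budget <= 0)
        return EAGAIN;
    flaky_budget--;
    return pthread_create(t, a, fn, arg);
}
```

In real use the second argument would be `pthread_create` itself. With `flaky_create`, a request for four workers ends up with two, and the caller can read the count back and decide whether to retry or accept it.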
Is there any workaround for the moment?
Depends on your use case. Trivially, you could build OpenBLAS single-threaded. Otherwise, reverting the patch and removing the "exit" statement from the previous code would perhaps be a start, and/or checking in your own code whether thread creation can be expected to work before calling OpenBLAS. Does what I sketched out in my earlier message two weeks ago sound plausible?
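As one possible reading of "checking in your own code if thread creation can be expected to work", an application could probe how many threads it can start before its first BLAS call. This is a sketch only: `probe_thread_capacity` is a hypothetical helper, not part of OpenBLAS, and its result is merely a hint, since another process may use up the remaining capacity right after the probe finishes.

```c
#include <pthread.h>
#include <stdlib.h>

/* Probe threads wait on this mutex so that they all exist at the same
   time while the probe is counting. */
static pthread_mutex_t gate = PTHREAD_MUTEX_INITIALIZER;

static void *probe_main(void *arg)
{
    pthread_mutex_lock(&gate);
    pthread_mutex_unlock(&gate);
    return arg;
}

/* Hypothetical helper: returns how many of `wanted` threads could be
   created simultaneously (0 if none, or if memory is unavailable). */
static int probe_thread_capacity(int wanted)
{
    if (wanted <= 0)
        return 0;

    pthread_t *probes = malloc((size_t)wanted * sizeof *probes);
    if (probes == NULL)
        return 0;

    int created = 0;
    pthread_mutex_lock(&gate);          /* hold the probes at the gate */
    while (created < wanted &&
           pthread_create(&probes[created], NULL, probe_main, NULL) == 0)
        created++;
    pthread_mutex_unlock(&gate);        /* release them */

    for (int i = 0; i < created; i++)
        pthread_join(probes[i], NULL);
    free(probes);
    return created;
}
```

If the probe returns fewer threads than wanted, the application could pass the smaller number to openblas_set_num_threads() before doing any BLAS work, or run single-threaded.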