OpenBLAS

Want fallback on thread creation failure

Open markati opened this issue 10 years ago • 8 comments
It seems that OpenBLAS just dies when pthread_create fails. It should instead continue execution with the threads already created, or at least fall back to single-threaded mode. pthread_create often fails on a many-core machine if an application is launched in parallel.

markati avatar Mar 28 '15 08:03 markati

Thanks for the suggestion. I will implement this feature.

xianyi avatar Mar 29 '15 15:03 xianyi

Any news on this feature?

groutr avatar Sep 17 '15 22:09 groutr

@groutr, I haven't implemented it yet.

xianyi avatar Oct 05 '15 19:10 xianyi

I merged the patch, which raises a signal when pthread_create fails.

70642fe4ed4ffd74d305cd5c76cd6425dba4bbd1

Is it enough for this feature request?

xianyi avatar Oct 27 '15 00:10 xianyi

I'm afraid the patch is making the situation worse.

When raise(SIGINT) is called, the installed signal handler is invoked. The signal handler returns, and then raise(SIGINT) returns 0. The for-loop in which raise(SIGINT) was called continues creating threads, assuming, without any check, that the stillborn thread has somehow been dealt with by the signal handler.
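A minimal sketch of that control flow, assuming a plain C program rather than OpenBLAS itself. All names here (`app_sigint_handler`, `threads_assumed_created`, `handler_call_count`) are hypothetical, and the failing pthread_create is only simulated by calling raise(SIGINT) on every iteration.

```c
#include <assert.h>
#include <signal.h>

/* Hypothetical names throughout; this is not OpenBLAS code. */
static volatile sig_atomic_t handler_calls = 0;

/* Stands in for an application handler that knows nothing about OpenBLAS. */
static void app_sigint_handler(int signo) {
    /* Re-arm first: some platforms reset the disposition to SIG_DFL. */
    signal(signo, app_sigint_handler);
    handler_calls++;
}

/*
 * Simulates a creation loop in which every pthread_create() fails and the
 * failure is reported with raise(SIGINT). Returns the number of workers the
 * loop went on to treat as created, or -1 if the handler could not be set.
 */
int threads_assumed_created(int nthreads) {
    void (*previous)(int);
    int assumed = 0;

    assert(nthreads >= 0);
    previous = signal(SIGINT, app_sigint_handler);
    if (previous == SIG_ERR) {
        return -1;
    }
    handler_calls = 0;

    for (int i = 0; i < nthreads; i++) {
        /* Pretend pthread_create() just failed for worker i. */
        if (raise(SIGINT) == 0) {
            /* The handler returned, so control comes straight back here
               and the loop carries on as if nothing had happened. */
            assumed++;
        }
    }

    signal(SIGINT, previous);   /* restore the caller's disposition */
    return assumed;
}

/* Number of times the handler ran during the last simulation. */
int handler_call_count(void) {
    return (int)handler_calls;
}
```

Calling `threads_assumed_created(4)` in a process where SIGINT is not blocked should run the handler once per simulated failure while the loop still counts every worker, which is the situation described above: the handler has no way to tell the loop to stop.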

Since this is a BLAS library, it is often embedded in an application that has no specific preference for OpenBLAS, and the application may have installed a signal handler that is unrelated to OpenBLAS, perhaps for the application's own purposes. What if such a handler is invoked by raise(SIGINT)? OpenBLAS will resume execution with some working threads left dead.

Application authors therefore must make sure that they have installed an appropriate signal handler, or that no signal handler has been installed, before calling BLAS functions.

Furthermore, even if application authors are aware that they must write a signal handler, there is nothing useful they can do in it. For instance, a signal handler cannot access blas_threads[i], in which a handle to the stillborn thread has been stored, since 'blas_threads' is a static global variable and 'i' is an auto variable. The signal handler can do nothing but call exit() because, if the handler returns, OpenBLAS will behave insanely. (The handler cannot perform longjmp() either; the behavior is undefined.)

markati avatar Nov 06 '15 11:11 markati

Revisiting this (and associated PR #668), wouldn't a better error behaviour be to

  • exit the thread creation loop
  • set blas_num_threads to the number successfully created up to that point
  • write a message to stderr
  • continue running with what we have?

A caller could probably still query the number of threads actually created and raise a signal if desired, or call goto_set_num_threads() to retry the creation of any "missing" threads. Additionally, it seems #668 made no attempt to handle the other potential source of pthread_create() failures, in goto_set_num_threads().
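The four steps listed above could look roughly like the following sketch. It is not the OpenBLAS implementation: `workers`, `num_workers`, `start_workers`, `stop_workers` and `create_failing_after_two` are hypothetical names, and the creation function is passed in as a parameter only so that a failure can be simulated.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define MAX_WORKERS 64

/* Hypothetical stand-ins for the library's thread table and thread count. */
static pthread_t workers[MAX_WORKERS];
static int num_workers = 0;

static void *worker_main(void *arg) {
    (void)arg;
    return NULL;   /* a real worker would wait for jobs here */
}

/* Same parameter list as pthread_create, so a failing stub can be injected. */
typedef int (*create_fn)(pthread_t *, const pthread_attr_t *,
                         void *(*)(void *), void *);

/* Creates up to `requested` workers; stops at the first failure, reports it
   on stderr, and keeps whatever was created. Returns the resulting count. */
int start_workers(int requested, create_fn create) {
    assert(requested >= 0 && requested <= MAX_WORKERS);
    num_workers = 0;
    for (int i = 0; i < requested; i++) {
        int rc = create(&workers[i], NULL, worker_main, NULL);
        if (rc != 0) {
            fprintf(stderr,
                    "thread %d of %d could not be created (%s); "
                    "continuing with %d\n",
                    i + 1, requested, strerror(rc), num_workers);
            break;   /* exit the creation loop */
        }
        num_workers = i + 1;   /* count only successful creations */
    }
    return num_workers;
}

/* Joins the workers that were actually created. */
void stop_workers(void) {
    for (int i = 0; i < num_workers; i++) {
        pthread_join(workers[i], NULL);
    }
    num_workers = 0;
}

/* Illustration only: succeeds twice, then fails with EAGAIN. */
static int stub_calls = 0;
int create_failing_after_two(pthread_t *t, const pthread_attr_t *a,
                             void *(*fn)(void *), void *arg) {
    if (stub_calls++ >= 2) {
        return EAGAIN;
    }
    return pthread_create(t, a, fn, arg);
}
```

With the stub, `start_workers(8, create_failing_after_two)` is expected to leave two usable workers and one message on stderr; passing `pthread_create` itself gives the normal path. A caller that wants stricter behaviour can compare the returned count with the requested one and decide for itself whether to abort.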

martin-frbg avatar May 13 '18 13:05 martin-frbg

Is there any workaround for the moment?

cipri-tom avatar May 24 '18 22:05 cipri-tom

Depends on your use case; trivially, you could build OpenBLAS single-threaded. Otherwise, perhaps reverting the patch and removing the "exit" statement from the previous code would be a start, and/or checking in your own code whether thread creation can be expected to work before calling OpenBLAS. Does what I sketched out in my earlier message two weeks ago sound plausible?
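As a rough illustration of the "checking in your own code" idea, an application could probe how many threads it is currently able to create. This is only a sketch: `probe_thread_capacity`, `probe_main` and `probe_gate` are hypothetical names, and a successful probe does not guarantee that a later pthread_create inside the library will succeed.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical helper names; not part of OpenBLAS. */
static pthread_mutex_t probe_gate = PTHREAD_MUTEX_INITIALIZER;

/* Each probe thread blocks on the gate so that all probes exist at once. */
static void *probe_main(void *arg) {
    (void)arg;
    pthread_mutex_lock(&probe_gate);
    pthread_mutex_unlock(&probe_gate);
    return NULL;
}

/* Tries to create `wanted` threads that are alive simultaneously and
   returns how many could actually be created. */
int probe_thread_capacity(int wanted) {
    enum { PROBE_MAX = 256 };
    pthread_t probes[PROBE_MAX];
    int created = 0;

    assert(wanted >= 0 && wanted <= PROBE_MAX);

    pthread_mutex_lock(&probe_gate);      /* hold the probes back */
    while (created < wanted &&
           pthread_create(&probes[created], NULL, probe_main, NULL) == 0) {
        created++;
    }
    pthread_mutex_unlock(&probe_gate);    /* let them all finish */

    for (int i = 0; i < created; i++) {
        pthread_join(probes[i], NULL);
    }
    return created;
}
```

If the result is lower than the thread count the application intends to use, it could request fewer threads, for example through the OPENBLAS_NUM_THREADS environment variable or openblas_set_num_threads(). Whether such a check runs early enough depends on when the library starts its worker threads, so that part is an assumption to verify for a given build.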

martin-frbg avatar May 26 '18 04:05 martin-frbg