OpenBLAS
OpenBLAS copied to clipboard
WIP: allow threading backend to be replaced by caller
This PR adds a new function openblas_set_threads_callback
that allows the caller to register a callback in order to change the threading backend at runtime — instead of spawning its own threads and passing work to them, OpenBLAS will pass an array of work (an array of blas_queue_t
) to the callback, which can execute the work in serial or in parallel using whatever mechanism it wants.
Motivation: this allows OpenBLAS threading to be composable with caller multithreading, instead of having OpenBLAS and the caller fight over the same CPUs. In particular, this enables OpenBLAS threading to be composable with:
- Julia's new partr threading scheduler, as discussed in JuliaLang/julia#32786
- Callers using Intels's Threading Building Blocks (TBB), Cilk, etcetera.
Because registering the callback is done at runtime, people wanting to use this feature won't need to install a separate compiled OpenBLAS binary, but can instead use an existing threaded binary.
The design and implementation is very similar to the new pluggable threading backend for FFTW (FFTW/fftw3#175) and Blosc (Blosc/c-blosc2#81), which worked out very well for us. My expectation is that the code additions will be fairly minimal—a few dozen lines plus some trivial refactoring—with a practical overhead of a single if
statement in the exec_blas
function for people not using this feature.
This PR is an early work-in-progress, but I wanted to get some early feedback. To do:
- [ ] Testing
- [ ] Add to pthreads backend (
blas_server.c
) - [ ] Add to win32 backend (
blas_server_win32.c
) - [ ] Documentation
In particular, I'm thinking the prototypical user callback could look like:
void callback(void *callback_data, int sync, void (*dojob)(int, void *, void *), int numjobs, size_t jobdata_elsize, void *jobdata, void *dojob_data) {
int i;
for (i = 0; i < numjobs; ++i) // in parallel
dojob(THREADID, ((char *) jobdata) + ((unsigned) i)*jobdata_elsize, dojob_data);
if (sync)
// wait for dojob calls to complete
}
where the for
loop is executed in parallel in some way determined by the caller, and THREADID
is some caller code for determining the current thread number (0
to nthreads-1
, where nthreads
was the value passed to omp_set_num_threads
).
(Note that this will also allow people using OpenMP to link to a pthreads OpenBLAS binary without recompiling OpenBLAS — with a 10-line callback using #pragma omp parallel for
, they can tell the pthreads OpenBLAS to use their OpenMP threads instead of its own pthreads.)
That's an intriguing concept... I just hope it does not wake any sleeping bugs.
Is further work required to move this forward? Thanks for pushing composable threading, @stevengj!
IF this is expected to work on Win32, it will need adjustments in blas_server_win32.c similar to those in/for blas_server_omp.c (this is what causes one of the two remaining Travis failures, the other is an unrelated issue with powerpc)
Is further work required to move this forward?
Yes, a significant amount of work: see un-checked to-do items above. It looks like the most effort is required to adapt blas_server.c
, which I'm currently working on now.
I haven't had a chance to work on it for a few days, but I'm hoping to get back to it next week.
Right, somehow I missed the checklist at the bottom of your post 😬
@stevengj: I haven't had a chance to work on it for a few days, but I'm hoping to get back to it next week.
Any update on this? :)
@stevengj 18 month later, just checking in where things stand on this... 🙃
Would be great to get this done!
Looks very interesting. Do you think it would provide a solution to this issue https://github.com/xianyi/OpenBLAS/issues/3187 ?
rebased for convenience (it is not immediately clear to me how to apply the proposed changes to the non-OMP server files) 2255rebased.txt
@stevengj Any update on this? Can you please share your email id, would like to discuss few doubts. Your help is highly appreciated.
closing as superseded by #4577 - nevertheless an important contribution, and I regret not having been able to advance it at the time