Build openBLAS with oneTBB parallelism support

Open vineel96 opened this issue 2 years ago • 9 comments

Hello, I have the following questions:

  1. How can I build OpenBLAS with oneTBB parallelism support instead of OpenMP?
  2. Which combination is better to build OpenBLAS with: OpenBLAS + OpenMP, OpenBLAS + oneTBB, or OpenBLAS + pthreads?

vineel96 avatar May 28 '23 10:05 vineel96

There is currently no support for TBB in OpenBLAS. What people appear to be doing (with varying success, if you search for earlier issues mentioning TBB) is to use a single-threaded build of OpenBLAS with their TBB-parallelized program. The choice of pthreads or OpenMP depends largely on what your main program uses. At least in theory, OpenMP would offer better thread safety and easier thread affinity handling, but may incur some overhead. On many/most platforms, OpenMP is implemented on top of pthreads anyway, and if your main program uses OpenMP you will want OpenBLAS to be either built with OpenMP or built single-threaded, to avoid having two sets of threads that do not know of each other and compete for resources.

martin-frbg avatar May 28 '23 11:05 martin-frbg
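
To illustrate the point above about two sets of threads competing for resources, here is a minimal Python sketch of how a launcher could budget BLAS threads per worker. The helper names `blas_thread_budget` and `blas_env` are hypothetical and not part of OpenBLAS; the only OpenBLAS-related facts assumed are the `OPENBLAS_NUM_THREADS` and `OMP_NUM_THREADS` environment variables.

```python
import os


def blas_thread_budget(total_cores, outer_workers):
    """Threads each outer worker may hand to BLAS (hypothetical helper).

    Returns at least 1. When outer_workers <= total_cores, the product
    outer_workers * budget never exceeds total_cores.
    """
    if total_cores < 1 or outer_workers < 1:
        raise ValueError("total_cores and outer_workers must be positive")
    return max(1, total_cores // outer_workers)


def blas_env(total_cores, outer_workers, base_env=None):
    """Copy of the environment with BLAS thread limits set (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    threads = str(blas_thread_budget(total_cores, outer_workers))
    # Read by OpenBLAS's own threading layer.
    env["OPENBLAS_NUM_THREADS"] = threads
    # Read by OpenMP runtimes, relevant for an OpenMP build of OpenBLAS.
    env["OMP_NUM_THREADS"] = threads
    return env
```

With 8 cores and 8 application-level workers this gives each worker a single BLAS thread, which is roughly the "single-threaded OpenBLAS under a parallel program" setup described above. The returned dictionary could be passed as `env=` to `subprocess.run` when starting the worker processes.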

Hi @martin-frbg, thanks for the reply.

  1. Is there a way to modify the Makefile to build OpenBLAS with oneTBB, similar to how OpenMP is given as an option when building OpenBLAS? Or is the only way to manually write oneTBB-based code and use a single-threaded OpenBLAS?
  2. I am trying to build OpenBLAS with oneTBB so that it is algorithm/code independent, meaning that whatever algorithm runs on top of OpenBLAS should make use of oneTBB without the algorithm itself being modified. Is this the right approach? And can we expect any performance gain from doing this?

vineel96 avatar May 29 '23 15:05 vineel96

There is currently no support for this, only an unfinished concept from four years ago in PR #2255.

martin-frbg avatar May 29 '23 16:05 martin-frbg

  1. TBB equivalents of the OMP PARALLEL pragmas
  2. TBB mutexes
  3. TBB malloc in place of the current sbrk/mmap allocator

brada4 avatar May 30 '23 20:05 brada4

Hi @martin-frbg and @brada4, thanks for the comments. I have some questions:

  1. Are the file changes proposed in the pull request https://github.com/xianyi/OpenBLAS/pull/2255 not enough for it to be merged into OpenBLAS? Why was it not merged or taken further, if it boosts performance?
  2. Also, can you suggest any room for improvement in the threading part of OpenBLAS that would help improve parallelism and speed? Is there a GSoC project on this, or any other scope for improving the threading part?
  3. I intend to improve the threading performance of the scikit-learn library, which uses OpenBLAS. Do you have any ideas or leads regarding improving the threading and parallelism of scikit-learn?

vineel96 avatar Aug 04 '23 04:08 vineel96

  1. The PR has a list of unfinished tasks at the top; I'd like to see at least the pthreads backend changes implemented (and some testing, of course). Some people expressed interest in the changes, but nobody got around to actually doing anything, not even during the time when it could trivially be merged into a local checkout for testing. (And even now I expect the merge conflicts that are currently flagged by git can easily be resolved again; it is just that I did not get around to that yet.)
  2. No GSoC project, and probably nobody with the time to provide the associated mentoring, just plans and ongoing work to identify any remaining bottlenecks, e.g. from excessive locking or use of too many threads for a given task. OpenBLAS does not have a big team of developers behind it, and never had.
  3. I'm not familiar with scikit-learn, but I suspect it is using BLAS functions through NumPy. It would probably help to know which BLAS functions are involved, the "typical" matrix sizes in what people use scikit-learn for, and to get some fair comparison figures for OpenBLAS vs. some other library on the same kind of hardware. So far there has only been #3925, which is basically "MKL on high-performance hardware is faster than OpenBLAS on low-performance hardware" with no data that would allow reproducing the reported problem.

martin-frbg avatar Aug 04 '23 05:08 martin-frbg
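
As a rough sketch of what "fair comparison figures" for a given matrix size could look like, the following stdlib-only Python measures a best-of-N time and converts it to GFLOP/s using the usual 2*m*n*k operation count for a matrix product. All function names here are hypothetical, and `naive_matmul` is only a stand-in workload so that the example runs without any BLAS installed; a real measurement would time a BLAS-backed call instead.

```python
import time


def gemm_flops(m, n, k):
    """Approximate floating-point operations for C = A @ B (A is m-by-k, B is k-by-n)."""
    return 2 * m * n * k


def gflops(m, n, k, seconds):
    """Throughput in GFLOP/s for one m-by-k times k-by-n product taking `seconds`."""
    return gemm_flops(m, n, k) / seconds / 1e9


def best_time(fn, repeats=5):
    """Smallest wall-clock time in seconds over `repeats` calls of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def naive_matmul(a, b):
    """Pure-Python matrix product on lists of rows; a stand-in, not a BLAS call."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]
```

For example, `best_time(lambda: naive_matmul(a, b))` with two 64-by-64 matrices gives a time that can be passed to `gflops(64, 64, 64, seconds)`. Repeating this for the matrix shapes that actually occur in the workload, on the same machine for each library, would give the kind of reproducible numbers asked for above.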

Hello @martin-frbg, thank you very much for the replies and suggestions.

  3. As of now, scikit-learn uses the gemm() function from scipy.linalg.cython_blas, mainly for matrix multiplication (e.g. in k-means). Are you saying that knowing the sizes of the most commonly used matrices would be useful for optimizing the algorithm further?
  4. Another question: is this PR (https://github.com/xianyi/OpenBLAS/pull/2255) equivalent to building OpenBLAS with no threading support at all, i.e. without pthreads and OpenMP?
  5. Also, is it possible to build a non-threaded version of OpenBLAS, i.e. to build it from source without pthreads and without OpenMP? If yes, what are the instructions for building it?

vineel96 avatar Aug 29 '23 15:08 vineel96

Please open a new issue, since this is not related to TBB. The documentation is in Makefile.rule: build with USE_THREAD=0 USE_LOCKING=1.

brada4 avatar Aug 29 '23 16:08 brada4
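
For reference, the build options mentioned in the reply above can be assembled programmatically. This is a minimal sketch, assuming a build script that drives `make` from Python; `openblas_make_args` is a hypothetical helper, `USE_THREAD` and `USE_LOCKING` come from the reply, and `USE_OPENMP` is, as far as I know, the corresponding Makefile.rule switch for an OpenMP build (check Makefile.rule in your checkout to confirm).

```python
def openblas_make_args(threading="none", locking=True):
    """Build a `make` argument list for an OpenBLAS threading mode (hypothetical helper).

    threading: "none" (single-threaded), "pthreads", or "openmp".
    locking:   only used for "none"; adds USE_LOCKING=1 so that the
               single-threaded library can be called from several threads
               of the calling program.
    """
    if threading == "none":
        args = ["USE_THREAD=0"]
        if locking:
            args.append("USE_LOCKING=1")
    elif threading == "pthreads":
        args = ["USE_THREAD=1", "USE_OPENMP=0"]
    elif threading == "openmp":
        # Assumption: USE_OPENMP=1 selects the OpenMP backend.
        args = ["USE_THREAD=1", "USE_OPENMP=1"]
    else:
        raise ValueError("threading must be 'none', 'pthreads' or 'openmp'")
    return ["make"] + args
```

Calling `openblas_make_args()` yields the single-threaded, locking-enabled build from the reply, and the list could be passed to `subprocess.run(..., cwd=<path to the OpenBLAS source tree>)`.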

Hi @brada4, Thank you for the reply.

vineel96 avatar Sep 04 '23 05:09 vineel96

Now fixed by #4577.

martin-frbg avatar Apr 22 '24 17:04 martin-frbg