
BLAS/LAPACK low-level wrappers

Open albestro opened this issue 4 years ago • 0 comments

In our codebase we define wrappers for both BLAS and LAPACK, which give us two main features:

  • make these calls tile-oriented (in contrast to the memory-oriented original low-level calls)
  • have a single entry point for both the CPU and GPU versions of the routine (via the callable object)

In some cases we also need to compose and use these functions in a less general way that requires access to the original low-level routine (see https://github.com/eth-cscs/DLA-Future/pull/513#discussion_r840334204 for an example). There, even though we no longer deal with tiles directly, it would still be desirable to have a single entry point for the different backends, so that the code can be generalized rather than duplicated.

The first thought is to split the two main features across two levels:

  • 1st level: tile-oriented wrappers, which will be responsible for "unpacking" the tile parameters and passing them to the low-level routine wrapper;
  • 2nd level: a new low-level routine wrapper, which will provide a single entry point for the different backends available.
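
A minimal sketch of this two-level split, for illustration only. All names here (`CpuBackend`, `GpuBackend`, `Tile`, `lowlevel::scal`, `scal_tile`) are hypothetical and are not DLA-Future's actual types. The CPU routine is a plain loop standing in for a real BLAS call so that the example compiles without any BLAS or GPU library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical backend tags (assumption, not DLA-Future's real types).
struct CpuBackend {};
struct GpuBackend {};

// Hypothetical, minimal column-major tile view.
struct Tile {
  double* ptr;
  std::size_t rows, cols, ld;
};

// 2nd level: low-level wrappers working on raw pointers and sizes,
// with one overload per backend.
namespace lowlevel {
// Plain loop standing in for a BLAS "scal" call on the CPU.
inline void scal(CpuBackend, std::size_t n, double alpha, double* x, std::size_t incx) {
  for (std::size_t i = 0; i < n; ++i)
    x[i * incx] *= alpha;
}
// A GpuBackend overload would forward to the vendor library and would need
// a handle; it is intentionally left out of this sketch.
}  // namespace lowlevel

// 1st level: tile-oriented wrapper that only "unpacks" the tile
// and delegates to the low-level wrapper of the selected backend.
template <class Backend>
void scal_tile(Backend backend, double alpha, const Tile& t) {
  assert(t.ld >= t.rows);
  for (std::size_t j = 0; j < t.cols; ++j)
    lowlevel::scal(backend, t.rows, alpha, t.ptr + j * t.ld, 1);
}

// Usage: scale a 3x2 tile stored contiguously (ld = 3) by 2.
inline std::vector<double> demo_scal_tile() {
  std::vector<double> data{1, 2, 3, 4, 5, 6};
  scal_tile(CpuBackend{}, 2.0, Tile{data.data(), 3, 2, 3});
  return data;
}
```

Code that does not deal with tiles can call `lowlevel::scal` directly and still select the backend through the tag, which is the property the second level is meant to provide.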

Unfortunately, this is not a complete solution, so I opened this issue to collect problems, requirements and thoughts, and to see if we can engineer a better design.

I will start with a couple of problems we are already aware of:

  1. CPU and GPU wrappers do not have the same number of parameters (i.e. the CPU ones do not take a handle)
  2. some wrappers require the handle as the first parameter, others as the last one
  3. there isn't just a single handle type for all wrappers (cublas, cusolver, custream, ... ?HIP, ?ROCM)
  4. taken together, the previous points become problematic when several wrappers have to be composed inside a single function (e.g. composing a cublas call with a cusolver call requires two different handles, so it is not even clear what the wrapping function should accept; moreover, if it has to be generic, it must also be valid for the CPU, which does not require any handle).

A random thought, just as a seed to start the discussion: create a "generic" handle valid for all kinds of wrappers (fixing point 3), passed to all wrappers (fixing point 1) as the last parameter (fixing point 2), which contains all the available handles (not sure whether this is feasible).
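
One possible shape for such a generic handle, as a rough sketch and not a worked-out design. `Context`, `FakeBlasHandle`, `FakeSolverHandle`, `scal`, `sqrt_first` and `scale_then_sqrt_first` are hypothetical names. The fake handle types stand in for the vendor handle types so that the example compiles without CUDA, and the "GPU" overloads run on the host and only count calls.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Stand-ins for vendor handle types (assumption: real code would hold
// the cublas and cusolver handles here instead).
struct FakeBlasHandle { int calls = 0; };
struct FakeSolverHandle { int calls = 0; };

struct CpuBackend {};
struct GpuBackend {};

// Generic per-backend context bundling every handle a wrapper might need.
template <class Backend>
struct Context;

template <>
struct Context<CpuBackend> {};  // the CPU needs no handle at all

template <>
struct Context<GpuBackend> {
  FakeBlasHandle* blas;
  FakeSolverHandle* solver;
};

// Every wrapper takes the context as its last parameter, on every backend.
inline void scal(std::size_t n, double alpha, double* x, const Context<CpuBackend>&) {
  for (std::size_t i = 0; i < n; ++i) x[i] *= alpha;
}
inline void scal(std::size_t n, double alpha, double* x, const Context<GpuBackend>& ctx) {
  ++ctx.blas->calls;  // a real wrapper would pass ctx.blas to the vendor call
  for (std::size_t i = 0; i < n; ++i) x[i] *= alpha;  // host stand-in
}

// Toy "solver-like" operation: replaces x[0] with its square root.
inline void sqrt_first(std::size_t n, double* x, const Context<CpuBackend>&) {
  assert(n > 0);
  x[0] = std::sqrt(x[0]);
}
inline void sqrt_first(std::size_t n, double* x, const Context<GpuBackend>& ctx) {
  assert(n > 0);
  ++ctx.solver->calls;  // a real wrapper would use the solver handle here
  x[0] = std::sqrt(x[0]);
}

// Composition is written once; each wrapper picks the handle it needs.
template <class Backend>
void scale_then_sqrt_first(std::size_t n, double alpha, double* x, const Context<Backend>& ctx) {
  scal(n, alpha, x, ctx);
  sqrt_first(n, x, ctx);
}

// Usage on the CPU: {2, 8} is scaled to {4, 16}, then x[0] becomes 2.
inline double demo_cpu() {
  double x[2] = {2.0, 8.0};
  scale_then_sqrt_first(2, 2.0, x, Context<CpuBackend>{});
  return x[0] + x[1];
}
```

In this sketch the composing function never names a specific handle type, so the same body serves both backends. Whether a single context can realistically own all vendor handles (and streams) is exactly the open question raised above.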

albestro avatar Apr 01 '22 14:04 albestro