Feature: gemmd product with 4th loop parallelization
Summary:
- Implement the `gemmd` product, which is `gemm` with a diagonal matrix of "weights" inserted in the middle. Formally, compute `A * diag(d) * B` for `A (m x k)`, `B (k x n)`, `d (k)`. (A minimal reference sketch of the operation follows below.)
- Enable parallelization of the 4th loop (the PC loop), currently via OpenMP only.
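For reference, here is a minimal, unoptimized sketch of what `gemmd` computes, assuming row-major double-precision storage. The function name and signature are illustrative only and are not the sandbox API:

```c
#include <stddef.h>

/* Naive reference for C := A * diag(d) * B, with A (m x k), B (k x n),
 * d of length k, and all operands stored row-major in double precision.
 * This only illustrates the math; the sandbox implementation avoids
 * forming either intermediate product A*diag(d) or diag(d)*B. */
void gemmd_ref( size_t m, size_t n, size_t k,
                const double* A, const double* d, const double* B,
                double* C )
{
    for ( size_t i = 0; i < m; ++i )
        for ( size_t j = 0; j < n; ++j )
        {
            double sum = 0.0;
            for ( size_t p = 0; p < k; ++p )
                sum += A[ i*k + p ] * d[ p ] * B[ p*n + j ];
            C[ i*n + j ] = sum;
        }
}
```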
Use case:
- For matrices with large k and relatively smaller n and m, computing either of the intermediate products `Ad` or `dB` is wasteful, since both have a dimension of size k. This costs both the time to compute and the memory to hold the large intermediate result. `gemmd` can compute the result without the overhead of evaluating the intermediate product.
- We additionally parallelize the PC loop because it is the only loop over the k dimension. As `gemmd` is most useful when k is large, parallelizing this loop can have a major impact on performance in this use case (a rough sketch follows below).
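The sandbox parallelizes the PC loop inside the BLIS blocked algorithm itself. As a rough, standalone illustration of why partitioning the k dimension needs care, here is a sketch (not the sandbox code) that splits k across OpenMP threads and reduces the per-thread partial results into C:

```c
#include <omp.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative only: each thread computes a partial C from its slice of
 * the k dimension, then the partial results are reduced. The real
 * PC-loop parallelization operates on packed blocks, but the need to
 * accumulate contributions from different k ranges is the same. */
void gemmd_k_parallel( size_t m, size_t n, size_t k,
                       const double* A, const double* d, const double* B,
                       double* C )
{
    memset( C, 0, m * n * sizeof( double ) );

    #pragma omp parallel
    {
        double* Cl = calloc( m * n, sizeof( double ) );  /* per-thread partial C */
        int nt = omp_get_num_threads();
        int id = omp_get_thread_num();

        /* Each thread handles a contiguous slice [p0, p1) of the k dimension. */
        size_t p0 = ( k * ( size_t )id ) / nt;
        size_t p1 = ( k * ( size_t )( id + 1 ) ) / nt;

        for ( size_t i = 0; i < m; ++i )
            for ( size_t p = p0; p < p1; ++p )
            {
                double aid = A[ i*k + p ] * d[ p ];
                for ( size_t j = 0; j < n; ++j )
                    Cl[ i*n + j ] += aid * B[ p*n + j ];
            }

        /* Reduce the per-thread partial results into C. */
        #pragma omp critical
        for ( size_t ij = 0; ij < m * n; ++ij )
            C[ ij ] += Cl[ ij ];

        free( Cl );
    }
}
```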
Sorry for the delay in looking at this, @JerryMaoQC. I do aim to get to it soon.
@JerryMaoQC I noticed you chose the sandbox name `gemmd`. However, your operation is still called `gemm` (with APIs via `bls_gemm()`, `bls_gemm_ex()`, `bls_?gemm()`). Was this name (and the API names) chosen intentionally? If not, I'd be happy to help you change the filenames and function names.
@fgvanzee and @JerryMaoQC what is the impetus to include this in BLIS mainline?
@devinamatthews I suggested that @JerryMaoQC could submit it since there is no harm (that I could see) in having the extra sandbox directory there for posterity and in case others want to study and/or build on his work.
Sure. No objection, I was just curious.