Feature: gemmd product with 4th loop parallelization
Summary:
- Implement the `gemmd` product, which is `gemm` with a diagonal matrix of "weights" inserted in the middle. Formally, compute `A * diag(d) * B` for `A (m x k)`, `B (k x n)`, `d (k)`. (A minimal reference sketch of the operation follows below.)
- Enable parallelization of the 4th loop (the PC loop), currently via OpenMP only.
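For reference, here is a minimal, unoptimized sketch of what `gemmd` computes, assuming row-major double-precision storage. The function name and signature are illustrative only and are not the sandbox API:

```c
#include <stddef.h>

/* Naive reference for C := A * diag(d) * B, with A (m x k), B (k x n),
 * d of length k, and all operands stored row-major in double precision.
 * This only illustrates the math; the sandbox implementation avoids
 * forming either intermediate product A*diag(d) or diag(d)*B. */
void gemmd_ref( size_t m, size_t n, size_t k,
                const double* A, const double* d, const double* B,
                double* C )
{
    for ( size_t i = 0; i < m; ++i )
        for ( size_t j = 0; j < n; ++j )
        {
            double sum = 0.0;
            for ( size_t p = 0; p < k; ++p )
                sum += A[ i*k + p ] * d[ p ] * B[ p*n + j ];
            C[ i*n + j ] = sum;
        }
}
```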
Use case:
- For matrices with large k and relatively smaller n and m, computing either of the intermediate products `Ad` or `dB` is wasteful, since both have a dimension of size k. This costs both the time to compute and the memory to hold the large intermediate result. `gemmd` can compute the result without the overhead of evaluating the intermediate product.
- We additionally parallelize the PC loop because it is the only loop over the k dimension. As `gemmd` is most useful when k is large, parallelizing this loop can have a major impact on performance in this use case (a rough sketch follows below).
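The sandbox parallelizes the PC loop inside the BLIS blocked algorithm itself. As a rough, standalone illustration of why partitioning the k dimension needs care, here is a sketch (not the sandbox code) that splits k across OpenMP threads and reduces the per-thread partial results into C:

```c
#include <omp.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative only: each thread computes a partial C from its slice of
 * the k dimension, then the partial results are reduced. The real
 * PC-loop parallelization operates on packed blocks, but the need to
 * accumulate contributions from different k ranges is the same. */
void gemmd_k_parallel( size_t m, size_t n, size_t k,
                       const double* A, const double* d, const double* B,
                       double* C )
{
    memset( C, 0, m * n * sizeof( double ) );

    #pragma omp parallel
    {
        double* Cl = calloc( m * n, sizeof( double ) );  /* per-thread partial C */
        int nt = omp_get_num_threads();
        int id = omp_get_thread_num();

        /* Each thread handles a contiguous slice [p0, p1) of the k dimension. */
        size_t p0 = ( k * ( size_t )id ) / nt;
        size_t p1 = ( k * ( size_t )( id + 1 ) ) / nt;

        for ( size_t i = 0; i < m; ++i )
            for ( size_t p = p0; p < p1; ++p )
            {
                double aid = A[ i*k + p ] * d[ p ];
                for ( size_t j = 0; j < n; ++j )
                    Cl[ i*n + j ] += aid * B[ p*n + j ];
            }

        /* Reduce the per-thread partial results into C. */
        #pragma omp critical
        for ( size_t ij = 0; ij < m * n; ++ij )
            C[ ij ] += Cl[ ij ];

        free( Cl );
    }
}
```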
Sorry for the delay in looking at this, @JerryMaoQC. I do aim to get to it soon.
@JerryMaoQC I noticed you chose the sandbox name `gemmd`. However, your operation is still called `gemm` (with APIs via `bls_gemm()`, `bls_gemm_ex()`, `bls_?gemm()`). Was this name (and the API names) chosen intentionally? If not, I'd be happy to help you change the filenames and function names.
@fgvanzee and @JerryMaoQC what is the impetus to include this in BLIS mainline?
@devinamatthews I suggested that @JerryMaoQC could submit it since there is no harm (that I could see) in having the extra sandbox directory there for posterity and in case others want to study and/or build on his work.
Sure. No objection, I was just curious.