Support full pre-packing of matrices A and/or B
Provide functionality to pre-pack entire matrices (A or B), which could then be passed into the framework and computed upon while bypassing the traditional matrix packing stages. This will require changes to the packing code, and perhaps to obj_t, to support the more complicated indexing needed to store a fully-packed matrix, as well as changes to the internal back-ends to ensure that the packing nodes of the control tree are skipped during computation.
Hi Field, I find this a good optimization idea, with two considerations:

- If we pack out-of-place, i.e., malloc a new buffer and pack the entire matrix into it, we can do it under the hood, but it will consume more memory and may turn out to be impossible on some platforms.
- If we pack in-place, we should give the user the choice of whether to allow BLIS to overwrite his/her input matrices, because (s)he may need them afterwards for other things. Who knows?
I am thinking of extending the trans_t type so that the user can choose the appropriate option:
- BLIS_?_TRANSPOSE_PREPACK_OUT_PLACE
- BLIS_?_TRANSPOSE_PREPACK_IN_PLACE
If I understand @fgvanzee's intentions correctly, the idea is to have a separate prepack function (or functions) so that one could do something like:
```c
double*  fixed_A  = ...;
double*  packed_A = ...;
double** Bs       = ...;
double** Cs       = ...;

bli_hypothetical_prepack_A(..., fixed_A, packed_A);

for (int i = 0; i < N; i++)
{
    bli_dgemm(..., packed_A, ..., Bs[i], ..., Cs[i]);
}
```
and so avoid the repeated cost of packing A. @hominhquan: of course, prepacking the entire matrix during a single operation may also be useful, as it also avoids some packing cost (except when the matrix is too large for the LLC), but due to the interface issues that you mention I think this is not the primary goal. Is this second use case important to you? If so, perhaps we could reevaluate how this might work.
I should also mention that the above example is logically a tensor contraction (A is 2-D, B and C are 3-D). You can also avoid the cost of repeatedly packing A (up to the point where B overruns nc) by doing this as a single operation a la TBLIS.
FYI Intel MKL has this functionality already. You can call xgemm_pack(...) to pack matrices and then xgemm_compute(...) to compute with them.
It might be nice to export the same interface that MKL does.
Documentation below:
- https://software.intel.com/en-us/mkl-developer-reference-fortran-gemm-compute
- https://software.intel.com/en-us/mkl-developer-reference-fortran-gemm-pack
OK, I hadn't understood it as a separate function in the API. If it is a separate function, like in MKL, then I agree it will be useful to users.
And I do not have a particular need for the second use case with TRANSPOSE, for the moment.
Thanks for your comments, folks. I don't have any objections to including the MKL-based API in the BLAS compatibility layer, but we should definitely have native interfaces that they build upon, as is already the case with all of the other computational routines in BLIS.
(BTW, sorry for my delayed response.)
Hi @fgvanzee, just wanted to know if anything has been done/achieved on this topic since? I am beginning to be interested in this feature and am eager to dive in if nothing has been done yet :-)
Quan
@hominhquan No, this has not yet risen to a priority level that merits my time (versus the other priorities we juggle).
Honestly, when I re-read this thread I only became more confused by the differences between various people's interpretations, and by whether I had properly understood those interpretations.