Add GEMM device function
This adds a GEMM device function that can be called from other kernels. The function is designed to be called by an entire wavefront and computes one block of the output matrix, so a problem can be decomposed into chunks that are each handled by an individual wavefront. Currently, it is used to implement an alternative rocsolver_gemm kernel.
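The wavefront-per-block decomposition described above can be sketched on the host in plain C++. This is a minimal illustration under stated assumptions, not the PR's code: the names `gemm_block` and `gemm_by_blocks`, the 16 x 16 tile size, and the column-major layout are all hypothetical, and the serial loop inside `gemm_block` stands in for the cooperative MFMA work that all lanes of one wavefront would do on the GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical tile sizes; the PR does not state the block dimensions it uses.
constexpr int TILE_M = 16;
constexpr int TILE_N = 16;

// Stand-in for the device function: on the GPU, one wavefront would compute
// this block cooperatively; here a single host call computes it serially.
// Computes C(i0:i0+mb, j0:j0+nb) = alpha * A * B + beta * C (column-major).
void gemm_block(int i0, int j0, int mb, int nb, int k,
                double alpha, const double* A, int lda,
                const double* B, int ldb,
                double beta, double* C, int ldc)
{
    assert(mb >= 0 && nb >= 0 && k >= 0);
    for (int j = j0; j < j0 + nb; ++j)
        for (int i = i0; i < i0 + mb; ++i) {
            double sum = 0.0;
            for (int p = 0; p < k; ++p)
                sum += A[i + p * lda] * B[p + j * ldb];
            C[i + j * ldc] = alpha * sum + beta * C[i + j * ldc];
        }
}

// Stand-in for a calling kernel: each "wavefront" id is mapped to exactly one
// output block, so the blocks are independent and can run concurrently.
void gemm_by_blocks(int m, int n, int k, double alpha,
                    const double* A, int lda, const double* B, int ldb,
                    double beta, double* C, int ldc)
{
    const int blocks_m = (m + TILE_M - 1) / TILE_M;
    const int blocks_n = (n + TILE_N - 1) / TILE_N;
    for (int wave = 0; wave < blocks_m * blocks_n; ++wave) {
        const int i0 = (wave % blocks_m) * TILE_M;
        const int j0 = (wave / blocks_m) * TILE_N;
        // Edge blocks may be smaller than a full tile.
        const int mb = std::min(TILE_M, m - i0);
        const int nb = std::min(TILE_N, n - j0);
        gemm_block(i0, j0, mb, nb, k, alpha, A, lda, B, ldb, beta, C, ldc);
    }
}
```

Because every output element belongs to exactly one block, no two wavefronts write the same entry of C, which is what makes this decomposition safe without synchronization between wavefronts.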
- Uses MFMA instructions
- Limited to __gfx90a__, __gfx940__, __gfx941__, and __gfx942__ architectures
- Supports both complex data types and transpose matrix operations
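For reference, the usual BLAS meaning of the transpose options for complex data can be written as a small host-side routine, for example as the kind of baseline a client test might compare against. This is a sketch only: the enum `Op` and the functions `op_elem` and `gemm_ref` are hypothetical and do not describe the PR's actual interface, which the thread does not show.

```cpp
#include <cassert>
#include <complex>
#include <vector>

using cdouble = std::complex<double>;

// Hypothetical operation flags mirroring the usual BLAS transA/transB options.
enum class Op { None, Transpose, ConjTranspose };

// Element (i, p) of op(X), where X is stored column-major with leading dimension ld.
cdouble op_elem(const cdouble* X, int ld, Op op, int i, int p)
{
    switch (op) {
    case Op::None:          return X[i + p * ld];
    case Op::Transpose:     return X[p + i * ld];
    case Op::ConjTranspose: return std::conj(X[p + i * ld]);
    }
    return {};
}

// Reference C = alpha * op(A) * op(B) + beta * C,
// where op(A) is m x k and op(B) is k x n.
void gemm_ref(Op opA, Op opB, int m, int n, int k, cdouble alpha,
              const cdouble* A, int lda, const cdouble* B, int ldb,
              cdouble beta, cdouble* C, int ldc)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            cdouble sum = 0.0;
            for (int p = 0; p < k; ++p)
                sum += op_elem(A, lda, opA, i, p) * op_elem(B, ldb, opB, p, j);
            C[i + j * ldc] = alpha * sum + beta * C[i + j * ldc];
        }
}
```

Note that for complex data the transpose and the conjugate transpose are different operations, so a test should exercise both.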
Hi @AGonzales-amd, I second Ed's suggestion: you should update rocsolver-test and -bench clients to support the internal gemm. I can help you with the tests if you want (this would also be a good opportunity to provide a concrete answer to the question you asked in #879).
Thanks @jmachado-amd, I could use your help. I did consider updating the client programs, but I had trouble exporting the function or making it visible to the clients.
Hi @jmachado-amd and @EdDAzevedo, the client programs have been updated to support the internal gemm. One thing I'm not sure about is the tolerance used for error checking.
Hi @AGonzales-amd, as long as the input matrices are "small" to "medium" sized and have positive, relatively small integer entries, the current test tolerance will work just fine! Let me know if you want to generalize the tests or just understand how those bounds work, and I can explain the important bits of the theory to you.
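One fact that plausibly contributes to this (our reading; the thread does not spell out the theory): products and partial sums of small integers are exactly representable in floating point as long as they stay below 2^24 in single precision (2^53 in double). In that regime the computed GEMM is exact regardless of summation order, so an MFMA accumulation order and a reference loop agree bit for bit. The helper names below are hypothetical and the check is a host-side sketch, not the rocSOLVER test code.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Dot product of row i of A (m x k) with column j of B (k x n), column-major,
// accumulated in float either forwards or backwards over p.
float dot_f32(const std::vector<float>& A, const std::vector<float>& B,
              int m, int k, int i, int j, bool reverse)
{
    float sum = 0.0f;
    for (int q = 0; q < k; ++q) {
        const int p = reverse ? k - 1 - q : q;
        sum += A[i + p * m] * B[p + j * k];
    }
    return sum;
}

// Returns true if a float GEMM of random positive integer matrices matches an
// exact int64 reference for both summation orders.
bool small_int_gemm_is_exact(int m, int n, int k, int max_entry, unsigned seed)
{
    // Precondition: every partial sum fits in float's 24-bit significand.
    assert(std::int64_t(k) * max_entry * max_entry < (std::int64_t(1) << 24));
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> dist(1, max_entry);
    std::vector<float> A(m * k), B(k * n);
    std::vector<std::int64_t> Ai(m * k), Bi(k * n);
    for (int i = 0; i < m * k; ++i) { Ai[i] = dist(gen); A[i] = float(Ai[i]); }
    for (int i = 0; i < k * n; ++i) { Bi[i] = dist(gen); B[i] = float(Bi[i]); }

    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            std::int64_t exact = 0;
            for (int p = 0; p < k; ++p) exact += Ai[i + p * m] * Bi[p + j * k];
            if (dot_f32(A, B, m, k, i, j, false) != float(exact)) return false;
            if (dot_f32(A, B, m, k, i, j, true) != float(exact)) return false;
        }
    return true;
}
```

Once entries or the inner dimension grow enough that partial sums exceed these limits, rounding errors appear and the result depends on accumulation order, which is where a tolerance derived from the floating-point error bounds becomes necessary.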
On another topic, I see that many gemm tests are failing on Windows; you probably want to look at that sooner rather than later.
The test failure in SYGVDX is likely unrelated. Forcing the merge.