[SYCL][CUDA][MATRIX] joint_matrix_bmad implementation
cc @dkhaldi
Implements the "Bitwise Multiply and Add" section of the matrix extension proposal in https://github.com/intel/llvm/pull/4695
Integration tests here: https://github.com/intel/llvm-test-suite/pull/760
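For reviewers unfamiliar with the operation, here is a minimal host-side sketch of what a bitwise multiply and add computes for single-bit matrices, assuming the semantics of the CUDA single-bit MMA operations: each output element is the accumulator plus the population count of a bitwise XOR or AND over packed bits. The names `bmad_reference` and `BitOp`, and the packed layout, are hypothetical and chosen for illustration only. They are not the `joint_matrix_bmad` interface from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical names for illustration; not part of the SYCL matrix extension.
enum class BitOp { XOR, AND };

// Portable population count (number of set bits in a 32-bit word).
inline int popcount32(uint32_t x) {
  int n = 0;
  while (x != 0) {
    x &= x - 1; // clear the lowest set bit
    ++n;
  }
  return n;
}

// Scalar reference: D[i][j] = C[i][j] + sum_k popcount(op(A[i][k], B[j][k])).
// a: M rows, each packed into k_words 32-bit words.
// b: N columns, each packed into k_words 32-bit words (assumed layout).
// c: M x N accumulator, row-major.
std::vector<int32_t> bmad_reference(const std::vector<uint32_t> &a,
                                    const std::vector<uint32_t> &b,
                                    const std::vector<int32_t> &c,
                                    std::size_t M, std::size_t N,
                                    std::size_t k_words, BitOp op) {
  assert(a.size() == M * k_words);
  assert(b.size() == N * k_words);
  assert(c.size() == M * N);

  std::vector<int32_t> d(c);
  for (std::size_t i = 0; i < M; ++i) {
    for (std::size_t j = 0; j < N; ++j) {
      int32_t acc = 0;
      for (std::size_t k = 0; k < k_words; ++k) {
        const uint32_t lhs = a[i * k_words + k];
        const uint32_t rhs = b[j * k_words + k];
        const uint32_t bits = (op == BitOp::XOR) ? (lhs ^ rhs) : (lhs & rhs);
        acc += popcount32(bits);
      }
      d[i * N + j] += acc;
    }
  }
  return d;
}
```

The bitwise operation takes the place of the element-wise product and the population count takes the place of the sum over k in an ordinary multiply and add.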
Hi @dkhaldi
If it is preferred for reviewing purposes, I could add the temporary/initial fp19 implementation, which uses uint32_t directly, to this PR. The uint32_t fp19 implementation should be more straightforward to review than the bmad cases: in the end we realized we can implement the fp19 cases in a way that is completely compliant with the existing matrix extension, whereas the bmad cases require a different interface.
Otherwise it is fine to put them up one at a time; I just thought it might be easier to review them together.
Thanks
I think separate PRs are better.
OK
/verify with https://github.com/intel/llvm-test-suite/pull/760