RAJAPerf
hip mfma tests
This PR adds a basic functionality test that uses the matrix cores on AMD gfx908 and gfx90a hardware for dense matrix products.
This is looking much better. The main thing to do now is to convert it to run in parallel on the GPU. I think it's fine if the work each thread does and the block size differ between tunings, as long as the tunings are still similar enough to count as variants of the same algorithm.
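To illustrate what "different tunings of the same algorithm" could mean here, below is a rough host-side C++ sketch, not the code in this PR. It assumes square row-major matrices and uses a hypothetical helper `tiled_matmul` whose template parameter `BLOCK` stands in for the GPU block size. Each tile would map to one thread block on the device; here the tiles are simply visited serially, and no MFMA instructions are involved.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the PR): C = A * B for square N x N
// row-major matrices, computed tile by tile. BLOCK plays the role of the
// tuning parameter; on a GPU each (bi, bj) tile would be one thread block.
template <std::size_t BLOCK>
std::vector<double> tiled_matmul(const std::vector<double>& A,
                                 const std::vector<double>& B,
                                 std::size_t N)
{
  std::vector<double> C(N * N, 0.0);
  for (std::size_t bi = 0; bi < N; bi += BLOCK) {
    for (std::size_t bj = 0; bj < N; bj += BLOCK) {
      // Work of one "block": every (i, j) in the tile is independent,
      // so these two loops are what would be spread across threads.
      for (std::size_t i = bi; i < bi + BLOCK && i < N; ++i) {
        for (std::size_t j = bj; j < bj + BLOCK && j < N; ++j) {
          double sum = 0.0;
          for (std::size_t k = 0; k < N; ++k) {
            sum += A[i * N + k] * B[k * N + j];
          }
          C[i * N + j] = sum;
        }
      }
    }
  }
  return C;
}
```

In this sketch, `tiled_matmul<1>` and `tiled_matmul<4>` partition the output differently but accumulate each element in the same order, so they produce identical results. That is the sense in which they would be tunings rather than separate algorithms.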