Matrix Transpose Tutorial Cleanup
I found a couple of things while looking at the transpose tutorial.
First, the launch and kernel solutions could use block_unchecked policies. This would also let the kernel implementation skip its second sync threads call.
Second, it doesn't look like the launch solution actually uses shared memory as intended. The same thread that reads a value also appears to write that value. The point of shared memory here is to let one thread read a value and a different thread write it, so that memory accesses to both matrices are coalesced. Fixing this will require a teamSync call between the read and the write, which the launch solution currently lacks.
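To make the intended access pattern concrete, here is a minimal CPU-only sketch in plain C++. It is not RAJA code and not the tutorial's solution: the loops over `ty` and `tx` stand in for the threads of a block, the gap between the two loop nests stands in for the barrier (teamSync in the launch version), and the names `TILE` and `tiled_transpose` are made up for illustration.

```cpp
#include <cassert>
#include <vector>

// Hypothetical tile width; the tutorial's actual tile size may differ.
constexpr int TILE = 4;

// Transpose a row-major n_rows x n_cols matrix A into At (n_cols x n_rows).
// The (ty, tx) loops model the threads of one block working on one tile.
std::vector<int> tiled_transpose(const std::vector<int>& A,
                                 int n_rows, int n_cols)
{
  std::vector<int> At(A.size());

  for (int by = 0; by < n_rows; by += TILE) {
    for (int bx = 0; bx < n_cols; bx += TILE) {
      int tile[TILE][TILE];  // models the shared-memory tile

      // Phase 1: "thread" (ty, tx) reads A(by + ty, bx + tx).
      // Consecutive tx values touch consecutive addresses of A.
      for (int ty = 0; ty < TILE; ++ty)
        for (int tx = 0; tx < TILE; ++tx)
          if (by + ty < n_rows && bx + tx < n_cols)
            tile[ty][tx] = A[(by + ty) * n_cols + (bx + tx)];

      // On a GPU a barrier is needed here: phase 2 reads tile entries
      // that were stored by other threads in phase 1.

      // Phase 2: "thread" (ty, tx) writes At(bx + ty, by + tx) using
      // tile[tx][ty], which "thread" (tx, ty) stored in phase 1.
      // Consecutive tx values touch consecutive addresses of At.
      for (int ty = 0; ty < TILE; ++ty)
        for (int tx = 0; tx < TILE; ++tx)
          if (bx + ty < n_cols && by + tx < n_rows)
            At[(bx + ty) * n_rows + (by + tx)] = tile[tx][ty];
    }
  }
  return At;
}
```

In both phases the fast index `tx` walks contiguous memory, which is only possible because the thread that writes an entry of At is generally not the thread that read it from A. If each thread wrote back the value it read itself, one of the two matrices would be accessed with a stride, and no barrier would appear to be needed.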
I think these examples were written before the unchecked policies existed. Is this the example you are looking at?
https://github.com/LLNL/RAJA/blob/develop/exercises/launch-matrix-transpose-local-array_solution.cpp#L304
We are missing a teamSync call there; it was probably a copy-paste error. Do you have time to make PRs to fix these?
It also appears that the HIP version, at least, needs a synchronization after the hipMemcpy to get the right answer.