hipblaslt-bench performance improvements
This improves performance of hipblaslt-bench when timing a large number of matmuls.
There are two main changes:
- Initialization of the A, B, and C matrices is done on the device instead of the host, which also avoids the host-to-device hipMemcpy. When a host copy is needed for the correctness check, the device matrix is still copied back to the host.
- A memory pool that reuses memory allocated with hipMalloc, hipMallocManaged, or hipHostMalloc. This might slightly increase memory use.
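For illustration, here is a minimal sketch of a size-keyed memory pool of the kind described above. This is not the PR's actual implementation: the real pool would wrap hipMalloc/hipHostMalloc and hipFree, while plain malloc/free stand in here so the idea can be shown without a GPU.

```cpp
#include <cassert>
#include <cstdlib>
#include <map>
#include <vector>

// Hypothetical sketch: buffers are cached by allocation size and reused
// instead of being freed, so repeated benchmark runs skip re-allocation.
class MemoryPool {
public:
    // Return a cached buffer of at least `size` bytes, or allocate a new one.
    void* acquire(size_t size) {
        auto it = free_list_.lower_bound(size);
        if (it != free_list_.end() && !it->second.empty()) {
            void* p = it->second.back();
            it->second.pop_back();
            in_use_[p] = it->first;
            return p;
        }
        void* p = std::malloc(size);  // real code would call hipMalloc here
        in_use_[p] = size;
        return p;
    }

    // Return the buffer to the pool instead of freeing it.
    void release(void* p) {
        auto it = in_use_.find(p);
        assert(it != in_use_.end());
        free_list_[it->second].push_back(p);
        in_use_.erase(it);
    }

    ~MemoryPool() {  // free everything on teardown (real code: hipFree)
        for (auto& [size, bufs] : free_list_)
            for (void* p : bufs) std::free(p);
        for (auto& [p, size] : in_use_) std::free(p);
    }

private:
    std::map<size_t, std::vector<void*>> free_list_;  // size -> cached buffers
    std::map<void*, size_t> in_use_;                  // live allocations
};
```

A second `acquire` of the same size after a `release` returns the cached pointer, which is where the "slightly increased memory use" trade-off comes from: buffers are held rather than freed.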
Both changes together significantly improve performance. With a test yaml input file containing ~500K tests, the code runs in ~42 minutes with these changes, compared to ~8 hours before (on an MI300).
I initially used rocRAND but then removed this dependency. I can add it back if it is preferred over my own naive random number generator.
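As a sketch of what such a "naive random number generator" could look like, here is a simple linear congruential generator. The constants are the common Numerical Recipes LCG parameters, chosen for illustration; the PR's actual generator may differ.

```cpp
#include <cstdint>

// Hypothetical naive RNG sketch (not the PR's actual generator): a 32-bit
// linear congruential generator that maps its state to a float in [0, 1),
// suitable for filling test matrices with reproducible values.
struct NaiveRng {
    uint32_t state;
    explicit NaiveRng(uint32_t seed) : state(seed) {}

    // Advance the LCG and return a float in [0, 1).
    float next() {
        state = state * 1664525u + 1013904223u;       // Numerical Recipes LCG
        return (state >> 8) * (1.0f / 16777216.0f);   // top 24 bits -> [0, 1)
    }
};
```

Because each thread or matrix can be seeded independently, a generator like this is trivial to run on the device without pulling in a library dependency.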
This is based on a commit by @bethune-bryant.
Do we have the execution time of hipblaslt-test with/wo this PR?
With these changes, `./hipblaslt-test` took

```
real 19m58.491s
user 35m32.051s
sys  0m38.180s
```

while the develop branch took

```
real 21m12.295s
user 114m49.692s
sys  2m34.752s
```

This was on an MI210 with 16 OpenMP threads.
In my testing, it essentially reduces total benchmark time to the sum of all kernel execution times, effectively removing all overhead.
Sorry, I had to make a few small fixes, but I think this is ready now.
@jichangjichang @KKyang Do y'all have any feedback on this?
Sorry, I got confused about num_batches[i] and block_count, but I think it is correct now.
Can you please review this logic?
> Do we have the execution time of hipblaslt-test with/wo this PR?
I can try to address the execution time of hipblaslt-test in a follow-up PR. For now we focused on improving hipblaslt-bench time when timing a large number of matmul operations.