hipblaslt-bench performance improvements
This improves performance of hipblaslt-bench when timing a large number of matmuls.
There are two main changes:
- Initialization of the A, B, and C matrices is done on the device instead of the host, which also avoids the host-to-device hipMemcpy. When a host copy is needed for the correctness check, the device matrix is still copied back to the host.
- A memory pool that reuses memory allocated with hipMalloc, hipMallocManaged, or hipHostMalloc. This might slightly increase memory use.
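For illustration, here is a minimal sketch of a size-keyed memory pool of the kind described above. This is not the PR's actual implementation: the real pool would wrap hipMalloc/hipHostMalloc and hipFree, while plain malloc/free stand in here so the idea can be shown without a GPU.

```cpp
#include <cassert>
#include <cstdlib>
#include <map>
#include <vector>

// Hypothetical sketch: buffers are cached by allocation size and reused
// instead of being freed, so repeated benchmark runs skip re-allocation.
class MemoryPool {
public:
    // Return a cached buffer of at least `size` bytes, or allocate a new one.
    void* acquire(size_t size) {
        auto it = free_list_.lower_bound(size);
        if (it != free_list_.end() && !it->second.empty()) {
            void* p = it->second.back();
            it->second.pop_back();
            in_use_[p] = it->first;
            return p;
        }
        void* p = std::malloc(size);  // real code would call hipMalloc here
        in_use_[p] = size;
        return p;
    }

    // Return the buffer to the pool instead of freeing it.
    void release(void* p) {
        auto it = in_use_.find(p);
        assert(it != in_use_.end());
        free_list_[it->second].push_back(p);
        in_use_.erase(it);
    }

    ~MemoryPool() {  // free everything on teardown (real code: hipFree)
        for (auto& [size, bufs] : free_list_)
            for (void* p : bufs) std::free(p);
        for (auto& [p, size] : in_use_) std::free(p);
    }

private:
    std::map<size_t, std::vector<void*>> free_list_;  // size -> cached buffers
    std::map<void*, size_t> in_use_;                  // live allocations
};
```

A second `acquire` of the same size after a `release` returns the cached pointer, which is where the "slightly increased memory use" trade-off comes from: buffers are held rather than freed.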
Both changes together significantly improve performance. With a test yaml input file containing ~500K tests, the code runs in ~42 minutes with these changes, compared to ~8 hours before (on an MI300).
I initially used rocRAND but then removed this dependency. I can add it back if it is preferred over my own naive random number generator.
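As a sketch of what such a "naive random number generator" could look like, here is a simple linear congruential generator. The constants are the common Numerical Recipes LCG parameters, chosen for illustration; the PR's actual generator may differ.

```cpp
#include <cstdint>

// Hypothetical naive RNG sketch (not the PR's actual generator): a 32-bit
// linear congruential generator that maps its state to a float in [0, 1),
// suitable for filling test matrices with reproducible values.
struct NaiveRng {
    uint32_t state;
    explicit NaiveRng(uint32_t seed) : state(seed) {}

    // Advance the LCG and return a float in [0, 1).
    float next() {
        state = state * 1664525u + 1013904223u;       // Numerical Recipes LCG
        return (state >> 8) * (1.0f / 16777216.0f);   // top 24 bits -> [0, 1)
    }
};
```

Because each thread or matrix can be seeded independently, a generator like this is trivial to run on the device without pulling in a library dependency.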
This is based on a commit by @bethune-bryant.
Do we have the execution time of hipblaslt-test with/wo this PR?
With these changes, `./hipblaslt-test` took

```
real 19m58.491s
user 35m32.051s
sys  0m38.180s
```

while the develop branch took

```
real 21m12.295s
user 114m49.692s
sys  2m34.752s
```

This was on an MI210 with 16 OpenMP threads.
In my testing, it essentially reduces total benchmark time to the sum of all kernel execution times, effectively removing all overhead.
Sorry, I had to make a few small fixes, but I think this is ready now.
@jichangjichang @KKyang Do y'all have any feedback on this?
Sorry, I got confused about num_batches[i] and block_count, but I think it is correct now.
Can you please review this logic?
> Do we have the execution time of hipblaslt-test with/wo this PR?
I can try to address the execution time of hipblaslt-test in a follow-up PR. For now we focused on improving hipblaslt-bench time when timing a large number of matmul operations.