hipblaslt-bench performance improvements

Open pghysels opened this issue 1 year ago • 2 comments

This improves performance of hipblaslt-bench when timing a large number of matmuls.

There are two main changes:

  • The A, B and C matrices are now initialized on the device instead of the host, which also avoids the hipMemcpy from host to device. When a host copy is needed for the correctness check, the device matrix is still copied back to the host.
  • A memory pool reuses memory allocated with hipMalloc, hipMallocManaged or hipHostMalloc. This might slightly increase memory use.
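To illustrate the memory pool idea, here is a minimal sketch, not the code from this PR. The class name `BufferPool` and its methods are hypothetical. It uses `std::malloc`/`std::free` as stand-ins so it runs without a GPU; in a HIP build the two callbacks would presumably wrap `hipMalloc`/`hipFree` (or `hipHostMalloc`/`hipHostFree`). Released buffers are cached by size and handed out again to any later request that fits, which is also why memory use can go up slightly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <map>
#include <unordered_map>
#include <utility>

// Hypothetical pool for illustration; names are not taken from hipBLASLt.
class BufferPool
{
public:
    using AllocFn = std::function<void*(std::size_t)>;
    using FreeFn  = std::function<void(void*)>;

    // Assumption: in a HIP build these callbacks would wrap hipMalloc/hipFree.
    // std::malloc/std::free are stand-ins so the sketch runs on the host.
    explicit BufferPool(AllocFn alloc   = [](std::size_t n) { return std::malloc(n); },
                        FreeFn  dealloc = [](void* p) { std::free(p); })
        : alloc_(std::move(alloc)), dealloc_(std::move(dealloc))
    {
    }

    BufferPool(const BufferPool&)            = delete;
    BufferPool& operator=(const BufferPool&) = delete;

    // Hand every buffer (cached or still in use) back to the backend.
    ~BufferPool()
    {
        for(auto& entry : free_)
            dealloc_(entry.second);
        for(auto& entry : live_)
            dealloc_(entry.first);
    }

    // Return a buffer of at least `bytes` bytes, reusing a cached one if possible.
    void* acquire(std::size_t bytes)
    {
        auto it = free_.lower_bound(bytes); // smallest cached block that fits
        if(it != free_.end())
        {
            void* p  = it->second;
            live_[p] = it->first;
            free_.erase(it);
            return p;
        }
        void* p = alloc_(bytes); // cache miss: go to the backend allocator
        if(p != nullptr)
        {
            live_[p] = bytes;
            ++backend_allocations_;
        }
        return p;
    }

    // Return a buffer to the pool instead of freeing it.
    void release(void* p)
    {
        auto it = live_.find(p);
        if(it == live_.end())
            return; // not owned by this pool
        free_.emplace(it->second, p);
        live_.erase(it);
    }

    std::size_t backend_allocations() const { return backend_allocations_; }

private:
    AllocFn                                alloc_;
    FreeFn                                 dealloc_;
    std::multimap<std::size_t, void*>      free_; // cached blocks, keyed by size
    std::unordered_map<void*, std::size_t> live_; // blocks currently handed out
    std::size_t                            backend_allocations_ = 0;
};
```

With this sketch, a benchmark loop that calls `acquire` and `release` once per matmul only reaches the backend allocator when no cached block is large enough, so repeated problem sizes incur no allocation cost after the first.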

Both changes together significantly improve performance. I have a test yaml input file with ~500K tests. With these changes, the code runs in ~42 minutes, compared to 8 hours before (on MI300).

I initially used rocRAND, but then removed this dependency. I can add that back if it is preferred over my own naive random number generator.
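The generator used in this PR is not described here, so the following is only a sketch of one common way to do dependency-free random initialization that suits device-side filling. It is a counter-based scheme: each element's value is a hash of the seed and the element index, so no state is shared between elements. The function names, the SplitMix64-style mixing step, and the [-1, 1) value range are all assumptions made for illustration. The host loop in `fill_random` stands in for a HIP kernel in which each thread would compute `random_value` for its own global index.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// SplitMix64-style mixing step used as a stateless hash (illustrative choice).
inline std::uint64_t mix64(std::uint64_t x)
{
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

// Value for element `idx`, in [-1, 1). It depends only on (seed, idx),
// so elements can be computed in any order or fully in parallel.
inline float random_value(std::uint64_t seed, std::uint64_t idx)
{
    std::uint64_t h = mix64(seed ^ mix64(idx));
    // Keep the top 24 bits so the integer is exactly representable as a float.
    float u = static_cast<float>(h >> 40) * (1.0f / 16777216.0f); // [0, 1)
    return 2.0f * u - 1.0f;
}

// Host stand-in for a device kernel: on the GPU, `i` would be the global thread index.
inline void fill_random(float* data, std::size_t n, std::uint64_t seed)
{
    for(std::size_t i = 0; i < n; ++i)
        data[i] = random_value(seed, i);
}
```

Because the result for an index never depends on other indices, the same seed reproduces the same matrix regardless of how the work is split across threads, which keeps correctness checks repeatable.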

This is based on a commit by @bethune-bryant

pghysels avatar Oct 03 '24 20:10 pghysels

Do we have the execution time of hipblaslt-test with and without this PR?

jichangjichang avatar Oct 04 '24 07:10 jichangjichang

With these changes, ./hipblaslt-test took:

real    19m58.491s
user    35m32.051s
sys     0m38.180s

while the develop branch took:

real    21m12.295s
user    114m49.692s
sys     2m34.752s

Both runs were on an MI210 with 16 OpenMP threads.

pghysels avatar Oct 04 '24 22:10 pghysels

> Both changes together significantly improve performance. I have a test yaml input file with ~500K tests. With these changes, the code runs in ~42 minutes, compared to 8 hours before (on MI300).

In my testing it basically reduces total benchmark time to the sum of all kernel execution times, effectively removing all overhead.

bethune-bryant avatar Oct 05 '24 04:10 bethune-bryant

Sorry, I had to make a few small fixes. But I think this is ready now.

pghysels avatar Oct 07 '24 16:10 pghysels

@jichangjichang @KKyang Do y'all have any feedback on this?

bethune-bryant avatar Oct 07 '24 19:10 bethune-bryant

Sorry, I got confused about num_batches[i] and block_count. But I think it is correct now. Can you please review this logic?

pghysels avatar Oct 09 '24 19:10 pghysels

> Do we have the execution time of hipblaslt-test with/wo this PR?

I can try to address the execution time of hipblaslt-test in a follow-up PR. For now, we focused on reducing hipblaslt-bench run time when timing a large number of matmul operations.

pghysels avatar Oct 10 '24 16:10 pghysels