`hipEventRecord` time depends heavily on number of streams, synchronization, and HIP version

Open msimberg opened this issue 2 years ago • 5 comments

We're using HIP events to poll for the completion of kernels and to trigger continuations once a kernel completes in a linear algebra library (https://github.com/eth-cscs/DLA-Future). We heavily use rocBLAS/SOLVER for the linear algebra kernels. We also create many streams to increase concurrency when operating on different parts of matrices. Moving from HIP 5.2.X to 5.4.X we observed a large (> 2x) slowdown for some algorithms. From profiling we see that the kernels themselves actually did not slow down at all. Investigating further, and minimizing one of our tests, one of the culprits seems to be hipEventRecord. The time taken by hipEventRecord is very inconsistent and I believe it's at least one of the causes of the slowdown we're seeing. Since we're also launching many kernels we're especially affected by this slowdown.
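For context, the pattern looks roughly like this (a simplified sketch; in the real code the work is rocBLAS/rocSOLVER calls and the polling happens in the library's scheduler):

#include <hip/hip_runtime.h>

__global__ void some_kernel() {}

void sketch()
{
    hipStream_t stream;
    hipStreamCreate(&stream);

    hipEvent_t event;
    // We only poll for completion, so timing is disabled.
    hipEventCreateWithFlags(&event, hipEventDisableTiming);

    // Launch work on the stream and record an event directly after it.
    hipLaunchKernelGGL(some_kernel, dim3(1), dim3(1), 0, stream);
    hipEventRecord(event, stream);

    // A polling loop elsewhere checks the event and, once it has completed,
    // triggers the continuation of the task.
    while (hipEventQuery(event) != hipSuccess) { /* try again later */ }

    hipEventDestroy(event);
    hipStreamDestroy(stream);
}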

https://gist.github.com/msimberg/6904f4b8b09ffbddace50a03b66af565#file-rocm_polling-cu contains a little test program that:

  • creates a number of streams
  • creates a number of events
  • records events on the streams (or a single separate stream)
  • waits for the events to complete

Additionally, it can optionally launch an empty kernel and call hipDeviceSynchronize at the beginning of an iteration (see the condensed sketch below).
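In condensed form, one iteration of the test does roughly the following (the gist is the authoritative version; this sketch omits the timing and error checking):

#include <hip/hip_runtime.h>
#include <cstddef>
#include <vector>

__global__ void empty_kernel() {}

// One iteration of the test, condensed: optionally warm up with an empty
// kernel plus hipDeviceSynchronize, then record one event per task on the
// streams and poll until all events have completed.
void iteration(std::vector<hipStream_t>& streams, std::vector<hipEvent_t>& events,
               bool launch_kernel_and_synchronize)
{
    if (launch_kernel_and_synchronize) {
        hipLaunchKernelGGL(empty_kernel, dim3(1), dim3(1), 0, 0);
        hipDeviceSynchronize();
    }

    for (std::size_t i = 0; i < events.size(); ++i) {
        hipEventRecord(events[i], streams[i % streams.size()]);
    }
    for (std::size_t i = 0; i < events.size(); ++i) {
        while (hipEventQuery(events[i]) != hipSuccess) {}
    }
}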

If I run the test program with (rocm_polling 6 256 5000 3):

  • 6 iterations, launching the empty kernel and calling hipDeviceSynchronize on the fourth iteration
  • 5000 events
  • 256 streams

with 5.2 I see a small performance improvement in the test after synchronizing. In our own miniapps, synchronizing at the start of the application doesn't seem to make a (big) difference. Typical output:

record: 0.0264502, total: 0.0269117
record: 0.0229541, total: 0.023369
record: 0.0227908, total: 0.0232039
synchronize
record: 0.0157944, total: 0.0162131
record: 0.015794, total: 0.0162075
record: 0.0157841, total: 0.0162085

With 5.4 typical output is:

record: 0.0286722, total: 0.0288229
record: 0.0280789, total: 0.0282264
record: 0.0276609, total: 0.0278028
synchronize
record: 0.00951174, total: 0.0159333
record: 0.00836649, total: 0.0158775
record: 0.00822481, total: 0.0158771

and in our own miniapps synchronizing at the start makes a big difference.

With 5.6 typical output is:

record: 0.0237216, total: 0.0239437
record: 0.0233935, total: 0.0236061
record: 0.022855, total: 0.023064
synchronize
record: 0.00617225, total: 0.0155317
record: 0.00525886, total: 0.0155088
record: 0.005132, total: 0.0154863

and in our own miniapps synchronizing at the start again makes a big difference.

Additionally, with 5.4, synchronizing without launching a dummy kernel only slightly improves the timings, i.e. not as dramatically as above with the dummy kernel. With 5.6, synchronizing without launching a dummy kernel makes no difference.

Obviously the test above is an imperfect proxy, as it doesn't fully correlate with the performance we see in our algorithms, but the trends match exactly what we observe. hipEventRecord may not be the only culprit; hipEventQuery, for example, may also be involved.

Finally, the overheads seem to increase with the number of streams. Above there are 256 streams, but reducing that to approximately 50 mostly erases any difference between versions and between synchronizing and not synchronizing.

We're seeing this on an MI250x system (specifically, the GPU partition on LUMI; rocminfo output). I make only one device visible from one MPI rank for the test.

The driver version is 5.16.9.22.20.

The 5.2 version is (this is the default available on LUMI):

> hipcc --version
HIP version: 5.2.21153-02187ecf
AMD clang version 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.2.3 22324 d6c88e5a78066d5d7a1e8db6c5e3e9884c6ad10e)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-5.2.3/llvm/bin

5.4 is (installed through spack):

HIP version: 5.4.22804-
AMD clang version 15.0.0
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /projappl/project_465000105/simbergm/spack-install-tree/23.03/0.19.2/llvm-amdgpu-5.4.3-fsdty35/bin

5.6 is (installed through spack):

HIP version: 5.6.31062-
AMD clang version 16.0.0
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /projappl/project_465000105/simbergm/spack-install-tree/23.03/0.19.2/llvm-amdgpu-5.6.1-4u5wf4a/bin

What am I observing here? Why is there such a strong dependence on the number of streams? Is it expected that hipStreamSynchronize or a dummy kernel would affect the timings this strongly? More practically, is there a reliable way we can guarantee the "good" behaviour (are there other HIP runtime functions that we can reliably call to force the "good" behaviour above)? Do you expect changes in the performance of events in the future?

msimberg avatar Nov 22 '23 11:11 msimberg

I think the dependence on the number of created streams (even if you do not use any of them) can be explained by the fact that during hipEventRecord(events[i], stream) the runtime needs to synchronize with the null stream. The way this is currently implemented, the runtime iterates through the whole set of created streams in order to find the right stream to synchronise with (the nullstream in this case).

To confirm that this is the case, and as a possible workaround, can you try to use a NonBlocking stream with hipEventRecord, i.e. create the stream with: check_error(hipStreamCreateWithFlags(&stream, hipStreamNonBlocking));

This flag indicates that the stream does not need any synchronisation with the nullstream (there is no work submitted to the nullstream here anyway), so the runtime will skip the iteration through the large stream set. This should be faster, and the timings should remain constant as the number of streams increases.
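In code, the suggested change is roughly the following (a sketch only; check_error here is a stand-in for whatever error-checking wrapper the test already uses):

#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstdlib>

// Minimal stand-in for the test's error-checking wrapper.
static void check_error(hipError_t err)
{
    if (err != hipSuccess) {
        std::fprintf(stderr, "HIP error: %s\n", hipGetErrorString(err));
        std::abort();
    }
}

void record_on_nonblocking_stream()
{
    // Creating the stream as non-blocking tells the runtime that it never
    // needs to synchronize with the nullstream.
    hipStream_t stream;
    check_error(hipStreamCreateWithFlags(&stream, hipStreamNonBlocking));

    hipEvent_t event;
    check_error(hipEventCreateWithFlags(&event, hipEventDisableTiming));

    // The record should now skip the iteration over the set of created streams.
    check_error(hipEventRecord(event, stream));
}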

iassiour avatar Nov 26 '23 23:11 iassiour

Thanks @iassiour for the quick response.

The way this is currently implemented, the runtime iterates through the whole set of created streams in order to find the right stream to synchronise with (the nullstream in this case).

That's a bit unfortunate, but not much can be done about it for released versions. Do you think there's a chance this could be fixed to be a constant-time operation in future releases (and if so, should I open a separate issue for that somewhere)? I understand this may have been sufficient for most use cases, but given a system with 8 GPUs and, say, 16 streams per GPU, you're already up to 128 streams. We're lucky in this case to use only one MPI process per GPU, so we can limit the number of streams, but it would be interesting to explore making use of all GPUs on a node from a single process in the future, and then I can see this becoming a bottleneck again.

To confirm that this is the case, and as a possible workaround, can you try to use a NonBlocking stream with hipEventRecord, i.e. create the stream with: check_error(hipStreamCreateWithFlags(&stream, hipStreamNonBlocking));

I'll give this a try, thanks for the hint! We actually had that enabled before, but had to disable it due to issues with rocBLAS. I'm hoping we don't see those issues anymore.

Any ideas about what's going on with hipDeviceSynchronize and just launching a kernel?

msimberg avatar Nov 28 '23 08:11 msimberg

Hi @msimberg

Do you think there's a chance this could be fixed to be a constant time operation in future releases (and if so, should I open a separate issue for that somewhere)?

I have raised an internal ticket to track the issue. We will work towards optimising this use case for a future release.

Any ideas about what's going on with hipDeviceSynchronize and just launching a kernel?

Launching the kernel has the side effect of creating the nullstream at that point. If the nullstream exists, the search during hipEventRecord stops earlier; otherwise it continues to the end of the streamSet.

Calling hipDeviceSynchronize alone (without the kernel) does not seem to have an effect in my tests. I do see that the last 3 iterations are marginally faster, perhaps because of cache effects, but when running the test with more iterations it seems that changing the synchronize_index does not make any difference.
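So, on the affected releases, making sure the nullstream exists before the event recording starts should have the same effect as the dummy kernel in your test, e.g. (a sketch, assuming the behaviour described above; not a guaranteed workaround):

#include <hip/hip_runtime.h>

__global__ void warmup_kernel() {}

// Launching a trivial kernel on the default stream creates the nullstream,
// after which the stream search in hipEventRecord can stop earlier.
void create_nullstream_early()
{
    hipLaunchKernelGGL(warmup_kernel, dim3(1), dim3(1), 0, 0);
    hipDeviceSynchronize();
}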

iassiour avatar Nov 29 '23 01:11 iassiour

@msimberg Do you still need assistance with this ticket? If not, please close the ticket. Thanks!

ppanchad-amd avatar Apr 29 '24 15:04 ppanchad-amd

@ppanchad-amd has something changed in HIP that affects this behaviour? It'd be interesting to be notified through this issue once the fix from the internal ticket mentioned by @iassiour has made it into a release.

msimberg avatar Apr 30 '24 06:04 msimberg

Hi @msimberg a fix for this has been added with this change: https://github.com/ROCm/clr/commit/8bdda9007104ddd90f8a77955821d2c765979ad3

It has made it into the 6.1 release. Could you please confirm that it resolves the issue?

iassiour avatar May 29 '24 12:05 iassiour

Thank you @iassiour for the update! I'll indeed give it a go and report back. I'll close this for now with the assumption that it's fixed.

msimberg avatar May 29 '24 15:05 msimberg