Add stream operations to accelerator components
First batch of changes from https://github.com/open-mpi/ompi/pull/12318 (offloading of reductions to devices).
This PR adds stream operations to the accelerator components:
- ~~Default stream~~
- Stream-based alloc and free
- Stream-based memmove
- Wait for stream to complete
Also, enable querying for the number of devices and the memory bandwidth of a device.
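To give reviewers a rough idea of the shape of the new entry points, here is a sketch in C. The names and signatures below are illustrative shorthand only, not the exact interface in the diff:

```c
/* Illustrative sketch only: hypothetical names and signatures for the new
 * stream-oriented accelerator entry points; see the diff for the real ones. */
#include <stddef.h>

struct opal_accelerator_stream;   /* opaque stream handle */

/* stream-based alloc and free */
typedef int (*accel_mem_alloc_stream_fn_t)(int dev_id, void **ptr, size_t size,
                                           struct opal_accelerator_stream *stream);
typedef int (*accel_mem_release_stream_fn_t)(int dev_id, void *ptr,
                                             struct opal_accelerator_stream *stream);

/* stream-based memmove */
typedef int (*accel_mem_move_async_fn_t)(int dest_dev, int src_dev, void *dest,
                                         const void *src, size_t size,
                                         struct opal_accelerator_stream *stream);

/* wait for all work enqueued on a stream to complete */
typedef int (*accel_wait_stream_fn_t)(struct opal_accelerator_stream *stream);

/* device count and memory bandwidth queries */
typedef int (*accel_num_devices_fn_t)(int *num_devices);
typedef int (*accel_get_mem_bandwidth_fn_t)(int dev_id, float *bandwidth_gbs);
```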
This PR is missing implementations for the ze component because I haven't had time to dig into that. Maybe someone familiar with that API can contribute the implementation? Otherwise I will need to find some time in the coming week(s) to implement them myself (the ze component didn't exist when I made these changes).
Could you please add stub functions for the ZE component?
@hppritcha I added stubs returning OPAL_ERR_NOT_IMPLEMENTED to the ze component.
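Each stub is just the interface entry point returning the not-implemented error code, roughly like this (the function name and signature here are hypothetical, for illustration only):

```c
/* Hypothetical ZE stub shape; the real names/signatures follow the
 * accelerator module interface.  Assumes the OPAL constants header
 * provides OPAL_ERR_NOT_IMPLEMENTED. */
static int accelerator_ze_mem_alloc_stream(int dev_id, void **ptr, size_t size,
                                           struct opal_accelerator_stream *stream)
{
    (void) dev_id; (void) ptr; (void) size; (void) stream;
    return OPAL_ERR_NOT_IMPLEMENTED;
}
```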
@devreal could you rebase on top of the head of main to pull in the CI ZE compiler sanity check?
I don't know the motivation behind the need for stream-ordered allocations for reductions. Can you explain the high-level picture?
How does an MPI user pass streams to the MPI implementation? Without that ability, I'm not sure how the stream-ordered memmove operations in the accelerator component can be taken advantage of.
Hi @devreal, no more comments on the PR changes, but I didn't get an answer to the questions above. Apologies if I've missed the response somewhere.
@Akshay-Venkatesh Sorry, I only replied to your comments inline.
> I don't know the motivation behind the need for stream-ordered allocations for reductions. Can you explain the high-level picture?
For reduction operations, we may need to allocate temporary device buffers. Maybe we can work around that eventually, but it probably doesn't hurt to have that functionality available.
> How does an MPI user pass streams to the MPI implementation? Without that ability, I'm not sure how the stream-ordered memmove operations in the accelerator component can be taken advantage of.
There is no infrastructure for that yet (aside from some research projects). We can take advantage of it in the collective reduction operations though.
The context is in the referenced PR (#12318) but that's a monster so it might not have been clear.
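To make the intended use a bit more concrete, here is the pattern in plain CUDA runtime terms (only a sketch of what the accelerator component wraps for the reduction path, not the actual Open MPI code):

```c
/* Sketch of the reduction pattern in plain CUDA runtime terms: allocate a
 * temporary device buffer in stream order, stage and reduce the data on the
 * same stream, release the buffer in stream order, then wait once. */
#include <cuda_runtime.h>
#include <stddef.h>

int reduce_on_device(void *inout, const void *in, size_t bytes,
                     cudaStream_t stream)
{
    void *tmp = NULL;

    /* stream-ordered allocation of the temporary buffer */
    if (cudaMallocAsync(&tmp, bytes, stream) != cudaSuccess) {
        return -1;
    }

    /* stage the incoming data on the same stream */
    cudaMemcpyAsync(tmp, in, bytes, cudaMemcpyDefault, stream);

    /* ... enqueue the reduction kernel on `stream`, combining tmp into
     *     inout (kernel omitted in this sketch) ... */

    /* stream-ordered release, then a single wait for everything enqueued */
    cudaFreeAsync(tmp, stream);
    return (cudaStreamSynchronize(stream) == cudaSuccess) ? 0 : -1;
}
```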
Based on the content of this PR, I cannot figure out why this functionality is needed. What is the grand scheme in which we need to:
> expose the memory bandwidth of a device?
That is needed to decide whether to fetch the data to the host and perform the operator there or to submit a kernel. This will be part of a separate PR that has yet to come. I was hoping to get the basics in first, but I can post all the PRs so we can synchronize across them.
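As a rough illustration of that decision (the cost model, names, and parameters below are hypothetical):

```c
/* Hypothetical cost model: the host path pays for moving the data over the
 * host link (and the result back), the device path pays a kernel launch
 * latency plus a pass over device memory.  All names/constants are made up. */
#include <stdbool.h>
#include <stddef.h>

static bool reduce_on_host(size_t bytes, double device_mem_bw_gbs,
                           double host_link_bw_gbs, double kernel_launch_us)
{
    double device_cost = kernel_launch_us * 1e-6
                       + (double) bytes / (device_mem_bw_gbs * 1e9);
    double host_cost   = 2.0 * (double) bytes / (host_link_bw_gbs * 1e9);
    return host_cost < device_cost;   /* small buffers tend to favor the host */
}
```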
> have an asynchronous memmove?
See https://github.com/open-mpi/ompi/pull/12570. We could do without it (i.e., use a blocking memmove instead), and I'm not even sure how often memmove is actually used there, but for the sake of completeness I thought it would be the right thing to include.
> have an explicit synchronization routine instead of relying on existing mechanisms (create / record / wait for an event to complete)?
See https://github.com/open-mpi/ompi/pull/12570. I'm not sure how useful events would be there; it sounds like more overhead to create an event than to simply synchronize the stream. This is used to submit multiple data transfers and synchronize once, instead of copying piecemeal.
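In plain CUDA runtime terms, the pattern is just several asynchronous copies enqueued on one stream followed by a single synchronization (again, a sketch of the pattern rather than the PR's code):

```c
/* Sketch: enqueue several asynchronous copies on one stream and wait once at
 * the end, instead of creating/recording/waiting on an event per transfer. */
#include <cuda_runtime.h>
#include <stddef.h>

int gather_fragments(void *dst, void *const *frags, const size_t *sizes,
                     int nfrags, cudaStream_t stream)
{
    char *out = (char *) dst;
    for (int i = 0; i < nfrags; i++) {
        cudaMemcpyAsync(out, frags[i], sizes[i], cudaMemcpyDefault, stream);
        out += sizes[i];
    }
    /* one wait covers all of the copies enqueued above */
    return (cudaStreamSynchronize(stream) == cudaSuccess) ? 0 : -1;
}
```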
It would be good if we could have this PR merged in the near future, since this would simplify evaluating/testing the subsequent PRs.
Will merge after the conflicts are resolved.
Rebased and fixed conflicts.