
opal/accelerator: Initial accelerator framework implementation

Open wckzhang opened this issue 3 years ago • 60 comments

Add new accelerator framework. Initial assumption is that only one accelerator component will be selected per node.

  • Added a cuda component
  • Added a null component (no devices, performs no-ops/errors)
  • Replaced a large chunk of common_cuda usage (it's not all unused, but it's mostly unused)
  • Most likely broke ROCm usage until the ROCm component is written
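The null component described above can be sketched as a small function table whose hooks report zero devices and fall back to plain host memory operations. This is an illustrative sketch only; the names do not match the actual opal/accelerator API:

```c
#include <stddef.h>
#include <string.h>

/* Illustrative accelerator vtable; the real framework exposes many more hooks. */
typedef struct {
    int (*num_devices)(void);
    void *(*mem_copy)(void *dst, const void *src, size_t size);
} accel_module_t;

/* Null component: reports no devices and degrades to a host memcpy. */
static int null_num_devices(void)
{
    return 0;
}

static void *null_mem_copy(void *dst, const void *src, size_t size)
{
    return memcpy(dst, src, size);
}

static const accel_module_t accel_null_module = {
    .num_devices = null_num_devices,
    .mem_copy    = null_mem_copy,
};
```

The point of the null component is that callers never need to branch on "is an accelerator present": they always go through the table, and the no-device case still performs correct host-side copies.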

Signed-off-by: William Zhang [email protected]

wckzhang avatar Mar 04 '22 00:03 wckzhang

The IBM CI (GNU/Scale) build failed! Please review the log, linked below.

Gist: https://gist.github.com/45dd1add3d5625710fadc0c4315ac2fe

ibm-ompi avatar Mar 04 '22 00:03 ibm-ompi

The IBM CI (XL) build failed! Please review the log, linked below.

Gist: https://gist.github.com/6e2893f0ac301fcd63b998d1e888efca

ibm-ompi avatar Mar 04 '22 00:03 ibm-ompi

The IBM CI (PGI) build failed! Please review the log, linked below.

Gist: https://gist.github.com/96d9cf89dde7cf1eba0c47721777ae58

ibm-ompi avatar Mar 04 '22 00:03 ibm-ompi

:+1: I have a proposal, let's drop the memcpy functions and use a vector_copy instead, defined as (count, blen, stride). This is the most basic type in the datatype, and any contiguous memory can easily be translated into this (stride = blen or something). For all the accelerators that support vector copy (e.g. cudaMemcpy2D) there will be a significant gain in performance. For everybody else it will not really matter, the vector_copy can simply do a loop around a memcpy.
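The fallback path for accelerators without a native 2D copy is exactly the loop described above. A minimal host-side sketch, with illustrative names rather than the real framework API:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical fallback vector copy: move `count` blocks of `blen` bytes,
 * with consecutive blocks spaced `stride` bytes apart in both source and
 * destination. Contiguous memory is the degenerate case stride == blen.
 * Accelerators with native support (e.g. cudaMemcpy2D) would replace this
 * loop with a single strided-copy call. */
static void vector_copy_fallback(void *dst, const void *src,
                                 size_t count, size_t blen, size_t stride)
{
    char *d = (char *) dst;
    const char *s = (const char *) src;
    for (size_t i = 0; i < count; i++) {
        memcpy(d + i * stride, s + i * stride, blen);
    }
}
```

With this shape, a plain contiguous memcpy is `vector_copy(dst, src, 1, size, size)`, so dropping the dedicated memcpy entry point loses no generality.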

bosilca avatar Mar 04 '22 17:03 bosilca

@bosilca, I like the idea, but I am wondering whether we may need both: a regular memcpy and a vector_copy operation. I am thinking about integration with e.g. UCX; I'm not sure how easy a change to vector_copy would be at that level.

edgargabriel avatar Mar 04 '22 19:03 edgargabriel

Edgar, can't you just manually unroll the vector copy into single memcpys?

bwbarrett avatar Mar 04 '22 19:03 bwbarrett

@bwbarrett I am more worried about the fact that vector_copy will have a different signature than the memcpy function, and if e.g. UCX has to be adjusted to use this interface, it can't be compiled against a previous version of ompi.

edgargabriel avatar Mar 04 '22 19:03 edgargabriel

Actually, I take it back; maybe it's not an issue. They are invoking pml_ucx_generic_datatype_pack/unpack, which then calls opal_convertor_pack/unpack, and that interface would probably not change.

edgargabriel avatar Mar 04 '22 19:03 edgargabriel

While I understand the desire for comments, we shouldn't merge this PR until 1) the code changes to actually use it are in place and 2) there's at least one component. I'm not sure we can really be confident in the interface until we're at that point.

bwbarrett avatar Mar 08 '22 17:03 bwbarrett

Need to:

  • Rework the component selection logic
  • Replace code that calls common_cuda so that it calls the framework device-agnostically instead
  • Do more testing

wckzhang avatar Mar 17 '22 23:03 wckzhang

Looks like the PR build checker failed due to Jenkins losing its connection; it will fix itself on a re-run.

wckzhang avatar Mar 18 '22 01:03 wckzhang

I changed the datatype engine, the mtls, common/ompio, and osc/rdma to use the new framework. The remaining dependencies on common_cuda.h are in:

  • coll_cuda component
  • nbc_internal
  • pml ob1
  • pml ucx
  • btl smcuda
  • btl tcp
  • rcache gpusm module

I don't plan to change (for now)

  • coll_cuda component
  • rcache gpusm module
  • pml ucx (Without feedback from someone with stake in ucx)

I think I'll remove the dependency in nbc_internal before I merge the patch. The remaining work is more in-depth changes in the ob1/btl layer.

wckzhang avatar Apr 07 '22 23:04 wckzhang

I think you need to fix nbc, ob1, and tcp before merging (i.e., all the non-cuda-specific components). That it's hard is precisely the reason to do the changes now. We need to be 100% sure that we have the right interface.

bwbarrett avatar Apr 08 '22 02:04 bwbarrett

Added new asynchronous APIs, Seth added a null component, rewrote the asynchronous progress engine in the ob1 pml, and replaced all the OPAL_CUDA_SUPPORT ifdefs in the (non-cuda) pml and btls.

I'm pretty sure that things have broken for ROCm, since a lot of conflicts appeared during the rebase. I didn't try fixing them, as they will get addressed when Edgar writes the ROCm component.

wckzhang avatar May 16 '22 22:05 wckzhang

I forgot to add that you have a PMIx submodule change in the PR for no apparent reason; that's almost certainly a mistake.

bwbarrett avatar May 18 '22 00:05 bwbarrett

The failures look real; the datatype unit tests are failing with a segmentation fault.

bwbarrett avatar May 18 '22 00:05 bwbarrett

Failures are real, first failing commit: pml/ob1: Replace cuda code with accelerator code

a-szegel avatar May 18 '22 16:05 a-szegel

Initial Comment Cleanup: https://github.com/wckzhang/ompi/pull/9

a-szegel avatar May 18 '22 17:05 a-szegel

Would it be a win to have an accelerator.rst explaining the design, i.e., streams and events?

Sorry, I meant to say: would an introduction in accelerator.h help?

tschuett avatar May 18 '22 17:05 tschuett

The failures look real; the datatype unit tests are failing with a segmentation fault.

Yeah, I think I messed something up in my pml ob1 code conversion; I'll fix that.

wckzhang avatar May 18 '22 20:05 wckzhang

Would it be a win to have an accelerator.rst explaining the design, i.e., streams and events?

Sorry, I meant to say: would an introduction in accelerator.h help?

I'll add some more help text.

wckzhang avatar May 18 '22 20:05 wckzhang

I noticed that the accelerator components' cleanup code never gets called. This is another thing to verify is working before this patch goes in.

a-szegel avatar May 18 '22 22:05 a-szegel

Can I query the accelerator framework for the number of devices of the current process? Or is the assumption that it is either zero or one? Or do I have to query the underlying component?

tschuett avatar May 21 '22 19:05 tschuett

The accelerator framework doesn't fully isolate the application from the accelerator hardware. For CUDA, the user is responsible for setting up the CUDA context before initializing the framework (which tells the framework which accelerator to use). It would be best if we could fix this breach of encapsulation, but we would have to modify the MPI APIs, which is not an option right now.

There is a loose assumption of one device per process, and we are still ironing this out (we will probably remove dev_id from accelerator.h).

a-szegel avatar May 23 '22 16:05 a-szegel

TODOs:

  1. rm -rf opal/cuda -> move all usages of cuda in OMPI to the Accelerator Framework - IN PROGRESS
  2. ~~Figure out how to handle multiple Accelerators (of the same type) on an instance; does dev_id belong in the interface? - DONE~~
  3. ~~Fix the Converter interface instead of messing with the converter flags - DONE~~
  4. ~~Add a check capabilities interface to accelerator: bool check_capabilities(void *ptr, size_t size, uint32_t flags) + remove check_addr calls - Discarded~~
  5. ~~Figure out why accelerator cleanup code is never being executed - DONE~~
  6. ~~Make commits bisectable from a runtime perspective, i.e., fewer, larger commits - DONE~~
  7. ~~Fix the OFI MTL's configure.m4 - DONE~~
  8. ~~ompi/mca/mtl/ofi/mtl_ofi_component.c: the printf now always fires for providers that don't support FI_HMEM - DONE~~
  9. ~~Add documentation about event/stream behavior to the interface - DONE~~
  10. ~~opal/mca/accelerator/base/help-accelerator-base.txt: output a list of every component that failed to be created - DONE~~
  11. Remove accelerator_cuda_func_table
  12. ~~Add host memcpy & memmove functionality to the NULL component + remove check_addr calls - DONE~~
  13. ~~Change create_stream to not allocate - DONE~~
  14. ~~Decide if accelerator_cuda_is_ipc_enabled() stays in the interface - DONE~~
  15. Add the ROCm component (Implemented by Edgar)
  16. rm -rf opal/ROCm (Implemented by Edgar)
  17. ~~Rename handle functions to ipc_handle - DONE~~

wckzhang avatar May 25 '22 17:05 wckzhang

The IBM CI (GNU/Scale) build failed! Please review the log, linked below.

Gist: https://gist.github.com/3677ad9fc8679de2828bd1432ecf7eb8

ibm-ompi avatar Jun 01 '22 21:06 ibm-ompi

The IBM CI (PGI) build failed! Please review the log, linked below.

Gist: https://gist.github.com/272063d3dbd1389d04423973fde959a9

ibm-ompi avatar Jun 01 '22 21:06 ibm-ompi

The IBM CI (XL) build failed! Please review the log, linked below.

Gist: https://gist.github.com/3434b52b1662802e492cca83ade266b2

ibm-ompi avatar Jun 01 '22 21:06 ibm-ompi

  • What are the configure parameters to build OMPI with an accelerator component? --with-cuda=<path_to_cuda>
  • What are the runtime parameters needed to toggle between accelerator instances? Not sure if you mean between different components or between different devices. There's no runtime switching of devices from the MPI point of view (for now). CUDA contexts are set by the application (MPI has no good channel for learning GPU information from the application), and Open MPI performs its tasks under the current CUDA context.

Framework questions:

  • Can you confirm that smcuda will continue to work with an accelerator component? I will have to make some changes, but it should work after I do.
  • In the absence of smcuda, can cuda-MPI work with just the accelerator framework? If yes, which btl/mtl/pml combinations work? The ofi mtl plus pml cm combination has been tested by me and works. Beyond that, I can test the tcp btl, which would not work. One thing to consider is that smcuda was branched off the old sm component, which performs much worse than the new sm component (formerly known as vader). For long-term smcuda support, we should consider rewriting it to integrate with the new sm component, or at least re-branching it off the new component.

Some thoughts on extensions that don't need to be part of this PR, but which the component should allow for in the future:

  • Does the framework allow for multiple GPUs to be selected by a single MPI rank? I didn't see device IDs mentioned in the accelerator component function signatures or data structures. No; given the way CUDA contexts are set by the application (and ROCm functions similarly), switching devices didn't make much sense, as MPI has little information other than the current CUDA context. As far as I could tell, most cuda operations didn't need a specified device id. I originally had a device id in the API, but neither rocm nor cuda seemed to require it, and without a working implementation it was hard to imagine which APIs needed the extension. When the need arises, we can add device ids to the API.
  • Does the framework make assumptions that would not allow for stream extensions of MPI communication routines? There are some static streams set up for htod/dtoh transfers for internal MPI use, but there should be provisioning for user-provided streams/events.

There are functions for creating streams and events in the accelerator API, but they are only for MPI internals. Unfortunately, MPI doesn't allow for much communication with the user about streams/events. We'd have to add some info keys, but that would make it very Open MPI specific. There's a community trying to push for better information exchange in the MPI Forum.

About merging the PR for OMPI 5.0 from NVIDIA's testing viewpoint:

There isn't currently a plan to merge the PR into OMPI 5.0, unless 5.0 is extremely delayed (which is not in our best interests anyway).

  • The PR hasn't been tested with ob1 + smcuda, which is still used by users owing to its relative stability and as a point of comparison with other options.

My next update to this PR will update smcuda, and it will be properly tested. I can do some testing with tcp btl + smcuda configurations.

  • The PR hasn't been tested with PML UCX which is the most used component for cuda-MPI workloads.

I haven't tested with the UCX PML and probably won't be able to. I'm not too familiar with how the UCX PML's cuda functionality works. I looked at ompi_pml_ucx.cuda_initialized, but it seemed unused. I assume most of the cuda work happens under the covers in UCX itself; if you know more, can you elaborate?

wckzhang avatar Jun 23 '22 20:06 wckzhang

Is someone familiar with the Mellanox CI? The failure looks real, but I'm not sure what exactly it is.

wckzhang avatar Jul 05 '22 15:07 wckzhang