ompi icon indicating copy to clipboard operation
ompi copied to clipboard

Issues with CUDA accelerator component initialization

Open devreal opened this issue 2 years ago • 8 comments

We're working with the CUDA accelerator component and tried to rebase my somewhat outdated branch to current main. I believe I found an issue with the way the CUDA component is initialized: Since https://github.com/open-mpi/ompi/commit/ae98e0453252cbf026dbd784f618d320ef1066fc we call cuInit in accelerator_cuda_init but do not set a context. Then in every call to opal_accelerator_cuda_delayed_init henceforth (until the first call to a CUDA function by the application) we receive a NULL context from cuCtxGetCurrent and return an error (https://github.com/open-mpi/ompi/blob/main/opal/mca/accelerator/cuda/accelerator_cuda_component.c#L146). That prevents all other accelerator-related state in OMPI from properly initializing. On this particular system, at least smcuda (mca_btl_smcuda_accelerator_init) and ob1 (mca_pml_ob1_accelerator_init) do not enable accelerator support because they cannot create a stream, unless the application does call into CUDA before calling MPI_Init (because there will be a CUDA context in that case). Is this what we want?

Interestingly, before https://github.com/open-mpi/ompi/commit/ae98e0453252cbf026dbd784f618d320ef1066fc we would not return an error from opal_accelerator_cuda_delayed_init (because cuCtxGetCurrent returned an error code) and so the accelerator support would work properly.

I believe the same behavior exists in the 5.x release branch.

devreal avatar Jul 20 '23 20:07 devreal

I've been trying to use 5.0rc's on a cluster with CUDA devices and reported a bug previously about broken sm+ob1 support. How does CUDA support keep breaking in 5.0rc's? Is there CI for this?

BenWibking avatar Jul 21 '23 22:07 BenWibking

@BenWibking Is this the issue you are referring to? https://github.com/open-mpi/ompi/issues/10871

devreal avatar Jul 23 '23 09:07 devreal

@BenWibking Is this the issue you are referring to? #10871

I was referring to https://github.com/open-mpi/ompi/issues/11399, which is now fixed.

BenWibking avatar Jul 23 '23 15:07 BenWibking

@devreal I thought this was addressed with #11297 I don't think this is the intended behavior. The delayed_init was introduced specifically to avoid this scenario.

@BenWibking No CI atm, at least in external MTT.

janjust avatar Jul 27 '23 16:07 janjust

We did discuss this ticket in the last meeting and potential resolutions. The main reason for pr https://github.com/open-mpi/ompi/commit/ae98e0453252cbf026dbd784f618d320ef1066fc was the change to the coll/cuda component (i.e. coll/cuda being always compiled and wanting to make sure that it disqualifies itself if no GPUs are present). That change to coll/cuda has however not been ported to 5.0. So in theory if we want we can simply revert this pr on 5.0 without any consequences.

edgargabriel avatar Aug 04 '23 15:08 edgargabriel

Possibly the change I made to cuda_hmem_verify_devices() in this libfabric PR would be a helpful direction to explore: https://github.com/ofiwg/libfabric/pull/9170

qkoziol avatar Aug 08 '23 15:08 qkoziol

As long as any component requires the device component to be up during its initialization (like ob1) the delayed init of the CUDA component is useless (not sure how that ever worked). One fix would be to delay the comm component's device setup (i.e., streams) until we see the first communication. I guess this was the intent of https://github.com/open-mpi/ompi/pull/11253 but it missed the stream creation.

devreal avatar Aug 14 '23 20:08 devreal

Discussed on developers call today. @bosilca reminded us that this issue should really target milestone 5.0.1, as the delayed init code is in main but not v5.0.x branch. I'm changing the target tags accordingly.

lrbison avatar Aug 15 '23 15:08 lrbison

https://github.com/open-mpi/ompi/commit/ae98e0453252cbf026dbd784f618d320ef1066fc has been reverted in https://github.com/open-mpi/ompi/pull/12157.

Note that this change was only on main. v5.0.x does not have this issue.

wenduwan avatar Apr 24 '24 21:04 wenduwan