ucc icon indicating copy to clipboard operation
ucc copied to clipboard

Correct cleanup of internal resources associated with communicators

Open bosilca opened this issue 1 year ago • 3 comments

This issue is not entirely UCC related, but relates to UCC integration into an MPI library (in this case Open MPI).

Trigger: If an application calls MPI_Finalize without properly freeing all communicators there is no way to properly release UCC resources.

The root cause is the way UCC tracks OMPI communicators destruction, not by refcounting the collective referenced by the communicator but by adding an attribute (COMM_ATTR) on each communicator and expecting the attribute to be properly deleted (which will call into ucc_comm_attr_del_fn). This is the case when the user free a communicator via MPI_Comm_free, but the MPI standard does not required the release of attributes for non freed communicators in MPI_Finalize (at least not for attributes not associated with MPI_COMM_WORLD). The outcome is a disconnect between OMPI (where all communicators are released) and UCC where they are not.

A proper solution could be complicated to implement. It would require to refcount all the collectives used by OMPI, or to track all communicators internally and release them when the attribute deletion function (ucc_comm_attr_del_fn) is called on MPI_COMM_WORLD.

bosilca avatar Feb 15 '24 02:02 bosilca

@bosilca can we close this issue since https://github.com/open-mpi/ompi/pull/12429 was merged?

Sergei-Lebedev avatar May 31 '24 17:05 Sergei-Lebedev

Technically yes, but this is not completely over yet for UCC and HCOL. In open-mpi/ompi#12429 I have not removed the attribute on MPI_COMM_WORLD that both libraries are using to detect MPI_Finalize. I don't think it is needed anymore, but such a change must come from y'all.

bosilca avatar May 31 '24 19:05 bosilca

I see, let's keep this issue open then, thanks for update

Sergei-Lebedev avatar Jun 01 '24 09:06 Sergei-Lebedev