
Sessions: integration of PMIx process sets

Open sonjahapp opened this issue 3 years ago • 18 comments

Pull Request Description

This PR is a contribution from ParTec AG, Germany. It is originally developed and tested with ParaStation MPI as part of the European High Performance Computing Joint Undertaking DEEP-SEA.

The PR adds two enhancements for MPI Sessions:

  1. New infrastructure for process set management
  2. Integration of PMIx process sets
    • Make process sets accessible that are defined by the process manager, e.g., based on node attributes such as cluster or booster type
    • Receive resource updates at runtime in an MPI Session, e.g., a new process set with a different size for malleability operations

We thought about splitting the enhancements into two separate PRs. However, based on the experience of our internal review and due to the tight coupling, it makes more sense to keep the discussions in one place.

1. New infrastructure for process set management

The goals of this enhancement are:

  • Adding a new memory class MPL_MEM_SESSION for memory allocations related to MPI Sessions
  • Introducing an MPIR_Pset data structure that contains information about a process set including its URI, size, members and validity status
  • Providing each MPI Session with its own view of available process sets
  • Providing routines that enable the initialization, destruction, addition, and invalidation of process sets for multiple sources of process sets (e.g. MPI standard and process manager)
  • Providing routines that manage access to process sets (count, access via URI or index) independent of their source

2. Integration of PMIx process sets

Concept and approach

The PMIx 4.1 standard provides a concept to define and delete process sets in the PMIx Server (see Sec. 13.1). These process sets are exposed to PMIx clients in two ways: Either by queries (see Ch. 5) or via the event notification mechanism of PMIx (see Ch. 9) upon definition and deletion of a process set.

For scalability reasons, our proposed solution is based on the event notification mechanism of PMIx. We believe that the number of queries required to keep a consistent view of PMIx process sets would lead to large overhead in the PMIx server and (eventually) waiting times or inconsistencies on the client side.

Concurrency

The event-based approach comes with the challenge of concurrency between the MPI ranks and the event thread of the PMIx client library (one such thread per MPI process). Upon definition or deletion of a process set, the PMIx server emits an event. A client that has registered for the event invokes an event handler asynchronously on the event thread, which the PMIx client library starts during PMIx_Init.

To add a process set (define event) and to mark a process set as invalid (delete event) in MPI, the event handlers need to work on the global data structure MPIR_Process.pm_pset_array, which may be used by MPI ranks concurrently. Hence, there are critical sections that need to be protected regardless of whether MPICH is compiled as multi-threaded.

In future work, we plan to explore an alternative solution without the event thread in the PMIx client using external PMIx progress. First investigations in this direction led us to the conclusion that we should not mix such a change of PMIx progress management with the features of this PR.

Requirements and testing

We tested our solution with OpenPMIx and the ParaStation Management process manager (min version 5.1.54).

Remark 1: Please note that PMIx 4 is required to compile and use the proposed solution. Preprocessor conditionals prevent the PMIx process set solution from being compiled into the MPI library if PMIx 3 or PMI(2) is used. In these cases, only the MPI default process sets are available through the new process set infrastructure described as enhancement 1 above.

Remark 2: It is not sufficient to have a PMIx 4-compatible process manager to actually "see" PMIx process sets in MPI. The PM has to define the process sets explicitly using the respective API of the PMIx 4 standard. ParaStation Management defines some process sets using the naming prefix pspmix: as of version 5.1.54.

Remark 3: If PMIx-based process sets are provided by the process manager, the init/session_psets test will test MPI Session routines for them without any changes to the test.

Author Checklist

  • [X] Provide Description Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • [X] Commits Follow Good Practice Commits are self-contained and do not do two things at once. Commit message is of the form: module: short description Commit message explains what's in the commit.
  • [ ] Passes All Tests Whitespace checker. Warnings test. Additional tests via comments.
  • [X] Contribution Agreement For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your company's PR approval manager.

sonjahapp avatar Mar 03 '23 14:03 sonjahapp

@sonjahapp Is there any measurement on the impact of PMIx event threads, for example, in a fully subscribed situation?

hzhou avatar Mar 03 '23 17:03 hzhou

No, we do not have such measurements at the moment. But I do understand why you raise the question in the context of this PR. :-) It was the same for us at ParTec during the internal code review.

Our conclusion: We did not run into any issues with the PMIx event thread (for the thread function of OpenPMIx see here) up to this point and this PR is no reason to worry about it. The PMIx event (progress) thread is started in PMIx_Init(). This is not something that we add with this PR but it is already there if MPICH is used with PMIx.

The changes made by this PR let the PMIx event thread take action only when a process set event arrives. We expect process set events to be rare over the runtime of an application - otherwise something strange is happening in the process manager and/or PMIx server. For example, ParaStation Management currently defines process sets only at the start of an application. The only time the PMIx client's event thread invokes the process set callbacks in MPI is at application start - certainly outside any performance-critical code path.

In the future, when we have resource changes, e.g., due to malleability operations, there will be process set deletions and definitions during runtime. However, even then the occurrences of such resource changes are expected to be rare per application. We expect the impact of the PMIx event thread to be very small - even in situations where all processes are subscribed to the process set events.

We plan to look into the external progress option of PMIx (see PMIX_EXTERNAL_PROGRESS in the PMIx 4.1 standard) that works without the event thread. This requires some thought on how and where to trigger progress of the PMIx client "manually" by calling PMIx_Progress(). At the latest when we continue this work, measurements on the impact of PMIx progress will be required to better understand the constraints. Unfortunately, I cannot give you an ETA for this activity on our side... :innocent:

sonjahapp avatar Mar 06 '23 11:03 sonjahapp

@sonjahapp A few notes --

  • If only the "mpi://WORLD" and "mpi://SELF" need be specialized to per-session, we could just single them out directly in the code as currently in main, right?
  • What is the mechanism for "WORLD" and "SELF" to change between sessions? We need this in the picture to consider the solution.
  • With PMIx psets, are the only events "add" and "delete"?

hzhou avatar Mar 09 '23 16:03 hzhou

@hzhou

If only the "mpi://WORLD" and "mpi://SELF" need be specialized to per-session, we could just single them out directly in the code as currently in main, right?

Yes, this would be possible. We did not keep it like that, in order to have a more uniform indexing of the session's overall pset array across multiple pset sources. This makes it easier to add more default psets in the future, should the MPI standard define more.

What is the mechanism for "WORLD" and "SELF" to change between sessions? We need this in the picture to consider the solution.

The WORLD and SELF psets should change between two session inits if the number of processes has changed in the meantime, either through a shrink or an expansion of processes via the PM/scheduler. On the MPI side, the idea is to call MPIR_pmi_init() for each session and, in the future, store rank, size, etc. per session instead of using the global MPIR_Process structure. This will need some rework of the mpir_pmi module and other parts of the code. WIP...

With PMIx psets, are the only events "add" and "delete"?

Yes. To be precise, the PMIx standard names the process set events PMIX_PROCESS_SET_DEFINE and PMIX_PROCESS_SET_DELETE (see Section 13.1.1 of the PMIx 4.1 standard).

sonjahapp avatar Mar 10 '23 15:03 sonjahapp

@hzhou

If only the "mpi://WORLD" and "mpi://SELF" need be specialized to per-session, we could just single them out directly in the code as currently in main, right?

Yes, this would be possible. We did not keep it like that, in order to have a more uniform indexing of the session's overall pset array across multiple pset sources. This makes it easier to add more default psets in the future, should the MPI standard define more.

OK. Can we simplify this PR by removing the MPIR_Pset_array class and instead just use a global ut_array directly? mpi://WORLD is special, and I am not sure there is a benefit to using the actual pset struct versus a special case in code. For example, in most cases a world of a million processes can be represented with just a size and a rank, assuming a contiguous range. But with a general pset we cannot make that assumption and have to iterate over the actual members array for potential gaps. Thus, I think treating these "built-in" psets the same as generic psets will have a rather negative impact.

Now if we treat built-in psets as special, then we can simply set the members array in the MPIR_Pset struct to NULL. Then we can simply put the "world" and "self" psets in the global ut_array. The actual malleability feature will be implemented separately anyway, and the corresponding built-in pset struct does not need to be mutable.

hzhou avatar Mar 10 '23 15:03 hzhou

if we treat built-in psets as special, then we can simply set the members array in the MPIR_Pset struct to NULL

Well... we could do that. But this would not be a simplification of the code. We would need to add a special case for members == NULL wherever we use the MPIR_Pset structure internally. I understand your concern about large-scale world psets, but right now we are not sure how big the impact will really be. Psets will (most probably) not be parsed frequently - even with malleability in the future. That is why we decided to propose a solution that uses a uniform code structure in favour of special-case treatment of built-in psets. Should we find that we need to be more careful with the members list in the future, we could optimize based on concrete experience.

Then we can simply put the "world" and "self" psets in the global ut_array.

No, this will not work. As I explained above, the assumption is that world and self psets can change between two session inits. If we used a global pset array, such a change would not be possible per session. The only way to get rid of the two-level array would be to design a mechanism that tracks the existing sessions per process. We thought that this would go too far for now.

We can have a call to discuss this topic further. :-)

sonjahapp avatar Mar 13 '23 09:03 sonjahapp

Rebased this PR to current main.

@hzhou What is the status of this PR on your side? What are the remaining open points/ questions? Would you like to have a call to discuss?

sonjahapp avatar Apr 24 '23 10:04 sonjahapp

As we discussed offline, here are the action items:

  1. Remove the Pset array for world and self, instead, add corresponding variables to the session struct itself
  2. Use a single global pset array whose lifetime spans from init to atexit finalize.
  3. Describe the case of starting with 8 processes as a world, then creating a new session with one process (say rank 6) dropped and one new process joined.

hzhou avatar Jun 15 '23 15:06 hzhou

@hzhou I've revised the PR based on our discussion and the action items.

1. Remove the Pset array for `world` and `self`, instead, add corresponding variables to the session struct itself

Done.

2. Use a single global pset array whose lifetime spans from init to `atexit` finalize

Done. I refrained from using an atexit handler and instead use a finalize callback to free the global pset array and avoid problems with the memory checks on finalize. The global pset array is re-created upon re-init.

3. Describe the case of starting with 8 processes as a world, then creating a new session with one process (say rank 6) dropped and one new process joined.

Global ID mapping

MPI lib will use a PMIx Process Group to generate a global mapping from PMIx Process ID (namespace + rank) to MPI ranks. The mapping will be updated on every re-init (MPI_Session_finalize(sessionA) followed by MPI_Session_init(sessionB, ...)) by destroying the existing PMIx Process Group and constructing a new one collectively over all processes that are active after the re-init. The new PMIx Process Group will yield the new MPI rank and "job size" for every process.

This is currently work in progress at ParTec and NOT part of this PR. For this PR, we expect the PM to define PMIx pset(s) only at start-up and not during runtime. The steps described below will not work with the changes introduced by this PR. They are an outlook describing how psets will work with dynamic resource changes once the global ID mapping is in place and the PM defines/deletes psets during runtime as well.

Example

  • Notation for PMIx Process IDs: <namespace>_<rank>
  • a_3 means PMIx Process in namespace a with PMIx rank 3
  • PM defines a pset mypset that includes all processes across all namespaces

Starting conditions

| Psets of sessionA | value |
| --- | --- |
| mpi://WORLD | all ranks from 0 to 7 |
| mpi://SELF | sessionA->rank |
| pspmix/mypset_1 | PMIx: [a_0,a_1,a_2,a_3,a_4,a_5,a_6,a_7]<br>MPI: [0,1,2,3,4,5,6,7]<br>valid |

Step 1: Drop rank 6 / PMIx proc a_6

  • Application re-organizes data so that rank 6 can leave safely
  • MPI_Session_finalize(sessionA), rank 6 exits
  • MPI_Session_init(sessionB, ...) in all remaining ranks

| Psets of sessionB | value |
| --- | --- |
| mpi://WORLD | all ranks from 0 to 6 |
| mpi://SELF | sessionB->rank |
| pspmix/mypset_1 | PMIx: [a_0,a_1,a_2,a_3,a_4,a_5,a_6,a_7]<br>MPI: [0,1,2,3,4,5,6,7]<br>invalid |
| pspmix/mypset_2 | PMIx: [a_0,a_1,a_2,a_3,a_4,a_5,a_7]<br>MPI: [0,1,2,3,4,5,6]<br>valid |

Step 2: Add one new process

  • We start this step from sessionB as noted above
  • Assumption: Additional resources to start new process are available
  • MPI lib calls PMIx_Spawn as part of resource expansion procedure, new process is spawned in new namespace b

| Psets of sessionB | value |
| --- | --- |
| mpi://WORLD | all ranks from 0 to 6 |
| mpi://SELF | sessionB->rank |
| pspmix/mypset_1 | PMIx: [a_0,a_1,a_2,a_3,a_4,a_5,a_6,a_7]<br>MPI: [0,1,2,3,4,5,6,7]<br>invalid |
| pspmix/mypset_2 | PMIx: [a_0,a_1,a_2,a_3,a_4,a_5,a_7]<br>MPI: [0,1,2,3,4,5,6]<br>valid |
| pspmix/mypset_3 | PMIx: [a_0,a_1,a_2,a_3,a_4,a_5,a_7,b_0]<br>MPI: [0,1,2,3,4,5,6,7]<br>valid |

  • Application detects new Pset pspmix/mypset_3 in sessionB and starts adaptation
  • MPI_Session_finalize(sessionB)
  • MPI_Session_init(sessionC, ...)

| Psets of sessionC | value |
| --- | --- |
| mpi://WORLD | all ranks from 0 to 7 |
| mpi://SELF | sessionC->rank |
| pspmix/mypset_1 | PMIx: [a_0,a_1,a_2,a_3,a_4,a_5,a_6,a_7]<br>MPI: [0,1,2,3,4,5,6,7]<br>invalid |
| pspmix/mypset_2 | PMIx: [a_0,a_1,a_2,a_3,a_4,a_5,a_7]<br>MPI: [0,1,2,3,4,5,6]<br>valid |
| pspmix/mypset_3 | PMIx: [a_0,a_1,a_2,a_3,a_4,a_5,a_7,b_0]<br>MPI: [0,1,2,3,4,5,6,7]<br>valid |

  • Application continues with sessionC and re-organizes data to include the new process in the computation

sonjahapp avatar Jun 23 '23 12:06 sonjahapp

Thanks @sonjahapp . I drew a diagram for my understanding. Thus we will need a "global" (e.g. in MPIR_PMI) mapping facility to map from PMIx Process ID to world rank within a specific session. Each session needs a pointer to a PMIx pset as the session's world pset. Thus all the PMIx psets need to live globally outside the sessions as well. Within a session, all the psets should be a subset of this session's world pset -- could you confirm? Then the psets within a session can be simply represented using world ranks, but they can be filtered and translated from the corresponding psets, thus there is no need to maintain psets within a session other than for caching purposes.

Anyway, I think this understanding is compatible with the current PR.

hzhou avatar Jun 26 '23 16:06 hzhou

@hzhou Thanks for your comments and the sketch! Please find my comments below. Let me know if we should have another call to discuss.

Thus we will need a "global" (e.g. in MPIR_PMI) mapping facility to map from PMIx Process ID to world rank within a specific session.

The mapping of PMIx Process ID to MPI world rank is required at process level - not for a specific session - and needs to be re-done after each malleability operation. In the figure, please note that each session exists without the others. What you are sketching is a time progression from top to bottom:

  • from session 1 (t=1)
  • to session 2 (t=2, where session 1 no longer exists)
  • to session 3 (t=3, where neither session 1 nor session 2 exist)

The mapping of PMIx process ID to MPI world rank and the validity of all PMIx psets hold per process (not per session) at a defined point in time, and all MPI sessions existing at that point in time use the current mapping and can access the psets. If we do malleability as in your figure, we assume that all sessions existing at the time of the malleability operation must undergo the operation. The rank mapping of the processes will be updated as part of the malleability operation, and afterwards the session(s) can be re-inited based on the new rank mapping and resume their work.

Background: It may look too radical to enforce that all existing sessions undergo a malleability operation. However, to avoid potential error sources and strange side effects (e.g. due to different rank numbering, or processes appearing and disappearing in the sessions) when starting to experiment with malleability, we go with this restriction for now. Based on the experiences and results that we gather, we might partly relax this restriction in the future.

Proposal: I think (for now) we could refrain from putting rank and size into the MPIR_Session structure and continue to use MPIR_Process.rank and MPIR_Process.size to derive mpi://SELF and mpi://WORLD. This may need another look when we finally implement the malleability features but that's a topic for a future PR. :-)

Each session needs a pointer to a PMIx pset as the session's world pset.

Why? Which psets are provided via PMIx is PM-dependent. Of course it would make sense to have a pset including all processes provided by the PM, but I see no strict requirement for this.

Thus all the PMIx psets need to live globally outside the sessions as well.

PMIx psets live on process level because that is the level on which PMIx defines and manages them. Any other solution would significantly increase the complexity of PMIx pset management on MPI side.

Within a session, all the psets should be a subset of this session's world pset -- could you confirm?

Yes, if you are referring to mpi://WORLD pset. All valid psets known to a session should be a subset of the mpi://WORLD pset known to the session and use the currently valid MPI world rank mapping.

Then the psets within a session can be simply represented using world ranks, but they can be filtered and translated from the corresponding psets, thus there is no need to maintain psets within a session other than for caching purposes.

Not sure if I understand the translation part. But I agree that we do not need to maintain psets within a session; mpi://SELF and mpi://WORLD can be derived from the world rank mapping and the PMIx psets live on process level anyway.

sonjahapp avatar Jun 29 '23 13:06 sonjahapp

Let me know if we should have another call to discuss.

Yes, let's schedule another call.

hzhou avatar Jun 29 '23 14:06 hzhou

Had an offline discussion, and here are the notes:

  • Malleability - dropping process - only happens after a process called MPI_Session_finalize and exit. Because MPI_Session_finalize is a collective call, there won't be an overlapping session that contains the dropped process afterward. This is critical to guarantee we'll have consistent views even with a single global pset array.
  • Malleability - adding process - only happens via MPI_Comm_spawn. Thus, new sessions won't have a world pset that includes processes spanning multiple PMIx namespaces, i.e. it won't include spawned processes. NOTE: I think we will need a way to spawn processes into an existing namespace to be feature complete.
  • We decided to have a global, grow-only pset array that all sessions will share. There will be invalid psets due to terminated processes. This is communicated to the application via pset info hints. It is okay to have an empty global pset array if the PM doesn't support psets.
  • There will be a mapping layer translating the pset members from "namespace-rank" into "session_rank"
  • For this PR, we'll add new .c file maintaining the global pset array. We'll add and expose utility functions to access the global pset data, thus do not expose the internal data structure.

hzhou avatar Jul 03 '23 15:07 hzhou

Rebased on main and adapted to new mpir_pmi structure.

sonjahapp avatar Jul 20 '23 13:07 sonjahapp

This PR was asleep for some time. I've rebased it on main.

Is there still interest in getting the PMIx pset support merged into MPICH?

If yes, I think we should discuss (offline?) how to deal with the PMIx event handler registration and the PMIx process set implementation in the context of MPICH's embedded PMIx implementation.

I would appreciate a short feedback on this. Thanks! :)

sonjahapp avatar Apr 02 '24 12:04 sonjahapp