MPI_Comm_create_from_group works incorrectly when used with non-overlapping groups but same stringtag argument
Turns out the way MPI_Comm_create_from_group uses the PMIx_Group_construct method doesn't correctly handle the case where multiple MPI processes invoke MPI_Comm_create_from_group with different, non-overlapping MPI groups but the same stringtag argument.
The wording in the MPI-4 standard does not indicate this is incorrect:
If a nonempty group is specified, then all MPI processes in that group must call
the function and each of these MPI processes must provide the same arguments, including
a group that contains the same members with the same ordering, and identical stringtag
value.
@dalcini
I'm marking this as a blocker because the fix will likely involve changing a parameter in the mpi.h.in header file, which in any case currently specifies an incorrect value for the maximum allowed stringtag length.
So you are interpreting that statement as only requiring that all processes calling the function must pass the same stringtag, but not that the stringtag must be unique - yes? That strikes me as an oversight in the standard that needs to be addressed as it frankly makes no sense. Why "tag" a group if it isn't unique? What possible purpose could the tag serve in that case?

The stringtag is not about identifying the group, because otherwise the MPI standard would be standardizing how external software has to name its groups. Instead, the stringtag is an identifier for the operation on this particular group. To understand why this is necessary, one must look at the API of the call and notice that it does not take a base communicator, which means there is no dedicated communication channel for this operation; the only way the processes can order concurrent calls to this function is by uniquely identifying them with the tuple (group, tag). Without taking both into account you will either prevent concurrent calls on overlapping groups, or allow unordered concurrent calls with the same group.
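To make the point concrete, here is a minimal C sketch (the tags and function name are invented for illustration) of two constructions over the same group that are distinguished only by their stringtags:

```c
#include <mpi.h>

/* Hypothetical sketch: two communicator constructions over the same group,
 * possibly issued concurrently (e.g. from different threads under
 * MPI_THREAD_MULTIPLE).  With no parent communicator to provide an ordered
 * channel, the distinct stringtags are what allow every participant to tell
 * the two constructions apart. */
void build_two_comms(MPI_Group group, MPI_Comm *comm_a, MPI_Comm *comm_b)
{
    MPI_Comm_create_from_group(group, "org.example.app.phase-a",
                               MPI_INFO_NULL, MPI_ERRORS_RETURN, comm_a);
    MPI_Comm_create_from_group(group, "org.example.app.phase-b",
                               MPI_INFO_NULL, MPI_ERRORS_RETURN, comm_b);
}
```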
This goes back to the point above about process sets. We have a blurb in the MPI-4 standard, chapter 11, that states
If two MPI processes get the same process set name, then the intersection of the two
process sets shall either be the empty set or identical to the union of the two process sets.
The mpi://SELF process set is an example of the former case in the above statement, so I could see how a programmer may interpret that to mean there's no reason for the stringtag to be unique if the group handle argument to MPI_Comm_create_from_group came from non-overlapping process sets.
At any rate, I plan to fix this by appending the vpid of the first proc in the group to the supplied stringtag before handing it over to PMIx_Group_construct. Thank goodness we retained the verbiage about restrictions on the group argument concerning the order of processes in the group.
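A rough sketch of that derivation, assuming the vpid of the first proc is available as a plain integer (the helper name and output format are hypothetical, not the actual OMPI code):

```c
#include <stdio.h>
#include <stdint.h>

/* Rough sketch of the proposed derivation: append the vpid of the first
 * process in the group to the user-supplied stringtag, and hand the combined
 * string to PMIx_Group_construct as the PMIx group identifier. */
static void derive_pmix_grp_id(const char *stringtag, uint32_t first_vpid,
                               char *grp_id, size_t len)
{
    /* e.g. "org.example.app.tag" + vpid 6 -> "org.example.app.tag:6" */
    snprintf(grp_id, len, "%s:%u", stringtag, (unsigned)first_vpid);
}
```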
Thank goodness we retained the verbiage about restrictions on the group argument concerning the order of processes in the group.
What would have been the order of processes in the new communicator otherwise?
The stringtag is not about identifying the group,
I see - thanks for clarifying!
Any rate, I plan to fix this by appending the vpid of the first proc in the group to the supplied stringtag
I grok what you are saying, but that still doesn't result in a unique PMIx group name since the proc might be involved in multiple MPI_Comm_create_from_group calls and there is no guarantee that the user will only put that proc once in the first position, is there?
Couple of responses/questions, but I'll post them separately.
What would have been the order of processes in the new communicator otherwise?
I have to agree with this question. If a collection of procs calls this function using arbitrary ordering, then it isn't clear to me what the resulting communicator "rank" should be for the participants. If we follow typical MPI logic, it would be the position of the proc in the provided array - but that would differ across the collection. So it would seem that the ordering must be the same if you want to result in a usable MPI communicator, yes?
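If the ordering is indeed the same everywhere, then (as I read the standard) the rank in the new communicator is simply the calling process' position in the supplied group. A small hypothetical check (function and tag name invented):

```c
#include <mpi.h>
#include <assert.h>

/* Illustration of the ordering requirement being discussed: when every
 * caller passes the group with the same member ordering, the rank in the
 * resulting communicator is the calling process' position in that group. */
void check_rank_matches_group_position(MPI_Group group)
{
    MPI_Comm newcomm;
    int grp_rank, comm_rank;

    MPI_Comm_create_from_group(group, "org.example.app.check-order",
                               MPI_INFO_NULL, MPI_ERRORS_RETURN, &newcomm);
    MPI_Group_rank(group, &grp_rank);
    MPI_Comm_rank(newcomm, &comm_rank);
    assert(grp_rank == comm_rank);  /* position in group == rank in new comm */
    MPI_Comm_free(&newcomm);
}
```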
@hppritcha I wonder if we are getting caught in a confusion caused by overloading of the "group" term. Are you equating a given MPI communicator with a specific PMIx group? If so, that has some implications that might not be acceptable to the MPI community. The biggest difference is that we have unique string "names" for a group that help us distinguish between them, even if they have the same membership. You lack that in your communicators and I'm not sure how you'd add it.
If two MPI processes get the same process set name, then the intersection of the two process sets shall either be the empty set or identical to the union of the two process sets.
If this refers to, or attempts to correlate with, PMIx process sets, then that wouldn't be correct. Because of the dynamic process set support, a process (in PMIx parlance) doesn't belong to just one process set. So two processes could find themselves members of the same process set, but that set could be completely orthogonal to any other set to which they belong.
Again, it may be that the terminology unfortunately overlaps and causes confusion. I'm only commenting out of concern regarding what PMIx can and cannot support, wanting to ensure that we do support whatever you are attempting to do.
@rhc54 I'm getting sidetracked today; I'll answer your questions here tomorrow. We really are only using PMIx groups as a rendezvous mechanism, not as something for tagging comms, groups, etc.
Gotcha - will look forward to your answers when you have time. FWIW, I attempted to clear up some terminology confusion here (https://github.com/open-mpi/ompi/pull/10886#issuecomment-1269027589) that might be relevant in this topic as well.
@rhc54 the only connection that the MPI communicator generated from MPI_Comm_create_from_group has to a PMIx group is the PMIX_GROUP_CONTEXT_ID we got back from the PMIx_Group_construct call. We use the PMIx group constructor/destructor much like in the PMIx example group_lcl_cid.c, so by the time we return from the call, the PMIx group has been destructed. I think as long as we're using PMIx groups in this way we shouldn't have a problem.
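For reference, a sketch of that rendezvous pattern, modeled loosely on the group_lcl_cid.c example rather than the actual OMPI code (error handling trimmed, helper name invented):

```c
#include <string.h>
#include <stdbool.h>
#include <pmix.h>

/* Construct a PMIx group just long enough to obtain a context ID for the
 * new communicator, then destruct it. */
static int rendezvous_for_context_id(const char *grp_id,
                                     const pmix_proc_t *procs, size_t nprocs,
                                     size_t *context_id)
{
    pmix_info_t info;
    pmix_info_t *results = NULL;
    size_t nresults = 0, n;
    bool assign = true;
    pmix_status_t rc;

    /* ask the runtime to assign a context ID to this group */
    PMIX_INFO_LOAD(&info, PMIX_GROUP_ASSIGN_CONTEXT_ID, &assign, PMIX_BOOL);
    rc = PMIx_Group_construct(grp_id, procs, nprocs, &info, 1, &results, &nresults);
    PMIX_INFO_DESTRUCT(&info);
    if (PMIX_SUCCESS != rc) {
        return -1;
    }

    /* pull the assigned context ID (a size_t) out of the results */
    for (n = 0; n < nresults; n++) {
        if (0 == strcmp(results[n].key, PMIX_GROUP_CONTEXT_ID)) {
            *context_id = results[n].value.data.size;
        }
    }
    PMIX_INFO_FREE(results, nresults);

    /* the PMIx group is no longer needed once the context ID is known */
    return (PMIX_SUCCESS == PMIx_Group_destruct(grp_id, NULL, 0)) ? 0 : -1;
}
```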
Note we do have an advice to users that reads thus:
Advice to users. The stringtag argument is used to distinguish concurrent communicator
construction operations issued by different entities. As such, it is important
to ensure that this argument is unique for each concurrent call to
MPI_COMM_CREATE_FROM_GROUP. Reverse domain name notation convention [1]
is one approach to constructing unique stringtag arguments. See also example 11.10.
(End of advice to users.)
We may want to beef this up some for MPI 4.1.
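For illustration, stringtags following that convention might look like this (names invented):

```c
/* Hypothetical stringtags using reverse domain name notation; each
 * concurrent construction gets its own tag. */
const char *tag_solver = "org.example.libsolver.row_comm";
const char *tag_io     = "org.example.libio.staging_comm";
```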
Ah, okay - that should work then so long as you don't have two concurrent calls that include the same vpid in the first position, as that would result in conflicting group IDs. Still noodling over possible solutions - or is that an illegal scenario under MPI? I wonder if OMPI couldn't just generate a unique tag internally, maybe using some thread-safe internal counter that you integrate into the tag - e.g., "create-from-group-N" for the nth call?
That approach may be the way to go. We do that already for the MPI_Intercomm_create_from_groups method.
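A sketch of what such an internally generated tag could look like, using a C11 atomic counter (helper name and format are invented; as noted further down, a purely local counter only works if every participant reaches the same count for the same construction):

```c
#include <stdatomic.h>
#include <stdio.h>

/* Process-local, thread-safe counter folded into the tag handed to the
 * PMIx layer, e.g. "mytag:create-from-group-3" for the fourth call. */
static _Atomic unsigned long cfg_call_count = 0;

static void make_internal_tag(const char *user_tag, char *out, size_t len)
{
    unsigned long n = atomic_fetch_add(&cfg_call_count, 1UL);
    snprintf(out, len, "%s:create-from-group-%lu", user_tag, n);
}
```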
Got clarification from @rhc54 about the potential max length of the group string argument to PMIx_Group_construct. It seems to be a mistake in some older versions of the PMIx standard. Removing the blocker status here as there will be no need to update mpi.h.in.
@hppritcha Sorry, I missed this issue, I was traveling. BTW, my GH username is @dalcinl, with final "L" (as in Lima) , not "I" (as in India). Let me know once you have PR to try out your fix.
@hppritcha Been thinking about it some more and the indexing approach to the stringtag passed to PMIx_Group_construct isn't going to work. The problem is that all the procs must call your MPI function, and they all must pass the same string tag to PMIx_Group_construct so we can properly perform the rendezvous. The index isn't guaranteed to be the same across all procs, so that doesn't work.
What I would suggest is that you use the procs themselves as the tag. For example, suppose that you have ranks 0,1,6 from nspace foo and ranks 2,8 from nspace bar executing your comm_create call. Then a tag like this would work:
cfg:foo:016:bar:28
I only added the colons for clarity - no reason to include them. Remember, the tag doesn't have to be parsable - it just has to be unique. I put a symbol identifying the MPI operation to separate this from the same procs participating in some other MPI operation.
Note that this also assumes the ordering of procs in the call is consistent across the procs calling the function - but I think MPI is going to require that anyway, as per the above discussion. It also assumes that the same procs cannot engage in multiple concurrent calls to the same MPI operation - not sure if that is correct as perhaps different threads from within the same procs could execute it?
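A rough sketch of that tag construction (helper name invented; assumes the participants are already available as a pmix_proc_t array in a consistent order across all callers):

```c
#include <stdio.h>
#include <pmix.h>

/* Build a tag from an operation prefix ("cfg") plus every participant's
 * nspace and rank, e.g. "cfg:foo:0:foo:1:foo:6:bar:2:bar:8" for ranks 0,1,6
 * of nspace foo and ranks 2,8 of nspace bar.  The separators are only for
 * readability; the tag just has to be unique, not parsable. */
static void build_proc_based_tag(const pmix_proc_t *procs, size_t nprocs,
                                 char *out, size_t len)
{
    size_t n;
    size_t used = (size_t)snprintf(out, len, "cfg");

    for (n = 0; n < nprocs && used < len; n++) {
        used += (size_t)snprintf(out + used, len - used, ":%s:%u",
                                 procs[n].nspace, (unsigned)procs[n].rank);
    }
}
```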