
Proposed changes to the mapping/rank/bind system

awlauria opened this issue · 7 comments

This was sent in an email by @rhc54. Posting it here for discussion:

I've been working on updating the map/rank/bind system to address several reported problems - making good progress. However, something I would like to do during this cleanup is to simplify the code for future maintenance. So I have a few things to pass along and ask about:

  • does anyone still care about the mindist mapper? I never see questions about it, so it isn't clear to me that anyone (a) even knows it exists and (b) uses it. It was originally written by Mellanox for OMPI, but my understanding is that Mellanox recommends that their users direct launch via srun and not use mpirun. If nobody cares, then I'll just whack that mapper.

  • I am considering reducing the number of ranking options. I need to preserve the ability to generate a pattern whereby you load balance (mapping by a particular object so a proc gets placed per-object) but rank sequentially (so ranks 0 and 1 are on the same object, etc). However, it isn't clear that anyone has ever used the general combinations (e.g., map-by numa, rank-by node). Those oddball combinations require a lot of code and are really slow to execute, so I'm considering removing them. Anyone care?

Note that this also allows me to map/rank/bind in a single loop, thus greatly accelerating that procedure - a key concern for the workload management community. I could maintain logic that redirects us to the old, slower methods when necessary - I would just like to remove all that code if possible.
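To make the load-balance-but-rank-sequentially case concrete, here is a hypothetical layout for four procs on one node with two NUMA domains (the numbers are illustrative, not actual PRRTE output):

  map-by numa, ranked sequentially:      numa0 holds ranks 0,1    numa1 holds ranks 2,3
  map-by numa, ranked in mapping order:  numa0 holds ranks 0,2    numa1 holds ranks 1,3

The first pattern is the one that needs preserving; the oddball cross-combinations (e.g., map-by numa, rank-by node) are the ones under consideration for removal.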

  • I am going to modify the mapping option a bit by changing "pe-list=x,y,z" (a.k.a. "cpu-list") from a qualifier to an option. We currently map-by whatever object someone specified, and then remove all the objects that don't include the specified cpus. This makes no sense and has been subject to multiple bugs in the past. Makes more sense to simply map-by those cpus from the start, mapping all procs at the node level and then binding them either "ordered" (i.e., first proc goes to the first cpu on the list, etc) or to the full pe-list. Anyone have an objection?
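As a sketch of what the proposed option might look like on the command line (the exact spelling is hypothetical at this point - this is a proposal, not final syntax):

  mpirun --map-by pe-list=0,2,4,6 ...          # map at the node level, bind each proc to the full list
  mpirun --map-by pe-list=0,2,4,6:ordered ...  # first proc bound to cpu 0, second to cpu 2, and so on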

awlauria avatar Jun 24 '22 03:06 awlauria

@abouteiller had mentioned that some of the oddball rankers can sometimes be useful for getting collectives to work well. The example was in fact the "map-by numa rank-by node" one. However, I'm wondering if that is really correct and wanted to dig a little deeper into that notion.

I'm assuming that one fiddles with things like the rank vs. map settings to try to get the collective to better optimize its communication pattern. However, it seems to me that this logic hinges on a quite likely false assumption - i.e., that two nodes can be considered communication-close if they are adjacent in the map. In other words, one is operating under the assumption that node N is communication-near to node N+1 in the allocation (since the map follows the ordering of the allocation). It would be great if this were true, but in reality it only holds in a very narrow set of scenarios.

If one truly wants to optimize a communication pattern, one needs to actually get either the communication cost matrix for the allocation, or get the connectivity map showing the node-to-switch and switch-to-switch connections. PMIx provides interfaces for obtaining both of those, although the info is only available for fabrics whose vendors have provided plugin support. Intel is working to provide such support for Slingshot, and I believe HPE is providing it via their PALS system independent of a PMIx plugin.

It is quite possible other vendors will eventually follow suit. Meantime, my point here was simply to raise awareness that attempting to use map/rank settings to optimize a communication pattern makes assumptions that are frequently incorrect. Even if you found a benefit for a particular allocation (because it just so happened that the mapping and allocation came out "right"), it will quite likely prove non-beneficial for a different allocation on the same machine - and almost certainly will be non-beneficial on other environments.

rhc54 avatar Jun 25 '22 22:06 rhc54

I am certain no code currently in OMPI optimizes the collective communication pattern based on rank closeness; the code stays away from such assumptions because, as you mentioned, they are most certainly incorrect. I think @abouteiller was referring to the flexibility the current mapping/ranking scheme offers to map a fixed collective topology onto what the human running the benchmark knows to be the hardware distribution. In any case, it is mostly an additional tool for performance validation and has no purpose outside benchmarking / performance fiddling.

bosilca avatar Jun 27 '22 17:06 bosilca

Makes sense - thanks for the clarification!

Given that understanding and the lack of PRRTE support resources (which goes back to the beginning of the ORTE days), I think the best path forward will be to remove the old ranking code so we minimize the maintenance effort. I can certainly cover all the typical user requests, and may be able to cover pretty much everything (still playing with algorithms).

If we hit a combination we cannot support with this new approach, we will detect it before we even start to map. Such a case is going to be something exotic, not a typical use-case, so I think it is okay in those circumstances to simply refer the user to the sequential and rank-file mappers and return a "failed to map" error.

I suspect this will be a very rare occurrence - as I said, it looks like I'm covering most (if not all) of the prior combinations. If some fall through the cracks and someone over here really wants to support those patterns, perhaps an OMPI Python script could be written and distributed that takes the pattern and generates a rank-file output that PRRTE can ingest/parse.
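For illustration, such a script might emit something like the following rank-file, using the "rank N=host slot=..." form the rank-file mapper already parses (hostnames and slot assignments here are hypothetical):

  rank 0=nodeA slot=0:0
  rank 1=nodeA slot=1:0
  rank 2=nodeB slot=0:0
  rank 3=nodeB slot=1:0

This would reproduce, say, a per-package pattern across two nodes without any support needed in the PRRTE mappers themselves.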

rhc54 avatar Jun 29 '22 21:06 rhc54

Here is an example of one combination I am considering for "not supported":

map-by node bind-to core rank-by numa

I cannot come up with a rationale for specifying such an arrangement, but it technically was supported in the past. It requires that you bind everything first and then do a brute-force search to determine which NUMA each core is in so you can then properly order the rank values. That takes multiple loops through the procs, so it is rather slow at scale and prone to error.
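To see why, consider a hypothetical pair of nodes, each with two NUMA domains of two cores apiece, and eight procs. Map-by node deals the procs out round-robin across the nodes, and bind-to core then pins each node's procs to successive cores - at which point a node's procs are interleaved across its NUMA domains in core order. Only after all of that can the ranker walk back through every proc, look up which NUMA its core belongs to, and renumber by NUMA - the brute-force search described above.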

If someone believes a layout such as this is of importance and broadly used, then I suppose we should continue to compute it. Otherwise, I propose to defer this to rank-file or sequential mappers.

rhc54 avatar Jun 30 '22 14:06 rhc54

After prototyping a number of possible solutions, I am converging towards the following:

--map-by

Only change here was to move "pe-list" into the supported options and to add the "ordered" qualifier.

--rank-by

This is the place of greatest change. The object-based options (e.g., package, numa, etc.) have been removed. Ranking follows the mapping level, so you don't get to mix/match any more. Supported options are:

  • slot
  • node
  • fill - rank all the procs in a given object sequentially before moving to the next one
  • object - rank one proc on each object on a node, looping across the objects until all procs on that node have been ranked
  • span - rank by object, treating all the objects in the allocation as belonging to one giant "super-node"

The last three are illustrated after the --bind-to entry below.

--bind-to

No change.
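As an illustration of the three object-ranking qualifiers, take a hypothetical allocation of two nodes, each with two packages holding two procs apiece (eight procs total, mapped by package). The layouts below follow the definitions above and are not actual tool output:

  fill:    node0 -> pkg0 {0,1}  pkg1 {2,3}    node1 -> pkg0 {4,5}  pkg1 {6,7}
  object:  node0 -> pkg0 {0,2}  pkg1 {1,3}    node1 -> pkg0 {4,6}  pkg1 {5,7}
  span:    node0 -> pkg0 {0,4}  pkg1 {1,5}    node1 -> pkg0 {2,6}  pkg1 {3,7}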

I have disabled the mindist mapper pending someone requesting we keep it alive - I'll remove it if I don't hear anything in a couple of weeks. I should have these changes done and committed in the next few days. I am also porting the ORTE map/rank/bind test from the ompi-tests repo to the pmix-tests one and updating it to support these new options.

rhc54 avatar Jul 01 '22 17:07 rhc54

I have completed this work and will be posting a PR for PRRTE that implements it. Given no input on the mindist mapper, it will include removal of that component. Someone can restore it if they care at some later point.

rhc54 avatar Jul 15 '22 03:07 rhc54

Oh, I guess someone with proper authorizations can close this now.

rhc54 avatar Jul 15 '22 03:07 rhc54

Closing based on above comments.

awlauria avatar Oct 11 '22 20:10 awlauria