ompi icon indicating copy to clipboard operation
ompi copied to clipboard

coll/HAN: Don't DQ HAN dynamic @ intra-node subcomm + typo fixes

Open gkatev opened this issue 2 years ago • 17 comments

HAN disables itself when running in a single node, but that shouldn't include the subcommunicator created by HAN-dynamic. See also #10438.

Tested on v5.0.x Signed-off-by: George Katevenis [email protected]

gkatev avatar Jun 08 '22 14:06 gkatev

Can one of the admins verify this patch?

ompiteam-bot avatar Jun 08 '22 14:06 ompiteam-bot

@FlorentGermain-Bull

gkatev avatar Jun 08 '22 14:06 gkatev

ok to test

awlauria avatar Jun 08 '22 16:06 awlauria

Fixing the typos is good, but enabling HAN at the node level needs to be backed by some evidence. Why is this necessary ? In which case this leads to improved performance ?

bosilca avatar Jun 08 '22 16:06 bosilca

This doesn't actually allow HAN to run at the node-level, but rather HAN's intra-node subcommunicator. (note the added INTRA_NODE check in the diff)

When mca_coll_han_comm_create_new creates the inter/intra comms, it sets the INTRA_NODE/INTER_NODE info keys. These sub-comms also use HAN (#10456). HAN's dynamic functions detect if the current communicator is a sub-communicator, via the info key (topo_lvl), and they delegate to the respective component.

gkatev avatar Jun 08 '22 17:06 gkatev

The first part of the check will only succeed if all processes are local (aka spawned by the same RTE daemon). I don't think that can happen for an INTRA comm.

bosilca avatar Jun 08 '22 19:06 bosilca

This is the path that I have in mind:

  1. App calls Barrier
  2. Enter mca_coll_han_barrier_intra_dynamic, take 3rd if case (topo_lvl = GLOBAL_COMMUNICATOR)
  3. Enter mca_coll_han_barrier_intra_simple
  4. Enter mca_coll_han_comm_create_new
  5. Two communicators are created, both preferring HAN (assuming #10456 is merged). One communicator has INTRA_NODE set via ompi_comm_coll_han_topo_level, the other INTER_NODE
  • During low comm's creation, han is considered. Without this change, it is disqualified because the communicator spans only one node.
    • With the change, the INTRA_NODE info key is detected, and HAN is chosen for the intra-node sub-comm
  1. Control returns to mca_coll_han_barrier_intra_simple, low_comm's coll_barrier gets called
  2. Enter mca_coll_han_barrier_intra_dynamic, take 4th case (topo_lvl = INTRA_NODE)
  3. Enter the actual submodule that handles the intra-node barrier

gkatev avatar Jun 08 '22 19:06 gkatev

In the current code the step 7 will be skipped, and the control is going directly from your step 6 to your step 8. This brings me back to my original question, is there a realistic need to give the control back to HAN on our own communicators ? The only reason I can see, is if we want to support multiple levels, but I might have missed something.

bosilca avatar Jun 08 '22 21:06 bosilca

Oh I understand, I believe step 7 is necessary in order for HAN to decide which component/module should be called for the intra-node level. Unlike the non-dynamic path where this is decided at split-time, in the dynamic it is decided when the collective is called. Without this change, the component that gets used for the intra-node level is the one with the next higer priority, and the coll_han_<coll>_dynamic_intra_node_module parameter has no effect. I will perform some additional tests to make double sure things are indeed the way I describe them.

gkatev avatar Jun 08 '22 21:06 gkatev

I did some extra tests for completeness (2 nodes, 2 ranks each)

Before change:

coll_han_barrier_dynamic_intra_node_module=3 (tuned)
rank 0: Base barrier intra rec.doubling @ INTRA_NODE
rank 0: Base barrier intra rec.doubling @ INTER_NODE
rank 0: Base barrier intra rec.doubling @ INTRA_NODE

coll_han_barrier_dynamic_intra_node_module=1 (basic)
rank 0: Base barrier intra rec.doubling @ INTRA_NODE
rank 0: Base barrier intra rec.doubling @ INTER_NODE
rank 0: Base barrier intra rec.doubling @ INTRA_NODE

After change:

coll_han_barrier_dynamic_intra_node_module=3 (tuned)
rank 0: Base barrier intra rec.doubling @ INTRA_NODE
rank 0: Base barrier intra rec.doubling @ INTER_NODE
rank 0: Base barrier intra rec.doubling @ INTRA_NODE

coll_han_barrier_dynamic_intra_node_module=1 (basic)
rank 0: Base barrier intra linear @ INTRA_NODE
rank 0: Base barrier intra rec.doubling @ INTER_NODE
rank 0: Base barrier intra linear @ INTRA_NODE

The thought occurs, whether HAN's dynamic function should be invoked each time a collective is called, or if the underlying module's function should be called directly instead. I'm not certain what kind of overhead the dynamic functions impose, some things do get cached, so they will only add overhead the first time. I believe we could in HAN-dynamic, instead of just call another function according to the rules, also update the communicator to not go through HAN in future calls.

gkatev avatar Jun 09 '22 06:06 gkatev

There are two conigurations type in han.

If we only use configuration through MCA parameters, yes we can update function pointers of c_coll in communicator structure at first dynamic call.

If we use a configuration file, module called can vary regarding message size. In that case, updating c_coll function pointers is not possible.

To add more value to dynamic choice of han, there is an example where it is usefull: Let A and B be two components providing bcast and reduce implementation. Has its granularity is on module scale, component choice by priority do not allow me to use bcast from A and reduce from B.

FlorentGermain-Bull avatar Jun 09 '22 07:06 FlorentGermain-Bull

Ah okay make sense, I wasn't aware of the possibility for message-size-based selection

gkatev avatar Jun 09 '22 07:06 gkatev

Did the controversial parts of this PR become #10456? Should this PR be reduced to just the typo fixes, or closed if it is fully replaced by #10456?

jsquyres avatar Jul 17 '22 12:07 jsquyres

No the two PRs don't currently overlap

gkatev avatar Jul 17 '22 13:07 gkatev

No the two PRs don't currently overlap

Cool. Where are we on this PR, then?

jsquyres avatar Jul 19 '22 14:07 jsquyres

IMO, it's ready and good to go. @bosilca, did my explanations above address the concerns?

Edit: To clarify the relation of this PR with #10456:

  • #10456 reverts a change that effectively disabled HAN's "dynamic" path. At the same time, it fixes the original bug (a typo) that triggered the PR (#8250) that disabled the dynamic path
  • This PR is for the dynamic path, and so is related to the aforementioned ones in that it is only meaningful when the path is not disabled
    • The contained fix allows HAN's dynamic sub-module selection to properly function on the intra-node sub-comm

gkatev avatar Jul 19 '22 14:07 gkatev

@bosilca / @devreal ping - please review.

awlauria avatar Aug 19 '22 15:08 awlauria

@gkatev Can you rebase this PR so that it picks up the new CI? Thanks!

jsquyres avatar Sep 28 '22 15:09 jsquyres

Should we also apply this to 5.0.x? (let me know and I will make a PR)

gkatev avatar Oct 17 '22 08:10 gkatev