ompi
ompi copied to clipboard
coll/HAN: Don't DQ HAN dynamic @ intra-node subcomm + typo fixes
HAN disables itself when running in a single node, but that shouldn't include the subcommunicator created by HAN-dynamic. See also #10438.
Tested on v5.0.x Signed-off-by: George Katevenis [email protected]
Can one of the admins verify this patch?
@FlorentGermain-Bull
ok to test
Fixing the typos is good, but enabling HAN at the node level needs to be backed by some evidence. Why is this necessary ? In which case this leads to improved performance ?
This doesn't actually allow HAN to run at the node-level, but rather HAN's intra-node subcommunicator. (note the added INTRA_NODE check in the diff)
When mca_coll_han_comm_create_new creates the inter/intra comms, it sets the INTRA_NODE/INTER_NODE info keys. These sub-comms also use HAN (#10456). HAN's dynamic functions detect if the current communicator is a sub-communicator, via the info key (topo_lvl), and they delegate to the respective component.
The first part of the check will only succeed if all processes are local (aka spawned by the same RTE daemon). I don't think that can happen for an INTRA comm.
This is the path that I have in mind:
- App calls Barrier
- Enter mca_coll_han_barrier_intra_dynamic, take 3rd if case (topo_lvl = GLOBAL_COMMUNICATOR)
- Enter mca_coll_han_barrier_intra_simple
- Enter mca_coll_han_comm_create_new
- Two communicators are created, both preferring HAN (assuming #10456 is merged). One communicator has INTRA_NODE set via ompi_comm_coll_han_topo_level, the other INTER_NODE
- During low comm's creation, han is considered. Without this change, it is disqualified because the communicator spans only one node.
- With the change, the INTRA_NODE info key is detected, and HAN is chosen for the intra-node sub-comm
- Control returns to mca_coll_han_barrier_intra_simple, low_comm's coll_barrier gets called
- Enter mca_coll_han_barrier_intra_dynamic, take 4th case (topo_lvl = INTRA_NODE)
- Enter the actual submodule that handles the intra-node barrier
In the current code the step 7 will be skipped, and the control is going directly from your step 6 to your step 8. This brings me back to my original question, is there a realistic need to give the control back to HAN on our own communicators ? The only reason I can see, is if we want to support multiple levels, but I might have missed something.
Oh I understand, I believe step 7 is necessary in order for HAN to decide which component/module should be called for the intra-node level. Unlike the non-dynamic path where this is decided at split-time, in the dynamic it is decided when the collective is called. Without this change, the component that gets used for the intra-node level is the one with the next higer priority, and the coll_han_<coll>_dynamic_intra_node_module
parameter has no effect. I will perform some additional tests to make double sure things are indeed the way I describe them.
I did some extra tests for completeness (2 nodes, 2 ranks each)
Before change:
coll_han_barrier_dynamic_intra_node_module=3 (tuned)
rank 0: Base barrier intra rec.doubling @ INTRA_NODE
rank 0: Base barrier intra rec.doubling @ INTER_NODE
rank 0: Base barrier intra rec.doubling @ INTRA_NODE
coll_han_barrier_dynamic_intra_node_module=1 (basic)
rank 0: Base barrier intra rec.doubling @ INTRA_NODE
rank 0: Base barrier intra rec.doubling @ INTER_NODE
rank 0: Base barrier intra rec.doubling @ INTRA_NODE
After change:
coll_han_barrier_dynamic_intra_node_module=3 (tuned)
rank 0: Base barrier intra rec.doubling @ INTRA_NODE
rank 0: Base barrier intra rec.doubling @ INTER_NODE
rank 0: Base barrier intra rec.doubling @ INTRA_NODE
coll_han_barrier_dynamic_intra_node_module=1 (basic)
rank 0: Base barrier intra linear @ INTRA_NODE
rank 0: Base barrier intra rec.doubling @ INTER_NODE
rank 0: Base barrier intra linear @ INTRA_NODE
The thought occurs, whether HAN's dynamic function should be invoked each time a collective is called, or if the underlying module's function should be called directly instead. I'm not certain what kind of overhead the dynamic functions impose, some things do get cached, so they will only add overhead the first time. I believe we could in HAN-dynamic, instead of just call another function according to the rules, also update the communicator to not go through HAN in future calls.
There are two conigurations type in han.
If we only use configuration through MCA parameters, yes we can update function pointers of c_coll in communicator structure at first dynamic call.
If we use a configuration file, module called can vary regarding message size. In that case, updating c_coll function pointers is not possible.
To add more value to dynamic choice of han, there is an example where it is usefull: Let A and B be two components providing bcast and reduce implementation. Has its granularity is on module scale, component choice by priority do not allow me to use bcast from A and reduce from B.
Ah okay make sense, I wasn't aware of the possibility for message-size-based selection
Did the controversial parts of this PR become #10456? Should this PR be reduced to just the typo fixes, or closed if it is fully replaced by #10456?
No the two PRs don't currently overlap
No the two PRs don't currently overlap
Cool. Where are we on this PR, then?
IMO, it's ready and good to go. @bosilca, did my explanations above address the concerns?
Edit: To clarify the relation of this PR with #10456:
- #10456 reverts a change that effectively disabled HAN's "dynamic" path. At the same time, it fixes the original bug (a typo) that triggered the PR (#8250) that disabled the dynamic path
- This PR is for the dynamic path, and so is related to the aforementioned ones in that it is only meaningful when the path is not disabled
- The contained fix allows HAN's dynamic sub-module selection to properly function on the intra-node sub-comm
@bosilca / @devreal ping - please review.
@gkatev Can you rebase this PR so that it picks up the new CI? Thanks!
Should we also apply this to 5.0.x? (let me know and I will make a PR)