
Coll/han: Improvements to algorithm management through MCA and configuration file

Open FlorentGermain-Bull opened this issue 2 years ago • 7 comments

Allow topological level to be named in configuration file

When reading the configuration file, try to parse the topological level as a string (its name) first, then as a numeric id.
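The name-then-id fallback can be sketched as follows. This is an illustrative Python sketch, not the C code from the patch: only the name `global_communicator` appears in this description, so the other level names, all numeric ids, and the helper `parse_topo_level` are assumptions.

```python
# Hypothetical name -> id table. Only "global_communicator" is taken from the
# PR text; the other names and every numeric id are assumed for illustration.
TOPO_LEVELS = {"intra_node": 0, "inter_node": 1, "global_communicator": 2}


def parse_topo_level(token):
    """Resolve a config token to a level id: by name first, then as an integer id."""
    token = token.strip()
    # First attempt: interpret the token as a level name.
    if token in TOPO_LEVELS:
        return TOPO_LEVELS[token]
    # Second attempt: interpret the token as a numeric id.
    try:
        level_id = int(token)
    except ValueError:
        raise ValueError(f"unknown topological level: {token!r}") from None
    if level_id not in TOPO_LEVELS.values():
        raise ValueError(f"topological level id out of range: {level_id}")
    return level_id
```

With this table, `parse_topo_level("global_communicator")` and `parse_topo_level("2")` resolve to the same id, so files that use numeric ids keep working alongside files that use names.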

Improve algorithm management and choice

Unify the algorithm selection mechanism. The translation table from algorithm name to function pointer is set in ompi/mca/coll/han/coll_han_algos.c using mca_base_var_enum_value_t entries.
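A name-to-function translation table of this kind might look like the sketch below. It is written in Python for brevity and does not reproduce the contents of coll_han_algos.c: `EnumValue` only mimics the (value, string) shape of an enum entry, the algorithm names `intra` and `simple` come from the configuration example in this description, and the two functions and the numeric values are placeholders.

```python
from typing import Callable, List, NamedTuple, Tuple


class EnumValue(NamedTuple):
    """Mimics a (value, string) enum entry; an assumption for this sketch."""
    value: int
    string: str


# Placeholder implementations; the real algorithms live in the han component.
def allreduce_intra(data):
    return ("intra", sum(data))


def allreduce_simple(data):
    return ("simple", sum(data))


# One table per collective: each entry pairs an enum value with its function.
ALLREDUCE_ALGOS: List[Tuple[EnumValue, Callable]] = [
    (EnumValue(1, "intra"), allreduce_intra),
    (EnumValue(2, "simple"), allreduce_simple),
]


def find_algorithm(table, name):
    """Return the function registered under `name`, or None if it is unknown."""
    for entry, func in table:
        if entry.string == name:
            return func
    return None
```

Keeping the name and the function in the same table means the MCA parameter and the configuration file can share one lookup path.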

Allow algorithm selection (optional) in configuration file

The algorithm can be chosen directly in the configuration file for the han component (see the configuration file example).

Simplify algorithm choice through MCA parameters

Algorithms are selected by name through an enum.

Configuration file example

1 # Number of collectives described in this file
allreduce # Set of rules for allreduce collectives
    1 # How many topological levels are described in this file
    global_communicator # Topological level
        1 # Number of configurations
        1 # Configuration size (communicator size on this level)
            4 # Number of message size rules
            0 han @intra # For message sizes 0 to 999, use the intra algorithm of the han component
            1000 han # For message sizes 1000 to 7999, use the default algorithm of the han component
            8000 han @simple # For message sizes 8000 to 19999, use the simple algorithm of the han component
            20000 tuned # Fall back on tuned for message sizes of 20000 and above

Note: Han can only be used on the global_communicator level.
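To make the file layout concrete, here is a minimal Python sketch that reads a file shaped like the example above and picks the rule for a given message size. It is not the parser from the patch: the functions `parse_rules` and `select_rule` are hypothetical, and the selection logic (the last rule whose threshold does not exceed the message size wins) is inferred from the comments in the example.

```python
EXAMPLE = """\
1 # Number of collectives described in this file
allreduce # Set of rules for allreduce collectives
    1 # How many topological levels are described in this file
    global_communicator # Topological level
        1 # Number of configurations
        1 # Configuration size (communicator size on this level)
            4 # Number of message size rules
            0 han @intra
            1000 han
            8000 han @simple
            20000 tuned
"""


def parse_rules(text):
    """Parse the nested count/value layout into {(collective, level, comm_size): rules}."""
    # Drop comments, split each line into tokens, and skip empty lines.
    tokens = [line.split("#", 1)[0].split() for line in text.splitlines()]
    rows = iter([row for row in tokens if row])

    def next_int():
        return int(next(rows)[0])

    rules = {}
    for _ in range(next_int()):                 # collectives
        coll = next(rows)[0]
        for _ in range(next_int()):             # topological levels
            level = next(rows)[0]
            for _ in range(next_int()):         # configurations
                comm_size = next_int()
                msg_rules = []
                for _ in range(next_int()):     # message size rules
                    row = next(rows)
                    # The optional third token is "@algorithm".
                    algo = row[2].lstrip("@") if len(row) > 2 else None
                    msg_rules.append((int(row[0]), row[1], algo))
                rules[(coll, level, comm_size)] = msg_rules
    return rules


def select_rule(msg_rules, msg_size):
    """Return (component, algorithm) of the last rule whose threshold <= msg_size."""
    chosen = None
    for threshold, component, algo in msg_rules:
        if threshold <= msg_size:
            chosen = (component, algo)
    return chosen
```

For instance, `select_rule(parse_rules(EXAMPLE)[("allreduce", "global_communicator", 1)], 500)` picks the `han` component with the `intra` algorithm, while an algorithm of `None` stands for the component's default.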

Set of MCA parameters to read a han configuration file:

# Han must be selected to be used
export OMPI_MCA_coll_han_priority=100

# Activate file reading
export OMPI_MCA_coll_han_use_dynamic_file_rules=true

# Set file path
export OMPI_MCA_coll_han_dynamic_rules_filename=path/to/configuration_file
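When the job is launched from a script rather than an interactive shell, the same three parameters can be put into the environment programmatically. The sketch below relies only on the `OMPI_MCA_<parameter>` naming used in the exports above; the helper `mca_env` is hypothetical, and the file path is the same placeholder as in the export.

```python
import os


def mca_env(params, base=None):
    """Return a copy of the environment with OMPI_MCA_<name> set for each parameter."""
    env = dict(os.environ if base is None else base)
    for name, value in params.items():
        # Booleans are written as "true"/"false", matching the export above.
        if isinstance(value, bool):
            value = "true" if value else "false"
        env[f"OMPI_MCA_{name}"] = str(value)
    return env


han_env = mca_env(
    {
        "coll_han_priority": 100,                  # han must be selected to be used
        "coll_han_use_dynamic_file_rules": True,   # activate file reading
        "coll_han_dynamic_rules_filename": "path/to/configuration_file",
    },
    base={},
)
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when starting the MPI launcher.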

FlorentGermain-Bull avatar Sep 21 '22 14:09 FlorentGermain-Bull

Can one of the admins verify this patch?

ompiteam-bot avatar Sep 21 '22 14:09 ompiteam-bot

ok to test

jsquyres avatar Sep 21 '22 14:09 jsquyres

ok to test

awlauria avatar Sep 21 '22 14:09 awlauria

bot:ibm:retest

gpaulsen avatar Sep 21 '22 20:09 gpaulsen

@FlorentGermain-Bull Would you be able to rebase your branch on main somewhere after 7dbfbeea - build: Use open-mpi/oac for oac submodule commit? We're having an issue with the IBM CI when it tries to test a Pull Request that doesn't include that commit.

gpaulsen avatar Sep 21 '22 21:09 gpaulsen

@FlorentGermain-Bull And be sure to see https://www.mail-archive.com/[email protected]/msg21421.html

jsquyres avatar Sep 21 '22 21:09 jsquyres

bot:ibm:retest

gpaulsen avatar Sep 21 '22 21:09 gpaulsen

FYI it looks like all changes proposed in #10456 are also included here

gkatev avatar Sep 22 '22 07:09 gkatev

It worked! Thanks. I've heard that Mellanox is working on their CI, so no action is needed on your part for that.

gpaulsen avatar Sep 22 '22 13:09 gpaulsen

bot:aws:retest

bwbarrett avatar Sep 28 '22 02:09 bwbarrett

bot:aws:retest

gpaulsen avatar Sep 29 '22 14:09 gpaulsen

@FlorentGermain-Bull can you rebase this on top of current main if it is still something you want to get in? Thanks

awlauria avatar Oct 19 '22 14:10 awlauria

@bosilca please review so we can get this into v5.

awlauria avatar Oct 24 '22 19:10 awlauria

@FlorentGermain-Bull Are you planning to bring this back to 5.0.x?

devreal avatar Oct 26 '22 14:10 devreal

@FlorentGermain-Bull Are you planning to bring this back to 5.0.x?

Sorry for the late reply, I'm on it.

FlorentGermain-Bull avatar Nov 08 '22 07:11 FlorentGermain-Bull