atlas icon indicating copy to clipboard operation
atlas copied to clipboard

Outermost mpi::Scope does not restore prior communicator

Open fmahebert opened this issue 1 year ago • 8 comments

What happened?

The outermost atlas mpi::Scope does not restore the prior default eckit MPI communicator when it goes out of scope.

See code snippet below.

I suspect the problem is this:

  • the Scope destructor calls CommStack::pop
  • the pop decrements size_ but then uses name(), which evaluates stack_[size_-1];
  • when destructing the outermost Scope, this is no longer a valid indexing

Perhaps one solution would be to push the current eckit default communicator when creating an outermost Scope object? Basically, if size_ == 0, then push the current default THEN push the new name?

What are the steps to reproduce the bug?

This code...

eckit::mpi::addComm("comm1", 123);  // as if from fortran
eckit::mpi::addComm("comm2", 124);  // as if from fortran
eckit::mpi::setCommDefault("comm1");
std::cout << "default eckit comm = " << eckit::mpi::comm().name() << std::endl;
{
  atlas::mpi::Scope scope("comm2");
  std::cout << "default eckit comm = " << eckit::mpi::comm().name() << std::endl;
}
std::cout << "default eckit comm = " << eckit::mpi::comm().name() << std::endl;

prints out...

default eckit comm = comm1
default eckit comm = comm2
default eckit comm = world

Whereas I would expect the last output to be default eckit comm = comm1.

Version

0.36

Platform (OS and architecture)

Linux x86_64

Relevant log output

No response

Accompanying data

No response

Organisation

JCSDA

fmahebert avatar Apr 17 '24 21:04 fmahebert

Looping in also @MayeulDestouches who brought this to my attention.

fmahebert avatar Apr 17 '24 21:04 fmahebert

Thanks for bringing this to my attention. I will try to reproduce, create a test and fix asap.

wdeconinck avatar Apr 18 '24 14:04 wdeconinck

Following reproducer doesn't reproduce it for me... Could you double check?


#include <iostream>

#include "eckit/runtime/Main.h"
#include "eckit/mpi/Comm.h"

#include "atlas/parallel/mpi/mpi.h"

int main(int argc, char* argv[]) {
    eckit::Main::initialise(argc,argv);
    int irank = eckit::mpi::comm().rank();
    auto print_default_comm_name = [&] {
       if (irank == 0 ) {
           std::cout << "default eckit comm = " << eckit::mpi::comm().name() << std::endl;
       }
    };
    eckit::mpi::comm().split(irank, "comm1");
    eckit::mpi::comm().split(irank, "comm2");
    eckit::mpi::setCommDefault("comm1");

    print_default_comm_name();
    {
        atlas::mpi::Scope scope("comm2");
        print_default_comm_name();
    }
    print_default_comm_name();
    return 0;
}

Output:

default eckit comm = comm1
default eckit comm = comm2
default eckit comm = comm1

wdeconinck avatar Apr 18 '24 14:04 wdeconinck

@wdeconinck Thanks for taking a look. I confirm I see the same (= correct) output from your code snippet. Now to figure out the difference between this case and what JEDI is doing... I'll post an update when I know more.

fmahebert avatar Apr 18 '24 15:04 fmahebert

Thanks @fmahebert and @wdeconinck for investigating this. I've tried playing with Willem's code snippet, but I'm unable to reproduce the issue either. Apparently we've missed something on what triggers the error...

MayeulDestouches avatar Apr 18 '24 16:04 MayeulDestouches

This code snippet more closely mirrors what's being done in the particular JEDI scenario that @MayeulDestouches and I were testing:

#include <iostream>

#include "eckit/runtime/Main.h"
#include "eckit/mpi/Comm.h"

#include "atlas/grid/Grid.h"
#include "atlas/grid/Partitioner.h"
#include "atlas/functionspace/StructuredColumns.h"
#include "atlas/parallel/mpi/mpi.h"

int main(int argc, char* argv[]) {
    eckit::Main::initialise(argc,argv);

    int irank = eckit::mpi::comm().rank();
    auto print_default_comm_name = [&] {
       if (irank == 0 ) {
           std::cout << "default eckit comm = " << eckit::mpi::comm().name() << std::endl;
       }
    };

    eckit::mpi::comm().split(irank, "comm1");
    eckit::mpi::comm().split(irank, "comm2");

    auto make_structured_columns = [] {
        const atlas::Grid grid("F15");
        const atlas::grid::Partitioner partitioner("equal_regions");
        const atlas::functionspace::StructuredColumns sc(grid, partitioner);
    };

    eckit::mpi::setCommDefault("comm1");
    print_default_comm_name();
    make_structured_columns();
    print_default_comm_name();

    eckit::mpi::setCommDefault("comm2");
    print_default_comm_name();
    make_structured_columns();
    print_default_comm_name();

    return 0;
}

And the output for this is,

default eckit comm = comm1
default eckit comm = comm1
default eckit comm = comm2
default eckit comm = comm1

(where the last line was expected to be comm2...)

Something about constructing the StructuredColumns appears to be resetting the default MPI comm. We had guessed it was the mpi::Scope in the StructuredColumns constructor, but that may have been too hasty of a guess.

@wdeconinck Can you confirm whether you see this same output and/or whether that matches your expectation?

fmahebert avatar Apr 18 '24 21:04 fmahebert

I'm wondering if for our use-case we might be better off using the mpi_comm config option for constructing the atlas objects, and not using the setCommDefault. @wdeconinck do you have a recommendation?

fmahebert avatar Apr 19 '24 01:04 fmahebert

Thank you for the new reproducer. I could simplify it a bit:

#include <iostream>

#include "eckit/runtime/Main.h"
#include "eckit/mpi/Comm.h"

#include "atlas/parallel/mpi/mpi.h"

int main(int argc, char* argv[]) {
    eckit::Main::initialise(argc,argv);
    int irank = eckit::mpi::comm().rank();
    auto print_default_comm_name = [&] {
       if (irank == 0 ) {
           std::cout << "default eckit comm = " << eckit::mpi::comm().name() << std::endl;
       }
    };
    eckit::mpi::comm().split(irank, "comm1");
    eckit::mpi::comm().split(irank, "comm2");

    auto make_structured_columns = [&] {
       atlas::mpi::Scope scope("comm2");
   };

    eckit::mpi::setCommDefault("comm1");
    print_default_comm_name();
    make_structured_columns();
    print_default_comm_name();

    eckit::mpi::setCommDefault("comm2");
    print_default_comm_name();
    make_structured_columns();
    print_default_comm_name();
 
    return 0;
}

This is definitely unintended behaviour which I want to fix.

wdeconinck avatar Apr 22 '24 13:04 wdeconinck