Enable distributed parallelism
Motivation
As of today, working with distributed parallelism in Fortran mostly implies using either MPI or coarrays, and one has to decide early on which of the two to rely on. I would like to propose that stdlib wrap certain basic reduction operators so that, through C preprocessing, they can be backed by either one; other procedures in stdlib could then profit from such a wrapper.
I'll try to give a picture with a very simple example: computing the norm2 of a distributed 1D array. This operation requires a parallel sum reduction before computing the square root:
<kind> :: x(:) !> in SPMD mode, each process/image holds a partial portion of the array
<kind> :: local_sum , global_sum
...
local_sum = dot_product( x, x ) !> this sum is incomplete with respect to the distributed data
#if defined(STDLIB_WITH_MPI)
call MPI_Allreduce(local_sum, global_sum, 1, MPI_<kind>, MPI_SUM, MPI_COMM_WORLD, ierr)
#elif defined(STDLIB_WITH_COARRAY)
global_sum = local_sum
call co_sum( global_sum )
#endif
...
norm2 = sqrt( global_sum )
If stdlib proposed wrappers for these reduction operators, it could be possible to make some of its functionalities also work transparently on distributed frameworks. The idea could consist of having a module stdlib_distributed or stdlib_coarray (to promote coarray-like syntax?) and then:
module stdlib_<name_to_choose>
#if defined(STDLIB_WITH_MPI)
use mpi_f08 !> provides MPI_Allreduce, MPI_IN_PLACE, MPI_SUM, MPI_COMM_WORLD
#endif
interface stdlib_co_sum
module procedure :: stdlib_co_sum_<kind>
...
end interface
contains
subroutine stdlib_co_sum_<kind>( A, result_image, stat, errmsg)
<kind>, intent(inout) :: A(..)
integer, intent(in), optional :: result_image
integer, intent(out), optional :: stat
character(*), intent(inout), optional :: errmsg
...
select rank(A)
rank(0)
#if defined(STDLIB_WITH_MPI)
call MPI_Allreduce(MPI_IN_PLACE, A, 1, MPI_<kind>, MPI_SUM, MPI_COMM_WORLD, ierr) !> in-place reduction of the scalar
#elif defined(STDLIB_WITH_COARRAY)
call co_sum(A, result_image, stat, errmsg)
#endif
rank(1)
#if defined(STDLIB_WITH_MPI)
call MPI_Allreduce(MPI_IN_PLACE, A, size( A ), MPI_<kind>, MPI_SUM, MPI_COMM_WORLD, ierr)
#elif defined(STDLIB_WITH_COARRAY)
call co_sum(A, result_image, stat, errmsg)
#endif
rank(2)
#if defined(STDLIB_WITH_MPI)
call MPI_Allreduce(MPI_IN_PLACE, A, size( A ), MPI_<kind>, MPI_SUM, MPI_COMM_WORLD, ierr)
#elif defined(STDLIB_WITH_COARRAY)
call co_sum(A, result_image, stat, errmsg)
#endif
...
end select
end subroutine
end module
This way, if one does not link against either of them, the wrappers do nothing and return the value unchanged. If linked, one can rely on stdlib as an intermediate wrapper.
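For illustration, a rough sketch of how the norm2 example from the motivation could look once such a wrapper exists (stdlib_co_sum being the hypothetical name used in the sketch above):
<kind> :: x(:) !> each process/image holds its local portion of the array
<kind> :: s
...
s = dot_product( x, x ) !> local partial sum
call stdlib_co_sum( s ) !> no-op in a serial build; MPI_Allreduce or co_sum when linked accordingly
norm2 = sqrt( s )
The same source would then compile and run serially, with MPI, or with coarrays, the backend being chosen at build time.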
I haven't fully thought this through, but I would like to open it for discussion.
Prior Art
No response
Additional Information
No response
I like the idea of having implementations agnostic to the distributed computation strategy being used. I only have experience with MPI though, so I can't really tell for the others. In the case of MPI, how would you handle MPI_COMM_WORLD? A stdlib_set_comm_world function that writes to a module-level variable so that it is available to the actual MPI calls inside this module?
I also mainly have experience with MPI, but I like the promise of coarray syntax, so I'm looking forward to the day I might be able to use it :)
how would you handle MPI_COMM_WORLD? A stdlib_set_comm_world function that writes to a module-level variable so that it is available to the actual MPI calls inside this module?
This could be a good solution.
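A minimal sketch of what that could look like, assuming a hypothetical module-level variable stdlib_comm and a setter stdlib_set_comm_world; the wrapped reductions would then pass stdlib_comm instead of hard-coding MPI_COMM_WORLD, and could fall back to MPI_COMM_WORLD when the setter has never been called:
#if defined(STDLIB_WITH_MPI)
use mpi_f08, only: MPI_Comm
type(MPI_Comm), protected :: stdlib_comm !> module-level communicator used by the wrapped MPI calls
#endif
...
subroutine stdlib_set_comm_world( comm )
   type(MPI_Comm), intent(in) :: comm
   stdlib_comm = comm !> e.g. MPI_COMM_WORLD or any sub-communicator
end subroutine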