Regression in MPI_MIN of MPI_UNSIGNED_LONG from v4.1.1 to v4.1.2
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v4.1.2, and the bug is still present for me in a43c46f846d98870d39e25f78c1ce09e4420effc from origin/main
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
I first hit the bug after upgrading to the Open MPI package (dpkg) installed by Ubuntu 22.04, but I can also reproduce it by building either v4.1.2 or the above hash (HEAD when I cloned) of origin/main.
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
On main I see
-6c9d3dde370cfb61739ba312f7631cd0f44eac3d 3rd-party/openpmix
-df7d17d0a3886ebe17607cdbdcac666f2936740b 3rd-party/prrte
But after git checkout v4.1.1 or v4.1.2 I'm seeing no output.
Please describe the system on which you are running
- Operating system/version: Ubuntu 22.04 LTS
- Computer hardware: system76 Thelio Mega (3990X CPU, amd64, so unsigned long is 8 bytes)
- Network type: On-node
Details of the problem
I boiled down my (C++, sorry) code to the following mpimin.C:
#include <mpi.h>

#include <cstdlib>
#include <iostream>
#include <limits>

int main(int argc, const char * const * argv)
{
  MPI_Init(&argc, const_cast<char ***>(&argv));

  unsigned long min = std::numeric_limits<unsigned long>::max();
  if (argc > 1)
    min = std::strtoul(argv[1], nullptr, 0);

  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank % 2)
    min = 1;

  std::cerr << "rank " << rank << " input was " << min << std::endl;

  MPI_Allreduce (MPI_IN_PLACE, &min, 1,
                 MPI_UNSIGNED_LONG, MPI_MIN,
                 MPI_COMM_WORLD);

  if (min != 1)
    std::cerr << "min is " << min << std::endl;

  MPI_Finalize();
  return 0;
}
Compiling this with mpicxx -o mpimin.x mpimin.C and running it with mpiexec -n 2 ./mpimin.x, I get the output I expect with Open MPI 4.1.1:
rank 0 input was 18446744073709551615
rank 1 input was 1
(not necessarily with the lines in that order, naturally)
But with OpenMPI 4.1.2 I get:
rank 0 input was 18446744073709551615
min is 18446744073709551615
rank 1 input was 1
min is 18446744073709551615
The "giant number gets preferred to 1 by MPI_MIN" bug appears to occur consistently if-and-only-if the value of "giant" wouldn't fit in a signed long:
~/t/mpimin> mpiexec -n 2 ./mpimin.ompi412 0x7FFFFFFFFFFFFFFF
rank 1 input was 1
rank 0 input was 9223372036854775807
~/t/mpimin> mpiexec -n 2 ./mpimin.ompi412 0x8000000000000000
rank 0 input was 9223372036854775808
min is 9223372036854775808
rank 1 input was 1
min is 9223372036854775808
I can't seem to replicate the problem with any unsigned types other than unsigned long ... but that alone is enough to completely wreck my code; we have a lot of 8 byte unsigned types, and a lot of MPI_MIN reductions wherein one or more ranks will try to "sit out" the process by passing in ULONG_MAX and expecting the result to go unchanged.
@bosilca Any idea what happened here, perchance?
@roystgnr Thanks for the report. I think I have a lead, will file a PR shortly.
https://github.com/open-mpi/ompi/pull/10527 should fix it (seems like a copy&paste error)
Re-opening until this goes back to all release branches. Thanks @devreal
https://github.com/open-mpi/ompi/pull/10527 was ported to 5.0.x, 4.1.x, and 4.0.x. Closing this issue.