
Regression in MPI_MIN of MPI_UNSIGNED_LONG from v4.1.1 to v4.1.2

Open roystgnr opened this issue 3 years ago • 4 comments

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.1.2, and the bug is still present for me in a43c46f846d98870d39e25f78c1ce09e4420effc from origin/main

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

I first hit the bug after upgrading to the dpkg installed by Ubuntu 22.04, but I can reproduce it by building either v4.1.2 or the above hash (HEAD of origin/main when I cloned) from source.

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

On main I see

-6c9d3dde370cfb61739ba312f7631cd0f44eac3d 3rd-party/openpmix
-df7d17d0a3886ebe17607cdbdcac666f2936740b 3rd-party/prrte

But after git checkout v4.1.1 or v4.1.2 I'm seeing no output.

Please describe the system on which you are running

  • Operating system/version: Ubuntu 22.04 LTS
  • Computer hardware: system76 Thelio Mega (3990X CPU, amd64, so unsigned long is 8 bytes)
  • Network type: On-node

Details of the problem

I boiled down my (C++, sorry) code to the following mpimin.C:

#include <mpi.h>

#include <cstdlib>
#include <iostream>
#include <limits>

int main(int argc, const char * const * argv)
{
  MPI_Init(&argc, const_cast<char ***>(&argv));

  unsigned long min = std::numeric_limits<unsigned long>::max();
  if (argc > 1)
    min = std::strtoul(argv[1], nullptr, 0);

  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank % 2)
    min = 1;

  std::cerr << "rank " << rank << " input was " << min << std::endl;

  MPI_Allreduce (MPI_IN_PLACE, &min, 1,
                 MPI_UNSIGNED_LONG, MPI_MIN,
                 MPI_COMM_WORLD);

  if (min != 1)
    std::cerr << "min is " << min << std::endl;

  MPI_Finalize();

  return 0;
}

I compile this with mpicxx -o mpimin.x mpimin.C and run it with mpiexec -n 2 ./mpimin.x. With Open MPI 4.1.1 I get the output I expect:

rank 0 input was 18446744073709551615
rank 1 input was 1

(not necessarily with the lines in that order, naturally)

But with Open MPI 4.1.2 I get:

rank 0 input was 18446744073709551615
min is 18446744073709551615
rank 1 input was 1
min is 18446744073709551615

The "giant number gets preferred to 1 by MPI_MIN" bug appears to occur consistently if-and-only-if the value of "giant" wouldn't fit in a signed long:

~/t/mpimin> mpiexec -n 2 ./mpimin.ompi412 0x7FFFFFFFFFFFFFFF
rank 1 input was 1
rank 0 input was 9223372036854775807
~/t/mpimin> mpiexec -n 2 ./mpimin.ompi412 0x8000000000000000
rank 0 input was 9223372036854775808
min is 9223372036854775808
rank 1 input was 1
min is 9223372036854775808
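
The threshold at 0x8000000000000000 is what one would expect to see if the 64-bit values were being compared as signed integers at some point in the reduction. That is only a hypothesis consistent with the symptom, not a statement about the Open MPI source. A minimal sketch without MPI, where both function names are made up for illustration:

```cpp
#include <cstdint>
#include <cstring>

// Correct unsigned minimum, for reference.
std::uint64_t min_unsigned(std::uint64_t a, std::uint64_t b)
{
  return a < b ? a : b;
}

// Hypothetical faulty variant (an assumption, not Open MPI code):
// the same bit patterns are reinterpreted as signed before comparing,
// so any value with the top bit set looks negative and "wins" the min.
std::uint64_t min_as_signed(std::uint64_t a, std::uint64_t b)
{
  std::int64_t sa, sb;
  std::memcpy(&sa, &a, sizeof sa);
  std::memcpy(&sb, &b, sizeof sb);
  return sa < sb ? a : b;
}
```

Under this assumption, min_as_signed(0x7FFFFFFFFFFFFFFF, 1) still returns 1, while min_as_signed(0x8000000000000000, 1) returns the large value, matching the two runs above.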

I can't seem to replicate the problem with any unsigned types other than unsigned long ... but that alone is enough to completely wreck my code; we have a lot of 8 byte unsigned types, and a lot of MPI_MIN reductions wherein one or more ranks will try to "sit out" the process by passing in ULONG_MAX and expecting the result to go unchanged.

roystgnr, Jun 28 '22 18:06

@bosilca Any idea what happened here, perchance?

jsquyres, Jun 28 '22 19:06

@roystgnr Thanks for the report. I think I have a lead, will file a PR shortly.

devreal, Jun 28 '22 19:06

https://github.com/open-mpi/ompi/pull/10527 should fix it (seems like a copy&paste error)

devreal, Jun 28 '22 19:06

Re-opening until this goes back to all release branches. Thanks @devreal

awlauria, Jun 30 '22 13:06

https://github.com/open-mpi/ompi/pull/10527 was ported to 5.0.x, 4.1.x, and 4.0.x. Closing this issue.

devreal, Sep 07 '22 15:09