osc/rdma: force enabling osc rdma get_accumulate fails with data integrity issue

Open AboorvaDevarajan opened this issue 3 years ago • 1 comments

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

OMPI main

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

$ git submodule status
41f4225d6fb806ff218eb229a9a25baf5a97c5fa 3rd-party/openpmix (v1.1.3-3573-g41f4225d)
0b580da7c8952a95a39a2cdb5d13b3453fb934ce 3rd-party/prrte (psrvr-v2.0.0rc1-4383-g0b580da7c8)

Please describe the system on which you are running

  • Operating system/version: RHEL8.4
  • Computer hardware: ppc64le

Details of the problem

Minimal Test that recreates the issue:

https://gist.github.com/AboorvaDevarajan/f8d2602b4eda27c4083a1b85fad4503c

Error:

$ $MPI_ROOT/bin/mpirun --tag-output --mca pml ob1 --mca btl tcp,sm,self --mca backtrace_lwcore_enable t --get-stack-traces --report-state-on-timeout --prefix $MPI_ROOT -x LD_LIBRARY_PATH --mca coll basic,inter,libnbc,self --mca osc rdma -np 2 ./test

[prterun-c685f8n02-4112871@1,0]<stdout>: CASE 2: count: 0 PASS
[prterun-c685f8n02-4112871@1,0]<stdout>: CASE 2: rank : 0 result : 0 expected : 1 win_ptr : 0 expected: 0 origin_ptr 0 expected: 0

It looks like the result buffer in get_accumulate does not hold the expected values. Is this path (OSC RDMA) supported?

AboorvaDevarajan • Aug 08 '22 05:08

Unpacking of the result buffer appears to be erroneous, and this patch appears to fix the issue: https://github.com/open-mpi/ompi/pull/10634

AboorvaDevarajan • Aug 08 '22 05:08

The issue is fixed. Merged PRs in the master and release branches:

  • master: https://github.com/open-mpi/ompi/pull/10634
  • v5.0.x: https://github.com/open-mpi/ompi/pull/10655
  • v4.1.x: https://github.com/open-mpi/ompi/pull/10654
  • v4.0.x: issue is not present

AboorvaDevarajan • Aug 25 '22 12:08