osc/rdma: force enabling osc rdma get_accumulate fails with data integrity issue
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
OMPI main
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
git clone
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
$ git submodule status
41f4225d6fb806ff218eb229a9a25baf5a97c5fa 3rd-party/openpmix (v1.1.3-3573-g41f4225d)
0b580da7c8952a95a39a2cdb5d13b3453fb934ce 3rd-party/prrte (psrvr-v2.0.0rc1-4383-g0b580da7c8)
Please describe the system on which you are running
- Operating system/version: RHEL8.4
- Computer hardware: ppc64le
Details of the problem
Minimal test that reproduces the issue:
https://gist.github.com/AboorvaDevarajan/f8d2602b4eda27c4083a1b85fad4503c
Error:
$ $MPI_ROOT/bin/mpirun --tag-output --mca pml ob1 --mca btl tcp,sm,self --mca backtrace_lwcore_enable t --get-stack-traces --report-state-on-timeout --prefix $MPI_ROOT -x LD_LIBRARY_PATH --mca coll basic,inter,libnbc,self --mca osc rdma -np 2 ./test
[prterun-c685f8n02-4112871@1,0]<stdout>: CASE 2: count: 0 PASS
[prterun-c685f8n02-4112871@1,0]<stdout>: CASE 2: rank : 0 result : 0 expected : 1 win_ptr : 0 expected: 0 origin_ptr 0 expected: 0
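It looks like the result buffer of get_accumulate does not contain the expected values (result is 0 where 1 is expected, while the window and origin buffers match). Is this path (osc rdma, force-enabled over btl tcp,sm,self) supported?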
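To make the failing line easier to read, here is a minimal sketch of the get_accumulate semantics defined by the MPI standard: the result buffer receives the target window contents as they were before the operation, and the window is then updated with the origin data. This is a plain Python model, not the linked test and not Open MPI code. The names `get_accumulate`, `mismatches`, and `OPS` are hypothetical, and the choice of a window value of 1, an origin value of 0, and `MPI_REPLACE` is only an assumption that happens to produce the expected values shown in the output (result 1, window 0, origin 0); the gist may use something else.

```python
import operator

# Hypothetical stand-ins for a few MPI reduction ops.
# Each maps (current target value, origin value) -> new target value.
OPS = {
    "MPI_SUM": operator.add,
    "MPI_REPLACE": lambda target, origin: origin,
    "MPI_NO_OP": lambda target, origin: target,
}


def get_accumulate(origin, window, op):
    """Model of fetch-then-accumulate on a list used as the target window.

    Returns the window contents as they were before the update (what the
    result buffer should hold) and updates `window` in place.
    """
    if len(origin) != len(window):
        raise ValueError("origin and window must have the same length")
    result = list(window)  # snapshot taken before the accumulate
    combine = OPS[op]
    for i, value in enumerate(origin):
        window[i] = combine(window[i], value)
    return result


def mismatches(actual, expected):
    """Return the indices at which two buffers differ."""
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


# Assumed scenario: window starts at 1, origin is 0, op is MPI_REPLACE.
window = [1]
origin = [0]
result = get_accumulate(origin, window, "MPI_REPLACE")
print("result:", result, "window:", window, "origin:", origin)
```

Under these assumptions a correct implementation leaves `result == [1]`, `window == [0]`, and `origin == [0]`. The reported output has the window and origin values right and only the result value wrong, which is consistent with a problem on the path that delivers the fetched data back into the result buffer.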
The unpack of the result buffer appears to be incorrect, and this patch appears to fix the issue: https://github.com/open-mpi/ompi/pull/10634
Issue is fixed (merged PRs in release and master branches):
- master: https://github.com/open-mpi/ompi/pull/10634
- v5.0.x: https://github.com/open-mpi/ompi/pull/10655
- v4.1.x: https://github.com/open-mpi/ompi/pull/10654
- v4.0.x: Issue is not present