MPI_Win_allocate() fails when forced to use RDMA
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v4.0.3
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
I did not do the installation myself. I'm trying to get this information from the person who did it.
Please describe the system on which you are running
- Operating System: Red Hat 4.4.7-23
- Cluster with slurm 15.08.7
- Computer Hardware: Intel(R) Xeon(R) CPU E5-2683 v4
- Network type: Infiniband
Details of the problem
I'm trying to run a program with repetitive ring-style communication, as in the example test.cpp below. When I run the example on my cluster using multiple nodes with the command salloc -N2 --hint=compute_bound --exclusive mpirun test.o (because I am using Slurm), the output is similar to:
ID 0 Time 6.001007
ID 1 Time 6.001102
However, I was expecting times of approximately 3.0 and 6.0 seconds. In general, this wrong behavior happens when RDMA is not being used, so I decided to force the program to use RDMA with the command salloc -N2 --hint=compute_bound --exclusive mpirun --mca osc rdma test.o, but I received the following error:
[r1i1n10:08761] *** An error occurred in MPI_Win_allocate
[r1i1n10:08761] *** reported by process [418054145,1]
[r1i1n10:08761] *** on communicator MPI_COMM_WORLD
[r1i1n10:08761] *** MPI_ERR_WIN: invalid window
[r1i1n10:08761] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[r1i1n10:08761] *** and potentially your MPI job)
[service0:31286] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[service0:31286] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
I did not expect this because the cluster has InfiniBand. The first case is already strange; in the second, I do not understand the error.
Observation
- The program runs as expected on a single node with multiple processes, with or without --mca osc rdma.
- Some time ago, I ran the same code and it worked. I have not noticed any change in the code or the environment since then.
test.cpp
#include <iostream>
#include <unistd.h>
#include <mpi.h>
#include <math.h>
int main(int argc, char *argv[])
{
MPI_Win window;
int id, comm_sz;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &id);
MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
int get_number;
int next = (id+1)%comm_sz;
double t;
int *window_buffer;
MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &(window_buffer), &(window));
t = MPI_Wtime();
for (int i = 0; i < 3; i++) {
sleep(id+1);
MPI_Win_lock(MPI_LOCK_SHARED, next, 0, window);
MPI_Get(&get_number, 1, MPI_INT, next, 0, 1, MPI_INT, window);
MPI_Win_unlock(next, window);
}
printf("ID %i Time %lf\n", id, MPI_Wtime()-t);
MPI_Win_free(&window);
MPI_Finalize();
return 0;
}
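For completeness, the example above can be built with Open MPI's C++ compiler wrapper; the output name test.o simply matches the binary used in the commands above:
mpic++ test.cpp -o test.o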
@jotabf Thanks for the report. Can you check whether OMPI was built with UCX support? ompi_info | grep ucx
should give some hint.
@devreal Thanks for the fast answer. I think not; I got no output from the command ompi_info | grep ucx.
Can you post the output of running your test with --mca osc_verbose 100, both with and without forcing osc/rdma? That might give a clue as to what is going on...
In both cases, there was nothing different in the output. The commands and outputs follow.
salloc -N2 --hint=compute_bound --exclusive mpirun --mca osc rdma --mca osc_verbose 100 test.o
salloc: Granted job allocation 1913938
[service2:57487] *** An error occurred in MPI_Win_allocate
[service2:57487] *** reported by process [946667521,1]
[service2:57487] *** on communicator MPI_COMM_WORLD
[service2:57487] *** MPI_ERR_WIN: invalid window
[service2:57487] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[service2:57487] *** and potentially your MPI job)
[service0:23216] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[service0:23216] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
salloc: Relinquishing job allocation 1913938
salloc: Job allocation 1913938 has been revoked.
salloc -N2 --hint=compute_bound --exclusive mpirun --mca osc_verbose 100 test.o
salloc: Granted job allocation 1913940
ID 0 Time 6.000809
ID 1 Time 6.000810
salloc: Relinquishing job allocation 1913940
I think it's --mca osc_base_verbose 100
to get that verbose output.
Now we got some output:
salloc --nodes=2 --hint=compute_bound --exclusive mpirun --mca osc_base_verbose 100 test.o
salloc: Granted job allocation 1915069
[r1i2n9:03710] mca: base: components_register: registering framework osc components
[r1i2n9:03710] mca: base: components_register: found loaded component pt2pt
[r1i2n10:19737] mca: base: components_register: registering framework osc components
[r1i2n10:19737] mca: base: components_register: found loaded component pt2pt
[r1i2n9:03710] mca: base: components_register: component pt2pt register function successful
[r1i2n9:03710] mca: base: components_register: found loaded component rdma
[r1i2n10:19737] mca: base: components_register: component pt2pt register function successful
[r1i2n10:19737] mca: base: components_register: found loaded component rdma
[r1i2n9:03710] mca: base: components_register: component rdma register function successful
[r1i2n10:19737] mca: base: components_register: component rdma register function successful
[r1i2n9:03710] mca: base: components_register: found loaded component sm
[r1i2n9:03710] mca: base: components_register: component sm register function successful
[r1i2n10:19737] mca: base: components_register: found loaded component sm
[r1i2n10:19737] mca: base: components_register: component sm register function successful
[r1i2n9:03710] mca: base: components_register: found loaded component monitoring
[r1i2n9:03710] mca: base: components_register: component monitoring register function successful
[r1i2n9:03710] mca: base: components_open: opening osc components
[r1i2n10:19737] mca: base: components_register: found loaded component monitoring
[r1i2n10:19737] mca: base: components_register: component monitoring register function successful
[r1i2n10:19737] mca: base: components_open: opening osc components
[r1i2n10:19737] mca: base: components_open: found loaded component pt2pt
[r1i2n9:03710] mca: base: components_open: found loaded component pt2pt
[r1i2n9:03710] mca: base: components_open: found loaded component rdma
[r1i2n9:03710] mca: base: components_open: found loaded component sm
[r1i2n9:03710] mca: base: components_open: component sm open function successful
[r1i2n9:03710] mca: base: components_open: found loaded component monitoring
[r1i2n10:19737] mca: base: components_open: found loaded component rdma
[r1i2n10:19737] mca: base: components_open: found loaded component sm
[r1i2n10:19737] mca: base: components_open: component sm open function successful
[r1i2n10:19737] mca: base: components_open: found loaded component monitoring
[r1i2n10:19737] mca: base: close: unloading component monitoring
[r1i2n9:03710] mca: base: close: unloading component monitoring
ID 0 Time 6.000940
ID 1 Time 6.000957
[r1i2n10:19737] pt2pt component destroying window with id 3
[r1i2n9:03710] pt2pt component destroying window with id 3
salloc: Relinquishing job allocation 1915069
salloc: Job allocation 1915069 has been revoked
salloc --nodes=2 --hint=compute_bound --exclusive mpirun --mca osc rdma --mca osc_base_verbose 100 test.o
salloc: Granted job allocation 1915070
[r1i2n10:19816] mca: base: components_register: registering framework osc components
[r1i2n10:19816] mca: base: components_register: found loaded component rdma
[r1i2n10:19816] mca: base: components_register: component rdma register function successful
[r1i2n10:19816] mca: base: components_open: opening osc components
[r1i2n10:19816] mca: base: components_open: found loaded component rdma
[r1i2n9:03793] mca: base: components_register: registering framework osc components
[r1i2n9:03793] mca: base: components_register: found loaded component rdma
[r1i2n9:03793] mca: base: components_register: component rdma register function successful
[r1i2n9:03793] mca: base: components_open: opening osc components
[r1i2n9:03793] mca: base: components_open: found loaded component rdma
[r1i2n10:19816] *** An error occurred in MPI_Win_allocate
[r1i2n10:19816] *** reported by process [364249089,1]
[r1i2n10:19816] *** on communicator MPI_COMM_WORLD
[r1i2n10:19816] *** MPI_ERR_WIN: invalid window
[r1i2n10:19816] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[r1i2n10:19816] *** and potentially your MPI job)
[service0:30571] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[service0:30571] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
salloc: Relinquishing job allocation 1915070
I've made a local installation of Open MPI with UCX:
../configure --prefix=/home/jbfernandes/.local/ --with-ucx=/home/jbfernandes/.local/ --disable-man-pages
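With this build, the check suggested earlier can be used to confirm that the new installation actually picked up UCX; it should now list the UCX components:
ompi_info | grep ucx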
The error when I force RDMA disappeared. However, I still do not get the correct behavior from my algorithm.
salloc --exclusive --hint=compute_bound -N3 ~/.local/bin/mpirun --mca osc rdma -x UCX_NET_DEVICES=ib1 ./test.o
or
salloc --exclusive --hint=compute_bound -N3 ~/.local/bin/mpirun --mca osc ucx -x UCX_NET_DEVICES=ib1 ./test.o
salloc: Granted job allocation 10604
ID 0 Time 9.001626
ID 1 Time 9.001641
ID 2 Time 9.002066
salloc: Relinquishing job allocation 10604
I expect output similar to what I get when running on a single node with many processes:
salloc --exclusive --hint=compute_bound -N1 --tasks-per-node=3 ~/.local/bin/mpirun --mca osc rdma -x UCX_NET_DEVICES=ib1 ./test.o
salloc: Granted job allocation 10605
ID 0 Time 3.000265
ID 1 Time 6.000318
ID 2 Time 9.000313
salloc: Relinquishing job allocation 10605
This time, I believe something is missing from the installation to use InfiniBand appropriately.
I was just looking into this (sorry for the delay). I can reproduce the error with osc/rdma, but I'm not sure it is expected to work under all circumstances on the 4.0.x branch (osc/pt2pt was still there to cover cases where RDMA couldn't be used/detected).
I'm not sure what to make of the numbers in your results. Are you saying something is still missing or not working correctly? In any case, I recommend upgrading to the latest 4.1.x release, as there were quite a few improvements to both osc/rdma and osc/ucx.
You need to enable the uct btl for those to work. --mca btl_uct_memory_domains ib1
or whatever the memory domain is for your HCA. I intend to auto-enable it but need to find a system to test on to ensure I don't break anything.
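For reference, added to the command used earlier in this thread, that would look something like the line below; mlx4_0 is an assumed memory-domain name for the ConnectX-3 HCA discussed later, so substitute whatever your system reports:
salloc --exclusive --hint=compute_bound -N2 mpirun --mca osc rdma --mca btl_uct_memory_domains mlx4_0 ./test.o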
I think I have to explain my issue better.
I'm working in a heterogeneous environment, where each node has a different processing speed. To simulate this, I wrote the line sleep(id+1); in the code test.cpp: each process "sleeps" for a different amount of time and then gets information from another process at a different time. With RDMA operations, I was expecting the MPI_Get to complete immediately, but that is not happening. Process 0 is waiting for process 1 to run its MPI_Get before its own runs; if I have 3 processes, process 1 waits for process 2, and so on. This was my first problem. Is it clear now?
When I tried to force RDMA, the MPI_Win_allocate error described above was generated. This was my second problem.
At the moment, I have been able to remove the MPI_Win_allocate error by doing my own Open MPI installation. However, after I read your last comment (@devreal), I checked the Open MPI version and I was using 5.1.0a1, probably because I cloned from the git repository. Following your recommendation, I downgraded to 4.1.1, but the error returned when I use only rdma (--mca osc rdma). Is it better if I stay on version 5.1.0a1?
In any case, I was not successful in solving the first problem. I do not know exactly what is missing in the Open MPI installation or in the mpirun command for the osc operations (like MPI_Get) to behave appropriately.
@jotabf I tried your example with osc/ucx and get the following output, both on shared memory and with multiple nodes in an IB network, using Open MPI 4.1.1 built against UCX 1.10.0:
$ mpirun -n 4 -N 4 ./test_win_allocate9580
ID 0 Time 3.000195
ID 1 Time 6.000184
ID 2 Time 9.000180
ID 3 Time 12.000186
I believe this is what you expect, right?
I'm not sure why this does not work for you with osc/ucx. On IB networks, osc/ucx is the backend of choice, while osc/rdma is more generic, supporting networks such as Cray Aries and non-RDMA transfers using TCP. Note that osc/ucx should use the network's RDMA capabilities (i.e., that's not exclusive to osc/rdma, even though the name might suggest that).
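For instance, forcing the UCX one-sided component explicitly (the flag already appears in commands elsewhere in this thread) looks like:
mpirun --mca osc ucx ./test.o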
In your initial post, you wrote that you're using an IB network. Can you please provide more details? Which UCX version are you using?
Also, why are you explicitly specifying UCX_NET_DEVICES=ib1
? This should not be necessary (in my experience) and I wonder if that may cause issues with shared memory... (I don't know the internals of UCX though)
I believe this is what you expect, right?
Yes, it is exactly this, but I only get this result when I use shared memory.
In your initial post, you wrote that your using an IB network. Can you please provide more details? Which UCX version are you using?
IB adapter: Mellanox Technologies MT27500 Family [ConnectX-3]
I am running on a cluster on which I do not have sudo permissions, so I have been trying to use the default Open MPI installation (Open MPI v4.0.5 with UCX v1.9.0), but I am now doing my own local Open MPI installation with Open MPI v4.1.1 and UCX v1.12.0.
Also, why are you explicitly specifying UCX_NET_DEVICES=ib1? This should not be necessary (in my experience) and I wonder if that may cause issues with shared memory... (I don't know the internals of UCX though)
I did not have problems with shared memory. I specified UCX_NET_DEVICES=ib1 because I followed the UCX installation guide (https://github.com/openucx/ucx/wiki/OpenMPI-and-OpenSHMEM-installation-with-UCX), which asks to run it in this form. I've tried many variants with and without it, but it did not work. With the default Open MPI installation, I tried using UCX_NET_DEVICES=mlx4_1, because I have that option there.
Per https://github.com/openucx/ucx/wiki/OpenMPI-and-OpenSHMEM-installation-with-UCX#running-open-mpi-with-ucx, you need to use something like UCX_NET_DEVICES=mlx4_1:1. UCX_NET_DEVICES=ib0 will force using TCP over the IPoIB driver, which is not really using RDMA, so it's less optimal.
Edit: The syntax of UCX_NET_DEVICES for the RDMA case is <HCA>:<PORT>
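As a side note on finding the right device names: the ucx_info utility that ships with UCX lists the devices each transport sees (output details vary between UCX versions):
ucx_info -d
The verbs transports report devices in the <HCA>:<PORT> form (e.g. mlx4_0:1), while interfaces like ib0/ib1 appear only under the tcp transport.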
@yosefe I already tried running salloc --exclusive --hint=compute_bound -N2 mpirun -np 2 --mca pml ucx -x UCX_NET_DEVICES=mlx4_0:1 ./test.o, but the wrong behavior does not change.
AFAIU the problem is MPI_Get does not behave as a one-sided operation when running across nodes?
When using UCX , can you pls add -x UCX_LOG_LEVEL=info
to mpirun and post the full command line and output?
the problem is MPI_Get does not behave as a one-sided operation when running across nodes?
Exactly.
When using UCX , can you pls add -x UCX_LOG_LEVEL=info to mpirun and post the full command line and output?
$ salloc --exclusive --hint=compute_bound -N2 mpirun --mca pml ucx -x UCX_NET_DEVICES=mlx4_0:1 -x UCX_LOG_LEVEL=info ./test.o
salloc: Granted job allocation 10816
[1635793950.538248] [r1i3n1:23318:0] ucp_worker.c:1627 UCX INFO ep_cfg[0]: tag(self/memory rc_verbs/mlx4_0:1);
[1635793950.538560] [r1i3n0:23335:0] ucp_worker.c:1627 UCX INFO ep_cfg[0]: tag(self/memory rc_verbs/mlx4_0:1);
[1635793950.574017] [r1i3n1:23318:0] ucp_worker.c:1627 UCX INFO ep_cfg[1]: tag(rc_verbs/mlx4_0:1);
[1635793950.574624] [r1i3n0:23335:1] ucp_worker.c:1627 UCX INFO ep_cfg[1]: tag(rc_verbs/mlx4_0:1);
[1635793950.581761] [r1i3n1:23318:0] ucp_worker.c:1627 UCX INFO ep_cfg[0]: rma(rc_verbs/mlx4_0:1);
[1635793950.581751] [r1i3n0:23335:0] ucp_worker.c:1627 UCX INFO ep_cfg[0]: rma(self/memory posix/memory sysv/memory);
[1635793950.581931] [r1i3n0:23335:0] ucp_worker.c:1627 UCX INFO ep_cfg[1]: rma(rc_verbs/mlx4_0:1);
[1635793950.602258] [r1i3n1:23318:0] ucp_worker.c:1627 UCX INFO ep_cfg[1]: rma(self/memory posix/memory sysv/memory);
ID 1 Time 6.004446
ID 0 Time 6.004445
salloc: Relinquishing job allocation 10816
According to this log, the RDMA transport IS selected in UCX. @jotabf, for the sake of the experiment, is it possible to remove MPI_Win_lock/unlock from the loop, to see if the issue is in MPI_Get or in the lock/unlock?
@yosefe I do not know if there is a way to just remove the lock/unlock. However, I tried using MPI_Win_fence in its place and it did not work.
@jotabf Removing the lock is trivial: move the MPI_Win_lock
and MPI_Win_unlock
out of the loop and use MPI_Rget
+MPI_Wait
inside the loop. No need to have the locks in each iteration...
Here is the new code:
#include <iostream>
#include <pthread.h>
#include <unistd.h>
#include <mpi.h>
#include <math.h>
#include <omp.h>
int main(int argc, char *argv[])
{
int id, comm_sz;
int prov;
MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &prov);
MPI_Comm_rank(MPI_COMM_WORLD, &id);
MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
int *number;
int get_number;
int next = (id+1)%comm_sz;
MPI_Status status;
MPI_Request request;
double t;
MPI_Win the_window;
MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &number, &the_window);
t = MPI_Wtime();
MPI_Win_lock(MPI_LOCK_SHARED, next, 0, the_window);
for (int i = 0; i < 3; i++) {
sleep(id+1);
MPI_Rget(&get_number, 1, MPI_INT, next, 0, 1, MPI_INT, the_window, &request);
MPI_Wait(&request, &status);
}
MPI_Win_unlock(next, the_window);
printf("ID %i Time %lf\n", id, MPI_Wtime()-t);
MPI_Win_free(&the_window);
MPI_Finalize();
return 0;
}
But the result stays the same.
$ salloc --exclusive --hint=compute_bound -N2 mpirun --mca pml ucx -x UCX_NET_DEVICES=mlx4_0:1 ./test2.o
salloc: Granted job allocation 10832
ID 0 Time 6.003328
ID 1 Time 6.003370
salloc: Relinquishing job allocation 10832
The issue is that MPI_Rget issues an atomic operation in https://github.com/open-mpi/ompi/blob/d4219f814449846d0e29672101740cdcc8a89b0f/ompi/mca/osc/ucx/osc_ucx_comm.c#L970 but hardware atomic operations are not enabled in UCX on ConnectX-3, because ConnectX-3 does not support all required atomic operations (such as 32-bit atomics)
@janjust why is the atomic operation needed in MPI_Rget/MPI_Get?
This should work with osc/rdma as an alternative until Mellanox can fix osc/ucx. You need to specifically enable the memory domain with btl/uct for it to work.
Hello everyone, thanks for the help.
A few days ago, I managed to solve it using the following command:
$ salloc --exclusive --hint=compute_bound -N2 mpirun --mca btl_openib_allow_ib 1 --mca btl_openib_if_include mlx4_0:1 --mca osc rdma --mca orte_base_help_aggregate 0 ./test.o
It was not necessary to use UCX.
@yosefe Regarding your question (why is the atomic operation needed in MPI_Rget/MPI_Get?): I believe it is used to get a handle for the request to test/wait on. I added a comment a while back, one line above, asking whether ucp_worker_flush_nb could be used instead. Maybe that is the cleaner way to do it?
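For what it's worth, here is a minimal sketch of the same ring test written without request handles at all: a plain MPI_Get completed by MPI_Win_flush_local inside the lock epoch. Whether this path avoids the fetch-and-op mentioned above inside osc/ucx is an assumption, not something verified here.
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>
/* Sketch only: same ring pattern as the test codes above, but the get is
 * completed with MPI_Win_flush_local instead of MPI_Rget + MPI_Wait, so no
 * MPI_Request handle is ever created. */
int main(int argc, char *argv[])
{
    int id, comm_sz;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
    int get_number;
    int next = (id + 1) % comm_sz;
    int *window_buffer;
    MPI_Win window;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD,
                     &window_buffer, &window);
    double t = MPI_Wtime();
    MPI_Win_lock(MPI_LOCK_SHARED, next, 0, window);
    for (int i = 0; i < 3; i++) {
        sleep(id + 1);
        MPI_Get(&get_number, 1, MPI_INT, next, 0, 1, MPI_INT, window);
        MPI_Win_flush_local(next, window); /* completes the MPI_Get locally */
    }
    MPI_Win_unlock(next, window);
    printf("ID %i Time %lf\n", id, MPI_Wtime() - t);
    MPI_Win_free(&window);
    MPI_Finalize();
    return 0;
}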
@devreal, PR #10554 does not solve my problem. I checked out and installed your branch osc-ucx-rputget-flushnb-v5.0.x, but it did not work.
I was reviewing my tests and I think my problem is not MPI_Rget(). I created a new, simpler test code to show this.
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <mpi.h>
int main(int argc, char **argv)
{
int size, rank;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Aint *value;
MPI_Win win;
MPI_Win_allocate(sizeof(MPI_Aint), sizeof(MPI_Aint), MPI_INFO_NULL, MPI_COMM_WORLD, &(value), &(win));
srand(time(NULL) + rank);
int t_sleep = rand()%10 + 5;
t_sleep = 10*rank + 1;
for (int i = 0; i < 10; i++)
{
printf("ID %i Start TASK %i\n", rank, i);
sleep(t_sleep);
printf("ID %i Finished TASK %i\n", rank, i);
MPI_Win_lock(MPI_LOCK_SHARED, (rank+1)%size, 0, win);
printf("ID %i MPI_Win_lock after TASK %i\n", rank, i);
MPI_Win_unlock((rank+1)%size, win);
}
MPI_Win_free(&win);
MPI_Finalize();
return 0;
}
Running this code with 2 processes, I expect process 0 to finish much faster than process 1. However, this does not happen when I use --mca osc ucx (it works with --mca osc rdma). In this case, you can see that the mere fact that I am using MPI_Win_lock() generates synchronization that should not happen.
@jotabf I believe this comes back down to what @yosefe said in https://github.com/open-mpi/ompi/issues/9580#issuecomment-962264701 (please correct me if I'm wrong here): your network does not support 64-bit atomic operations, and the lock/unlock operations use 64-bit atomics. I'm not really sure why that would work better with osc/rdma; from what I can see, it uses 64-bit operations too.
@devreal, I was checking and, in fact, the network adapter (mlx4) that I am using seems not to have this capability. However, I remain confused about why the code works with rdma but not with ucx.
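For reference, one way to check this from the command line is the verbs utility ibv_devinfo, whose verbose output reports the adapter's atomic_cap field (a general note about the verbs tools, not something verified on this particular system):
ibv_devinfo -v | grep -i atomic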
@jotabf is the failure in the same location? It's possible that we have other hw atomic operations in the code without sw fallbacks
@janjust Yes, in both cases I used the same environment.
Running with mpirun --mca osc rdma -mca btl_openib_allow_ib 1 -mca btl_openib_if_include mlx4_0:1 ./bin/sync.o
ID 0 Start TASK 0
ID 0 Finished TASK 0
ID 1 Start TASK 0
ID 0 MPI_Win_lock after TASK 0
ID 0 Start TASK 1
ID 0 Finished TASK 1
ID 0 MPI_Win_lock after TASK 1
ID 0 Start TASK 2
ID 0 Finished TASK 2
ID 0 MPI_Win_lock after TASK 2
ID 0 Start TASK 3
ID 0 Finished TASK 3
ID 0 MPI_Win_lock after TASK 3
ID 0 Start TASK 4
ID 0 Finished TASK 4
ID 0 MPI_Win_lock after TASK 4
ID 0 Start TASK 5
ID 0 Finished TASK 5
ID 0 MPI_Win_lock after TASK 5
ID 0 Start TASK 6
ID 0 Finished TASK 6
ID 0 MPI_Win_lock after TASK 6
ID 0 Start TASK 7
ID 0 Finished TASK 7
ID 0 MPI_Win_lock after TASK 7
ID 0 Start TASK 8
[...]
Running with mpirun -mca osc ucx -mca btl_openib_allow_ib 1 -mca btl_openib_if_include mlx4_0:1 --mca pml ucx -x UCX_NET_DEVICES=mlx4_0:1 ./bin/sync.o
ID 1 Start TASK 0
ID 0 Start TASK 0
ID 0 Finished TASK 0
ID 1 Finished TASK 0
ID 1 MPI_Win_lock after TASK 0
ID 0 MPI_Win_lock after TASK 0
ID 1 Start TASK 1
ID 0 Start TASK 1
ID 0 Finished TASK 1
ID 0 MPI_Win_lock after TASK 1
ID 0 Start TASK 2
ID 1 Finished TASK 1
ID 1 MPI_Win_lock after TASK 1
ID 1 Start TASK 2
ID 0 Finished TASK 2
[...]