Buffer size issue with MPI_Isend+MPI_Wait / MPI_Irecv+MPI_Wait when transmitting huge data across computers on a slow network

Open pflee2002 opened this issue 7 years ago • 27 comments

Thank you for taking the time to submit an issue!

Background information

I tried my program on an HPC cluster with two computers. Master: 6 CPU cores and 64 GB of memory; Slave: 20 cores and 192 GB of memory. They are connected via regular Ethernet.

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

v3.1.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, etc.)

from source code

Please describe the system on which you are running

Ubuntu 18.04 x86_64

Details of the problem

My issue is that MPI_Isend+MPI_Wait versus MPI_Irecv+MPI_Wait is not reliable when the amount of data exchanged between the computers is huge.

The same code runs well on small examples but randomly halts mid-iteration on big examples (many gigabytes of data). I tracked down the reason: the ending message after each iteration is sent out normally, but sometimes the receivers never receive it. It must have been lost in transmission.

Compared with the massive data exchange between the two computers, the Ethernet has only about 10 M of bandwidth, so it can be seen (using "top" in Ubuntu) that the buff/cache is very large on both computers.

Do you think this is a buffer-size limit for MPI_Isend/MPI_Irecv? Is there a way to increase the MPI_Isend buffer? Since the data to be sent is much more than the Ethernet network can handle, I guess it must be queued in buffers (on both the sender and receiver sides) for a long time, and data could be lost there. This is only my speculation, but I hope someone can provide some insight.

pflee2002 avatar Nov 16 '18 22:11 pflee2002

Can you provide a small C program that shows the problem?

If you are using the TCP BTL, TCP is a lossless, connection-oriented transport. It should not drop messages, no matter the size (without notifying Open MPI, at least -- in which case Open MPI would throw an error and abort, not hang).

jsquyres avatar Nov 16 '18 22:11 jsquyres

For instance, two processes are sending messages to each other (tons of other types of messages have already been sent and queued in either the sending or the receiving buffers). Then one process sends an (ending) message to a target process like this:

            three_int data992;
            data992.type_id=992;
            data992.int_value=1;
            hpc_node_id=(int)floor(hpc_id_rank/shmcomm_size);
            for (int s=0;s<g_number_of_hpc_nodes;s++) 
            {
                if(s!=hpc_node_id)
                {
                   int target_world_id=shmcomm_size*s+hpc_local_rank[0]; //this will be the target process on a different hpc node
                   MPI_Request rq;
                   MPI_Isend(&data992, 1, MPI_ACTION_TYPE, target_world_id, 992,MPI_COMM_WORLD,&rq);
                   //printf("992 is sent from %d to %d\n",hpc_id_rank,target_world_id);
                   MPI_Wait(&rq, MPI_STATUS_IGNORE);
                }
            }

Then two processes begin to receive:

            int flag=1;
            int end_msg2_counter=0;
            while(1)
            {
                if(g_number_of_hpc_nodes==1)//single hpc node, no msg should be sent or received
                    break;
                MPI_Request rq;
                MPI_Status status;
                MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status );//to probe if there is a message sent to me!
                if (flag==0) //no message comes
                    continue;
                three_int received_data;
                MPI_Irecv(&received_data, 1, MPI_ACTION_TYPE, status.MPI_SOURCE, status.MPI_TAG, MPI_COMM_WORLD,&rq);
                MPI_Wait(&rq, MPI_STATUS_IGNORE);
            }

(If the type_id is 992, then this iteration ends, but randomly one process never receives the 992 message and the whole program halts.)
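
For reference, here is a minimal sketch of how such a receive loop could terminate once every remote node has delivered its type-992 end message, assuming a single receiving thread. It reuses the names from the snippets above (three_int, MPI_ACTION_TYPE, g_number_of_hpc_nodes) but is only an illustration, not the original code:

            int end_msg_counter = 0;
            int expected_end_msgs = g_number_of_hpc_nodes - 1; // one 992 per remote node
            while (end_msg_counter < expected_end_msgs)
            {
                int flag = 0;
                MPI_Status status;
                MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
                if (flag == 0)          // nothing pending yet
                    continue;
                three_int received_data;
                MPI_Recv(&received_data, 1, MPI_ACTION_TYPE, status.MPI_SOURCE,
                         status.MPI_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                if (received_data.type_id == 992)
                    end_msg_counter++;  // count end messages until all remotes have reported
                // ... otherwise handle a regular message here ...
            }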

pflee2002 avatar Nov 16 '18 22:11 pflee2002

Dear jsquyres: Could you please be more specific on how to use the "TCP BTL"? I've been using the default configuration Open MPI provides.

Thank you very much.

pflee2002 avatar Nov 16 '18 22:11 pflee2002

BTW, each message is very small, but a huge number of them are sent simultaneously through multiple threads (using OpenMP). Thank you very much.

pflee2002 avatar Nov 16 '18 22:11 pflee2002

If you could send a small, self-contained, complete C program that shows the problem, that would be most helpful.

In your case, the TCP BTL is likely the default in your setup -- unless you have some other HPC-class network/NIC...?

The "BTL" is one of Open MPI's transport plugins (Byte Transfer Layer).

You mention OpenMP. If you are using MPI from multiple threads simultaneously, are you initializing your MPI program to MPI_THREAD_MULTIPLE?
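
For reference, a minimal sketch of requesting and checking MPI_THREAD_MULTIPLE at startup (illustrative only, not taken from the program in question):

            #include <mpi.h>
            #include <stdio.h>

            int main(int argc, char *argv[])
            {
                int provided = MPI_THREAD_SINGLE;
                MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
                if (provided < MPI_THREAD_MULTIPLE) {
                    // Making MPI calls from several OpenMP threads is unsafe here.
                    fprintf(stderr, "MPI_THREAD_MULTIPLE not provided (got %d)\n", provided);
                    MPI_Abort(MPI_COMM_WORLD, 1);
                }
                // ... OpenMP threads may now make concurrent MPI calls ...
                MPI_Finalize();
                return 0;
            }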

jsquyres avatar Nov 17 '18 00:11 jsquyres

I had to fix the code you pasted to make it readable. Please check that I didn't alter the logic of the code.

This issue is confusing. You start by claiming that there is too much data exchanged between the 2 nodes and that you want to increase the buffer size for MPI_Isend. Are you planning to aggregate messages to decrease their number (but increase their size)?

To echo Jeff's point, there is little chance OMPI loses data, even for multithreaded applications. However, due to the potentially high injection rate and the loosely synchronized processes (mainly because all messages are below the eager size and get sent directly), you might hit all the slowest paths in the communication protocol (basically all your messages become unexpected on the receiver side). If this is the case, your apparent deadlock might simply be an extreme slowdown. To check, you can print something every X messages to see whether you deadlock or merely slow down.

Another thing you can try is to transform all your sends into synchronous sends. This puts a handshake on each communication, preventing the buildup of unexpected messages on the receiver side.
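
As a rough sketch of that suggestion, applied to the sending snippet from the first comment, the only change would be swapping MPI_Isend for MPI_Issend (data992, MPI_ACTION_TYPE, and target_world_id are the names from that snippet):

            MPI_Request rq;
            // MPI_Issend completes only after the receiver has started matching
            // the message, so each MPI_Wait acts as a per-message handshake.
            MPI_Issend(&data992, 1, MPI_ACTION_TYPE, target_world_id, 992,
                       MPI_COMM_WORLD, &rq);
            MPI_Wait(&rq, MPI_STATUS_IGNORE);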

bosilca avatar Nov 17 '18 00:11 bosilca

jsquyres, thanks for the reply. As I said, it works well on small examples but randomly halts due to message loss when the data exchange becomes huge. I do use MPI_THREAD_MULTIPLE to start this Open MPI + OpenMP program.

pflee2002 avatar Nov 17 '18 11:11 pflee2002

bosilca, thanks. I was wondering whether Open MPI has a limit on the sending and receiving buffers for all the queued messages, given the computer's RAM? I can't change the number or size of the Isend and Irecv messages. As for the waiting: I noticed that once it halted, it never moved again (I waited 10 hours once).

pflee2002 avatar Nov 17 '18 12:11 pflee2002

@pflee2002 The best thing would be if you could send us a small C reproducer of your problem. I.e., a small, self-contained C program (i.e., with a main() and everything) that acts similarly to your real program and shows the problem.

You can check how much free RAM is available on your computers by running the free command while your program is running. This will tell you whether a computer is running out of RAM / dipping into swap.

You might also want to check to make sure your program isn't simply deadlocking in your code. E.g., make sure that you are waiting on the message that you think you are waiting for, and that you haven't created a circular dependency between your MPI processes that means that everyone is waiting for something that will never happen (e.g., A is waiting for B, but B is waiting for A -- so no one ever does anything). MPI has rules about ordering and deadlocking -- be sure to see MPI-3.1 examples 3.7 and 3.8, for example. I know you're using Isend/Irecv, but that doesn't make you immune to application-level deadlocks (particularly when dealing with multi-threaded code and inadvertently making assumptions about message ordering).
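
As an illustration of that kind of ordering trap, here is a minimal sketch of the usual safe-exchange pattern between two ranks (peer, tag, count, and the buffers are placeholders, not names from the reporter's code):

            // Post the receive before the send and complete both together,
            // so neither rank can block forever waiting on the other.
            MPI_Request reqs[2];
            MPI_Irecv(recv_buf, count, MPI_INT, peer, tag, MPI_COMM_WORLD, &reqs[0]);
            MPI_Isend(send_buf, count, MPI_INT, peer, tag, MPI_COMM_WORLD, &reqs[1]);
            MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);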

jsquyres avatar Nov 17 '18 14:11 jsquyres

jsquyres, after I read those two examples, I do think it might be caused by buffer overflow. In my code, it is unknown how many messages will be sent from one process to another, so I can't use Send/Recv and must use Isend/Irecv. In each iteration, one process sends out an unknown number of messages and then begins receiving. The last (ending) message is unique, and by it the process knows it has received all messages and can go to the next iteration. I also tried MPI_Issend and MPI_Ibsend; neither worked.

Is there a way to increase the buffer size? Thank you very much.

pflee2002 avatar Nov 18 '18 03:11 pflee2002

Which buffer are we talking about here? If you can post a reproducer, it will simplify the entire discussion.

Based on the code you provided I created a simple application. Even if I have multiple threads per process doing the loop I can't reproduce a deadlock. So either the code has been overly simplified to fit here, or there is something more obscure going on. In all cases a reproducer will be absolutely critical to have.

bosilca avatar Nov 18 '18 19:11 bosilca

bosilca, thank you very much for these efforts. The buffer size I meant is the total RAM used for sending/receiving point-to-point messages. I assume the messages will be buffered (queued) on the two computers if the message sending rate is much higher than the Ethernet can handle. Given the massive number of MPI messages going from one computer to another, not all of them can be transmitted instantaneously, so they must sit in buffers (queues).

It is hard to reproduce in small examples because the issue only occurs when a very large problem is being solved; on smaller examples it works well. For instance, I wish to simulate 30 million agents in a network, and the system halts soon after it starts. After many replications, I noticed a pattern: the message loss occurs when a message is sent from the computer with the larger RAM to the computer with the smaller RAM. One computer has 192 GB of RAM and the other has 65 GB, and neither had gone into swap when it halted.

My speculation is that the queuing memory (which holds all unsent or unreceived messages) on the computers must have a limit, and my large examples happen to exceed it, given the per-second message sending rate relative to the slow connection. I don't know where that limit is.

Thanks

pflee2002 avatar Nov 18 '18 20:11 pflee2002

I will also try increasing the physical RAM on one computer to see whether this mitigates or totally solves the problem. If so, the magic "buffer limit for all waiting messages" would be real.

pflee2002 avatar Nov 18 '18 21:11 pflee2002

MPI does not define receive fairness between multiple peers. If your assumption above is right (i.e., some ranks are overwhelmed by incoming messages), then adding some flow control should help. As an example, changing some of the MPI_Isend calls to MPI_Issend (say, every 10th send) should fix the issue by limiting the number of pending messages per remote to 10.
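
A minimal sketch of that flow control, where SYNC_EVERY, msg, tag, and the per-destination counter sends_to_target are hypothetical names (MPI_ACTION_TYPE and target_world_id follow the earlier snippets):

            #define SYNC_EVERY 10    // at most ~10 unmatched messages per remote
            MPI_Request rq;
            if (++sends_to_target % SYNC_EVERY == 0)
                MPI_Issend(&msg, 1, MPI_ACTION_TYPE, target_world_id, tag,
                           MPI_COMM_WORLD, &rq);  // synchronous: waits for a match
            else
                MPI_Isend(&msg, 1, MPI_ACTION_TYPE, target_world_id, tag,
                          MPI_COMM_WORLD, &rq);
            MPI_Wait(&rq, MPI_STATUS_IGNORE);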

bosilca avatar Nov 19 '18 16:11 bosilca

@bosilca Thanks for the suggestion. I tried it (adding a few Issend calls from time to time during the simulation), but it did not seem to work; the program still got stuck after a while. I am still uncertain what exactly happened.

pflee2002 avatar Nov 19 '18 21:11 pflee2002

Then buffering is not the problem. I would suggest you start digging into your app's communication pattern.

bosilca avatar Nov 19 '18 22:11 bosilca

@bosilca Could you please explain why this change would rule out the buffering issue? Thanks.

pflee2002 avatar Nov 25 '18 02:11 pflee2002

A synchronous send blocks all future communications between two peers until the receiver receives the synchronous message. So, in an application where the per-round communication pattern follows a star (all send to one), using synchronous sends allows the unique receiver to drain its communications, and the maximum number of unexpected messages will be the number of peers times the interval between two consecutive synchronous messages.

I know this has been stated before but a small reproducer would really help.

bosilca avatar Nov 26 '18 04:11 bosilca

@bosilca After I configured the run script to deploy only one process on each computer, the halting symptom disappeared. The new configuration reduced the total RAM usage on each compute node by 50%, leaving enough memory for buffering without going into swap.

As such, my best bet is that this is related to the RAM size on the compute nodes. My speculation is that the low-level code of Open MPI must have some setting that can be violated only in extremely rare situations, and my case happens to be one of them. Could anyone enlighten me on this?

I guess the quickest solution is simply to add more RAM to one of my computers, though that does not look very intelligent :)

pflee2002 avatar Nov 26 '18 21:11 pflee2002

@pflee2002 Your messages are extremely small, so it seems unlikely that you run out of memory. Let's make a quick computation. Assume that OMPI allocates 1 KB for each message (an overestimate that accounts for the message itself, the memory for the internal fragment and request, and the various list items used to track the message). Each process would then need about 1M pending messages to fill 1 GB of memory, independent of the number of processes per node. Combine this with the amount of memory used by each process and the number of processes per node, and you can compute how many pending (and unmatched) messages can accumulate before you run out of memory.
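
For instance, a back-of-the-envelope check might look like the sketch below; the 1 KB per-message overhead and the ~64 GB of available RAM are just the assumptions from this thread, not measured values:

            #include <stdio.h>

            int main(void)
            {
                const double bytes_per_pending_msg = 1024.0;              // assumed ~1 KB overhead
                const double free_ram_bytes = 64.0 * 1024 * 1024 * 1024;  // assumed available RAM
                double max_pending = free_ram_bytes / bytes_per_pending_msg;
                printf("~%.0f million pending messages before RAM runs out\n",
                       max_pending / 1e6);                                // ~67 million here
                return 0;
            }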

It is also possible that we have a bug that only appears when the shared memory component is under heavy stress. But without a reproducer it will be difficult to find and assess.

bosilca avatar Nov 26 '18 22:11 bosilca

@pflee2002 Note that by "a small reproducer", we mean a relatively short program. That program can still send enormous numbers of potentially large messages. If your nodes are running into swapping issues while running, you may want to use some memory-checking tools (e.g., valgrind) to see where the memory usage is going. Is MPI consuming a large amount of memory (and growing/never freeing)? Or is there some other memory leak? Or ...?

jsquyres avatar Dec 01 '18 12:12 jsquyres

@jsquyres It seems the memory usage is stable after the initial memory allocation. In the meantime, I noticed a symptom: the following statement seems problematic.

MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status1); if (flag != 0) then MPI_Irecv ...

It came to my attention that sometimes status1 (of type MPI_Status) begins to report status1.MPI_SOURCE == -1.

The program (with multi-threaded Isend and single-threaded Irecv) can run for a while before status1 begins to return MPI_SOURCE = -1. I think this is possibly why some messages go missing.

I tried to add the following statement between the Iprobe and the Irecv; then the program reports a segmentation fault:

            int count=0;
            MPI_Get_count(&status1,MPI_ACTION_TYPE,&count);
            if(count!=1)
            {
                printf("line 1998: Msg for 3rd step is messed up!\n");
                continue;
            }

Could you provide some suggestions on this phenomenon? It seems the Isend message was corrupted in transit.

Thank you

pflee2002 avatar Dec 04 '18 22:12 pflee2002

It is not safe to use probe+irecv in a multithreaded application, especially if multiple threads drain the network. This is well documented, and the standard explains the potential issues in detail. You should use mprobe+mrecv instead.
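
For illustration, a minimal sketch of the matched-probe pattern, reusing three_int and MPI_ACTION_TYPE from the earlier snippets:

            // MPI_Mprobe returns a message handle, so the probed message can
            // only be received through MPI_Mrecv on that handle -- another
            // thread cannot match it in between.
            MPI_Message message;
            MPI_Status  status;
            MPI_Mprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &message, &status);
            three_int received_data;
            MPI_Mrecv(&received_data, 1, MPI_ACTION_TYPE, &message, &status);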

George

bosilca avatar Dec 04 '18 22:12 bosilca

@bosilca Thanks for the suggestions. I will investigate.

Taylor

pflee2002 avatar Dec 04 '18 22:12 pflee2002

George, I tried three options: (a) MPI_Improbe + MPI_Imrecv + MPI_Wait; (b) MPI_Improbe + MPI_Mrecv; (c) MPI_Mprobe + MPI_Mrecv.

Unfortunately, none of them worked for me. Below is the code for option (c).

            MPI_Mprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &message, &status);
            MPI_Mrecv(&received_data, 1, MPI_ACTION_TYPE, &message, &status);

I use a special type of message to tell the receivers to move forward, but it seems that message always goes missing after a while and halts the whole program.

Thanks

pflee2002 avatar Dec 05 '18 04:12 pflee2002

It looks like this issue has sat around for a while with no responses; sorry.

Can you send us a short (but complete) program that reproduces the problem (including the Mprobe / Mrecv that @bosilca suggested)?

jsquyres avatar Mar 13 '19 20:03 jsquyres

It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.

github-actions[bot] avatar Feb 16 '24 21:02 github-actions[bot]

Per the above comment, it has been a month with no reply on this issue. It looks like this issue has been abandoned.

I'm going to close this issue. If I'm wrong and this issue is not abandoned, please feel free to re-open it. Thank you!

github-actions[bot] avatar Mar 01 '24 21:03 github-actions[bot]