Configuring OpenMPI to Use a Specific TCP Port for Communication Between Nodes
Hello everyone,
I have a problem. Let's assume we have two nodes with OpenMPI installed on them. We want to run a parallel program such that the first node sends data to the second node through a specific TCP port of my choosing. It's worth mentioning that I have already tried the --mca flag with the parameters btl_tcp_port_min_v4, btl_tcp_port_range_v4, and oob_tcp_port_range_v4, but it didn't yield any results.
It seems that OpenMPI still uses random ports for communication between the source and destination processes. I would greatly appreciate it if someone could suggest a solution or workaround to configure OpenMPI to use a specific TCP port for communication between nodes.
Thank you in advance for your assistance!
The only thing you control via these MCA parameters is the port for the listening fd, all other ports are decided by the kernel.
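To illustrate the distinction, here is a minimal sketch (not Open MPI code; port 50000 is just a placeholder, and error checking is omitted). The listening socket can be bound to a port chosen by the application, which is what the port-range parameters mentioned above influence, but when the connecting side calls connect() without an explicit bind(), the kernel picks the source port from its ephemeral range.

```c
/* Sketch only -- not Open MPI source. Shows why the listening port can be
 * pinned while the connecting side's source port stays kernel-chosen. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* Listening side: the application chooses the port explicitly. */
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in srv = { .sin_family = AF_INET,
                               .sin_addr.s_addr = htonl(INADDR_LOOPBACK),
                               .sin_port = htons(50000) };   /* chosen port */
    bind(lfd, (struct sockaddr *)&srv, sizeof(srv));
    listen(lfd, 1);

    /* Connecting side: no bind() before connect(), so the kernel assigns the
     * source port from its ephemeral range -- nothing here controls it. */
    int cfd = socket(AF_INET, SOCK_STREAM, 0);
    connect(cfd, (struct sockaddr *)&srv, sizeof(srv));

    struct sockaddr_in local;
    socklen_t len = sizeof(local);
    getsockname(cfd, (struct sockaddr *)&local, &len);
    printf("kernel-assigned source port: %d\n", ntohs(local.sin_port));

    close(cfd);
    close(lfd);
    return 0;
}
```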
Thank you for your response. Is there any way to control the ports used by the kernel?
I think what @bosilca is saying is that Open MPI's MCA params only affect its listening ports -- not its connecting ports.
What's your use case -- do you have a need for restricting the connecting ports?
I have developed a simple parallel program for sending and receiving using the partitioned communication technique. I want to route the TCP packets generated by OpenMPI to a network processor simulator. This simulator listens on a specific port, which means I need to send the packets over a designated port, and for that reason I need to know which port OpenMPI will use before executing mpirun. What solution do you suggest?
I have found another approach that could resolve my issue. If I could capture and save the TCP packets generated by OpenMPI in tcpdump (pcap) format before they reach the kernel, that would address my problem. Do you have any suggestions for this scenario? Please note that a capture file generated by Wireshark does not solve my problem either.
It sounds like you are filtering packets based on the source TCP port (which is random). Can you filter based on the target TCP port, instead?
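For example, a capture filter keyed on the destination port is straightforward. A minimal libpcap sketch is below; the interface name "eth0" and port 50000 are placeholders, not values chosen by Open MPI.

```c
/* Minimal libpcap sketch (not Open MPI code): capture only packets addressed
 * to a known destination port. "eth0" and port 50000 are placeholders. */
#include <pcap/pcap.h>
#include <stdio.h>

static void handle_packet(u_char *user, const struct pcap_pkthdr *hdr,
                          const u_char *bytes)
{
    (void)user; (void)bytes;
    printf("captured %u bytes\n", hdr->caplen);
}

int main(void)
{
    char errbuf[PCAP_ERRBUF_SIZE];
    pcap_t *handle = pcap_open_live("eth0", 65535, 0, 1000, errbuf);
    if (handle == NULL) {
        fprintf(stderr, "pcap_open_live: %s\n", errbuf);
        return 1;
    }

    /* Filter on the *destination* port: the peer's listening port is known,
     * even though each connection's source port is kernel-assigned. */
    struct bpf_program prog;
    if (pcap_compile(handle, &prog, "tcp dst port 50000", 1,
                     PCAP_NETMASK_UNKNOWN) != 0 ||
        pcap_setfilter(handle, &prog) != 0) {
        fprintf(stderr, "filter setup failed: %s\n", pcap_geterr(handle));
        return 1;
    }

    pcap_loop(handle, 10, handle_packet, NULL);  /* capture 10 packets, then stop */
    pcap_close(handle);
    return 0;
}
```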
In general, the TCP support in Open MPI is architected like many other TCP peer-to-peer applications: it opens sockets to peer IP address + port combinations. Since MPI is typically used in high-performance scenarios, it does not include proxy support (which would just introduce latency and performance loss), and therefore does not have the notion of sending packets to something other than the intended peer.
You might want to consider a general network redirection mechanism at something lower than userspace, or perhaps even in the network infrastructure itself. Such mechanisms, of course, would require administrator-level access.
Thank you for your response. Let me explain another possible solution to my issue. If I could save the TCP packets generated by OpenMPI in a standard format such as tcpdump before they reach the lower layers, my problem would be solved. What ideas do you have regarding this?
Open MPI does not currently do that (funnel packets off to be saved). As I mentioned, MPI is typically used in high-performance environments; copying packets to a secondary location just adds overhead.
That being said, Open MPI is open source. If you'd like to propose a patch for this kind of functionality, we'd be happy to consider it -- you'll want to look in the opal/mca/btl/tcp directory for MPI communications over TCP (there's other TCP communication, too, but that's more complicated). We'd want to ensure that this functionality does not add any performance overhead in the case where it is not used.
You provided excellent explanations, Jeff. It's great that you are open to adding new features. Enabling the ability to save the TCP packets generated by MPI would be really helpful for people like me who want to offload some of the MPI-related tasks onto the NIC. It would be great if you could implement this feature behind a flag, so that no additional overhead is added to OpenMPI unless the flag is enabled.
How would OMPI know what the packets look like? Moreover, I don't think there is a way to force a specific port on the socket returned from accept/connect. Also, OMPI does not know what the TCP packets look like (yes, we could forge one, but I don't see why we should). There are specialized tools that can do this job; please use them instead.
Moreover, I have read this thread multiple times and I am still puzzled by what exactly the goal of all this is. I understand that you want an MPI application to send data to a specific port, but I don't understand why. If the other side is not an OMPI application, it will not be able to understand our wire protocol; and if it is an OMPI application, then why not let it use whatever port has been assigned to it?
@bosilca even if I still do not understand the end goal here, IIRC it is possible to bind a socket to a specific port/IP before invoking connect() (my understanding is that we already bind to a specific IP, but with port 0). accept() does not change the port, so I am not sure what you meant by "accept/connect".
We already support a port range for "bind() before listen()" (see mca_btl_tcp_component_create_listen()), so I guess it would be technically feasible to do something similar for "bind() before connect()" (see mca_btl_tcp_endpoint_start_connect(), and the sketch below).
That being said, I do not think we can achieve anything better than providing a port range (i.e. there would be no deterministic way to know, before invoking mpirun, which ports will be used by which connection pair), so I am not sure whether that would help.
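To make that concrete, here is a hypothetical sketch of what "bind() before connect()" restricted to a caller-supplied port range could look like. This is not what mca_btl_tcp_endpoint_start_connect() does today; the function and parameter names are made up for illustration.

```c
/* Hypothetical sketch of "bind() before connect()" with a source-port range.
 * This is not Open MPI code; it only illustrates the idea discussed above. */
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Try to bind fd's source port within [port_min, port_min + range), then connect. */
static int connect_from_port_range(int fd, const struct sockaddr_in *peer,
                                   uint16_t port_min, uint16_t range)
{
    struct sockaddr_in local;
    memset(&local, 0, sizeof(local));
    local.sin_family = AF_INET;
    local.sin_addr.s_addr = htonl(INADDR_ANY);  /* or a specific interface address */

    for (uint16_t i = 0; i < range; i++) {
        local.sin_port = htons((uint16_t)(port_min + i));
        if (bind(fd, (struct sockaddr *)&local, sizeof(local)) == 0) {
            /* Source port pinned; now connect as usual. */
            return connect(fd, (const struct sockaddr *)peer, sizeof(*peer));
        }
        if (errno != EADDRINUSE) {
            return -1;  /* give up on unexpected bind() errors */
        }
        /* Port already taken: try the next one in the range. */
    }
    errno = EADDRNOTAVAIL;  /* whole range exhausted */
    return -1;
}
```

Even with something along these lines, the best one could guarantee is the range itself; as noted above, the exact port used by a given connection pair would still not be known before mpirun runs.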
I should know, I wrote a big chunk of that code :cold_sweat: What I don't understand is what more we can offer, given what we already provide, and for what goal what we have so far is not enough.
@AlirezaGhanavati I'll echo what my colleagues @bosilca and @ggouaillardet stated above. And also, to your point:
Enabling the ability to save the TCP packets generated by MPI would be really helpful for people like me who want to offload some of the MPI-related tasks onto the NIC.
I don't understand this statement. Open MPI supports many NIC offload methods and protocols -- most (all) of which are not based on TCP. Offloading TCP has been tried by many different vendors over the past ~20 years; it's extremely hard (because supporting ALL of TCP in hardware is both technically difficult and space-dependent if you want to scale the allowable number of simultaneous connections -- see the shortcomings of iWARP, for example). As such, typical HPC NIC offload mechanisms do not use TCP.
Per what was stated above, we don't have any intention of creating a mechanism to save what we would have sent across a TCP socket down into an alternate location (e.g., a file). But Open MPI is open source; if you want to propose a pull request with such functionality, we'll have a look at it. But, as I stated above, it cannot impact the performance of the existing TCP code. Additionally, as @bosilca mentioned, Open MPI just writes data across TCP sockets; we don't have any idea how the OS/lower layers packetize that or what the resulting TCP packets will look like.
It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.
Per the above comment, it has been a month with no reply on this issue. It looks like this issue has been abandoned.
I'm going to close this issue. If I'm wrong and this issue is not abandoned, please feel free to re-open it. Thank you!