Performance Issues with MPIRun Due to Virtual Network Interfaces
Background information
What version of Open MPI are you using?
v4.1.7rc1
Describe how Open MPI was installed
installed by MLNX_OFED
Please describe the system on which you are running
- Operating system/version: Ubuntu 22.04
- Computer hardware: Intel(R) Xeon(R) Platinum 8480+
- Network type: Ethernet and Mellanox
Details of the problem
I have a server whose single network card has been virtualized into more than 200 network interfaces. This causes significant delays when using mpirun: the process hangs for a long time at startup. Using UCX debug logging, I found that the delays occur primarily on the bridged network interfaces.
Is there a solution for this issue? Any recommendations on how to optimize or configure the network interfaces to improve the performance of mpirun? Thank you!
ucx log
[1742871925.548610] [pod-hpc-02:1702645:0] tcp_iface.c:945 UCX DEBUG filtered out bridge device virbr0
[1742872077.918760] [pod-hpc-02:1702645:0] tcp_iface.c:945 UCX DEBUG filtered out bridge device wlan
UCX_NET_DEVICES is your friend here. Set it to the interface you intend to use, and this issue should go away.
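For example, a minimal invocation might look like the following (the device names eth0 and mlx5_0:1 and the application name ./app are placeholders; substitute the interface you actually want UCX to use):

```
# Restrict UCX to a single TCP interface (placeholder name eth0):
mpirun -np 4 -x UCX_NET_DEVICES=eth0 ./app

# Or restrict it to an InfiniBand HCA port (placeholder name mlx5_0:1):
mpirun -np 4 -x UCX_NET_DEVICES=mlx5_0:1 ./app
```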
Thanks for your reply!
Actually, I had already set UCX_NET_DEVICES to another NIC (TCP), but it only seems to take effect after the two log lines above. I'm confused.
In the end, my tests took about 4 times as long to complete.
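For reference, one way to check which transports and devices UCX enumerates on a host (assuming the ucx_info utility shipped with the UCX installation is on the PATH) is:

```
# List the memory domains, transports, and devices UCX detects on this host
ucx_info -d | grep -i -e device -e transport
```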
I think I see the problem: uct_tcp_query_devices scans through all the interfaces, builds a list of active, non-bridged interfaces, and then trims it to the user-requested devices. On a system with hundreds of virtual interfaces this is a very costly process, as it involves many syscalls per interface.
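A rough way to confirm this on the affected host, using standard Linux tools rather than anything from UCX or Open MPI (eth0 and ./app are placeholders):

```
# How many interfaces does the scan have to walk?
ip -o link show | wc -l

# Count the per-interface syscalls made while a single rank starts up:
strace -c -f -e trace=ioctl,socket,sendmsg,recvmsg \
    mpirun -np 1 -x UCX_NET_DEVICES=eth0 ./app
```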
This is not something we can fix in OMPI; it should be reported and addressed directly in UCX. @janjust @yosefe should be able to help.