mpich=4.1.1 is slower than mpich=4.0.1
Hi there,
I am a developer of Modin and unidist. We use MPI as a backend in unidist, and mpich can be used as one of the implementations in that backend. We measured both mpich=4.0.1 and mpich=4.1.1 built from source, and there is a noticeable slowdown in mpich=4.1.1 compared to mpich=4.0.1.
The original benchmarks, as well as Modin and unidist, are in Python. I don't have much time to provide a reproducer written in C, but I wonder if anyone is aware of a performance regression in mpich=4.1.1 relative to mpich=4.0.1?
Thanks in advance.
What is your execution environment? Single node or multinode? If multinode, what kind of interconnect? Do you know what the MPICH configuration is for the library you are using?
Also, please share the measurement results. If you have a reproducer in Python with Modin and unidist, please share them as well.
> What is your execution environment? Single node or multinode? If multinode, what kind of interconnect? Do you know what the MPICH configuration is for the library you are using?
This is a single node case (see the output of cat /etc/os-release and lscpu below). During the measurements we used mpich through mpi4py. What kind of MPICH configuration do you mean?
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.1 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 112
On-line CPU(s) list: 0-111
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 6238R CPU @ 2.20GHz
CPU family: 6
Model: 85
Thread(s) per core: 2
Core(s) per socket: 28
Socket(s): 2
Stepping: 7
CPU max MHz: 4000.0000
CPU min MHz: 1000.0000
BogoMIPS: 4400.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 1.8 MiB (56 instances)
L1i: 1.8 MiB (56 instances)
L2: 56 MiB (56 instances)
L3: 77 MiB (2 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-27,56-83
NUMA node1 CPU(s): 28-55,84-111
Vulnerabilities:
Itlb multihit: KVM: Mitigation: VMX disabled
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Retbleed: Mitigation; Enhanced IBRS
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Srbds: Not affected
Tsx async abort: Mitigation; TSX disabled
> Also, please share the measurement results. If you have a reproducer in Python with Modin and unidist, please share them as well.
I am not sure that the reproducers in Python would be of much help to you because the calls to MPI routines are hidden deep inside unidist. One example is the census benchmark, where we use Modin for data processing and Modin in turn uses unidist on MPI to distribute computation. The measurements for 30 worker processes are below. In unidist we mostly use the following MPI routines to send/recv messages between processes: send, isend, Send, Isend, recv, irecv, Recv. The full list of communication routines can be found at https://github.com/modin-project/unidist/blob/master/unidist/core/backends/mpi/core/communication.py.
| time (s) | mpich=4.0.1 | mpich=4.1.1 |
|---|---|---|
| etl | 30.44447 | 70.76703 |
| read | 33.84483 | 34.33605 |
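To make the routine names above concrete, here is a minimal hand-written mpi4py sketch (this is illustration only, not code taken from unidist; NumPy is used just to have a concrete buffer) showing the lowercase pickle-based calls versus the uppercase buffer-based calls:

```python
# Hand-written illustration (not unidist code) of the point-to-point
# routines mentioned above: send/isend/recv (pickle-based) and
# Send/Recv (buffer-based). Run with: mpiexec -n 2 python example.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Pickle-based API for arbitrary Python objects.
    req = comm.isend({"op": "example"}, dest=1, tag=11)
    req.wait()
    # Buffer-based API for contiguous data.
    data = np.arange(1000, dtype="d")
    comm.Send([data, MPI.DOUBLE], dest=1, tag=22)
elif rank == 1:
    obj = comm.recv(source=0, tag=11)
    buf = np.empty(1000, dtype="d")
    comm.Recv([buf, MPI.DOUBLE], source=0, tag=22)
```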
@YarShev Is the test multi-threaded (i.e. using MPI_THREAD_MULTIPLE)?
Yes, the thread level is MPI_THREAD_MULTIPLE.
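For anyone reproducing this, the thread level actually granted by the MPI library can be verified from mpi4py (which requests MPI_THREAD_MULTIPLE at initialization by default); a minimal check might look like this:

```python
# Verify the thread support level granted by the MPI library.
from mpi4py import MPI

granted = MPI.Query_thread()
if granted == MPI.THREAD_MULTIPLE:
    print("MPI_THREAD_MULTIPLE is available")
else:
    print("granted thread level:", granted)
```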
Try configuring mpich with --enable-thread-cs=global and see if you can recover the performance. One change we made from 4.0 to 4.1 is switching from a global lock to per-VCI locks. Per-VCI locks enable some advanced features, but some applications may see worse performance due to the more granular locking.
I configured mpich with the following command and the performance didn't recover.
./configure --prefix=<MY_PATH> --enable-thread-cs=global
Btw, is there a guide on how to configure mpich to be as efficient as possible?
Indeed, there are performance issues with the newest version.
I'm doing some performance tests for a paper. Two of the test cases are: an application run on a single host (mpirun -np [N] -hosts 127.0.0.1) and the same application run on all available cluster hosts (mpirun -np [N] -hostfile file_with_all_hosts). The "cluster" is actually a group of Docker containers, and a single container simulates a single cluster node. All containers run on a single computer, but each one has its own virtual network interface. It appears that when I run the application on multiple hosts, the execution time is randomly nearly twice as long as in the localhost case. You might say this is probably due to my application's extensive use of network resources. However, this does not happen with the OpenMPI implementation, where the network communication overhead appears to be negligible.
Whenever the execution time increases, one can also notice a huge increase in the time spent in the OS kernel (I mean the sys value reported by the bash time builtin). I used valgrind's callgrind tool to profile my application because I suspected a hotspot in my own code. However, the callgrind logs point to MPICH library functions.
There may be a problem with MPIR_Wait_state(). If a profiled application happens to run slower, you can notice an enormous increase in the number of MPIDI_progress_test() calls issued by MPIR_Wait_state(). Unfortunately I am unable to analyze the MPICH source code properly, but it seems that some kind of polling takes place there (checking whether a request has completed?), and for some unknown reason the request occasionally takes much longer to finish (as if there were no preemption?).
The problem is gone after downgrading to 3.4.3.
cc @raffenet, @hzhou
@pikrog Could you try setting FI_PROVIDER=tcp? In any case, your issue is different from the OP's; if it persists, please open a new issue for better tracking.
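As a side note for anyone trying this from a Python/mpi4py application as in the original setup of this thread, one way is to set the variable before MPI is initialized (it can equally be exported in the shell before mpirun); it only has an effect when MPICH uses the OFI (libfabric) netmod:

```python
# Select the libfabric "tcp" provider before MPI initializes.
# Only relevant when MPICH is built with the OFI (libfabric) netmod.
import os
os.environ["FI_PROVIDER"] = "tcp"

# mpi4py initializes MPI at import time by default, so the variable
# must be set before this import.
from mpi4py import MPI

print(MPI.Get_library_version())
```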
cc @raffenet, @hzhou
Sorry for the delay in response. I haven't had a chance to reproduce your issue yet. I am surprised that the global critical section didn't make a difference, since that is the main change between 4.0 and 4.1 as far as send/recv goes.
@hzhou, could you also answer this question?
> Btw, is there a guide on how to configure mpich to be as efficient as possible?
> @hzhou, could you also answer this question?
> Btw, is there a guide on how to configure mpich to be as efficient as possible?
There are indeed many knobs one can tune, but the right settings always depend on the specific system and the characteristics of the application. Any settings that we identify as generally beneficial have already been made the default. You can always go through the wiki pages (https://github.com/pmodels/mpich/blob/main/doc/wiki/Index.md) for more background.
@hzhou, got it, thanks! I will take a look.
For single node shared memory performance, switching from OFI to UCX likely makes a huge difference. Shared-memory support in OFI is behind UCX. I don't know the details, but have observed the difference a number of times.
This should be independent of MPICH version and thus does not explain the issue here.
> I am not sure that the reproducers in Python would be of much help to you because the calls to MPI routines are hidden deep inside unidist.
@YarShev one of the nice things about the PMPI profiling interface is that it allows one to capture complete details of every MPI call, regardless of where in the stack it happens. I don't know if one exists already, but something equivalent to MKL_VERBOSE (dump all calls, their arguments, and timings) can be written using https://github.com/LLNL/wrap in a few minutes.
@jeffhammond,
> For single node shared memory performance, switching from OFI to UCX likely makes a huge difference. Shared-memory support in OFI is behind UCX. I don't know the details, but have observed the difference a number of times.
I also see a difference on a benchmark, but it is not severalfold (only 3-4 percent).
> @YarShev one of the nice things about the PMPI profiling interface is that it allows one to capture complete details of every MPI call, regardless of where in the stack it happens. I don't know if one exists already, but something equivalent to MKL_VERBOSE (dump all calls, their arguments, and timings) can be written using https://github.com/LLNL/wrap in a few minutes.
Thanks for letting us know about this.