mpich
libmpi size unusually large when compiling with cuda support
I work on a project that supports different architectures, so we've built 3 separate versions of MPICH 4.1.1 (cuda11.2, rocm5.4.3, and cpu-only, each built against ucx1.14.1).
The cpu-only and the rocm5 libraries are around 50MB each. The CUDA version, on the other hand, is close to 1.8GB! There doesn't seem to be anything out of the ordinary with the builds, and all versions work as expected. However, this really complicates packaging and distribution: since we provide both the static archive and the shared library, I am looking at more than 3GB for MPICH alone.
This is the output of du -sh performed on the install directories:
First the rocm5 build:
$ du -sh LINUX_gcc9.3_glibc2.28_rocm5.4.3_ucx1.14.1/lib/*
(...)
53M LINUX_gcc9.3_glibc2.28_rocm5.4.3_ucx1.14.1/lib/libmpi.a
40M LINUX_gcc9.3_glibc2.28_rocm5.4.3_ucx1.14.1/lib/libmpi.so.12.3.0
And then the cuda build:
$ du -sh LINUX_gcc9.3_glibc2.17_cuda11.2_ucx1.14.1/lib/*
(...)
1.8G LINUX_gcc9.3_glibc2.17_cuda11.2_ucx1.14.1/lib/libmpi.a
1.7G LINUX_gcc9.3_glibc2.17_cuda11.2_ucx1.14.1/lib/libmpi.so.12.3.0
I have been investigating this. I see that the PTX code generation happens for compute capability 5.2 through 8.6, which may be excessive for most deployments, particularly since MPICH is less likely to be used on Tegra systems.
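For reference, one way to see which targets actually end up in the library (assuming the CUDA toolkit's cuobjdump is on the PATH; the output format varies by toolkit version) is:
$ cuobjdump --list-ptx LINUX_gcc9.3_glibc2.17_cuda11.2_ucx1.14.1/lib/libmpi.so.12.3.0
$ cuobjdump --list-elf LINUX_gcc9.3_glibc2.17_cuda11.2_ucx1.14.1/lib/libmpi.so.12.3.0
The first lists the embedded PTX targets, the second the embedded cubins (SASS).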
I don't know how to pass it directly through the MPICH configure invocation, but Yaksa's configure has a --with-cuda-sm= option that seems to default to all. It would probably be better off defaulting to the NVCC equivalent of -arch=all-major, unless someone has evidence that the minor-version architectural optimizations are significant for Yaksa.
Given that data-center deployments of MPICH are essentially always on GPUs with compute capability 6.0 (P100, rare these days), 7.0 (V100), or 8.0 (A100), with 9.0 (H100) coming online now, the motivation for including specializations for Tegra and GTX/RTX parts is not strong.
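For context, newer NVCC releases (11.5 and later, if I recall correctly) accept -arch=all and -arch=all-major directly; all-major compiles only for the major (sm_X0) real architectures instead of every minor variant, which is roughly the default being suggested for Yaksa here. A minimal illustration, where kernel.cu is just a placeholder source file:
$ nvcc -arch=all-major -c kernel.cu -o kernel.o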
Running ./autogen.sh -yaksa-depth=N will solve this. Below are the sizes for the default (N=3) along with N=1 and N=2. The size of the library is approximately 8e6 * 6.105**N bytes, so there is a pretty strong incentive to reduce the Yaksa depth.
$ ll mpich-cuda-install/lib/libmpi.a mpich-yaksa-1-install/lib/libmpi.a mpich-yaksa-2-install/lib/libmpi.a
-rw-r--r-- 1 jehammond domain-users 1,8G syys 18 16:54 mpich-cuda-install/lib/libmpi.a
-rw-r--r-- 1 jehammond domain-users 48M syys 18 16:41 mpich-yaksa-1-install/lib/libmpi.a
-rw-r--r-- 1 jehammond domain-users 283M syys 18 15:59 mpich-yaksa-2-install/lib/libmpi.a
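As a quick sanity check of that approximation (this just evaluates the formula above, it is not a new measurement):
$ for N in 1 2 3; do awk -v n=$N 'BEGIN { printf "N=%d  ~%.0f MiB\n", n, 8e6 * 6.105^n / 2^20 }'; done
N=1  ~47 MiB
N=2  ~284 MiB
N=3  ~1736 MiB
which lines up reasonably well with the 48M, 283M and 1.8G sizes listed above.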
If you build with --with-cuda-sm=89, for example, it cuts the size down to 255M. I don't know what your deployment targets are, but you might be able to get away with --with-cuda-sm=70 if you don't target anything older than V100.
I'm not convinced that your ROCm build included AMD GPU support. I built with AMD GPU support on LUMI (MI250X) and see a similarly sized library to the specific-SM build noted above, and when I disable HIP/ROCm, I see a library size similar to what you see without GPU support. The two mpichversion outputs below show the configure line for each build; a one-line check of that line is sketched after them.
jhammond@uan03:/tmp> ll jhammond-mpich-install*/lib/libmpi.a
-rw-r--r-- 1 jhammond 56M Sep 18 18:32 jhammond-mpich-install-2/lib/libmpi.a
-rw-r--r-- 1 jhammond 229M Sep 18 18:25 jhammond-mpich-install/lib/libmpi.a
jhammond@uan03:/tmp> ./jhammond-mpich-install/bin/mpichversion
MPICH Version: 4.2a1
MPICH Release date: unreleased development copy
MPICH ABI: 0:0:0
MPICH Device: ch4:ofi
MPICH configure: --prefix=/tmp/jhammond-mpich-install --with-hip=/opt/rocm/hip --with-hip-sm=auto --with-device=ch4:ofi
MPICH CC: gcc -O2
MPICH CXX: g++ -O2
MPICH F77: gfortran -O2
MPICH FC: gfortran -O2
MPICH features: threadcomm
jhammond@uan03:/tmp> ./jhammond-mpich-install-2/bin/mpichversion
MPICH Version: 4.2a1
MPICH Release date: unreleased development copy
MPICH ABI: 0:0:0
MPICH Device: ch4:ofi
MPICH configure: --prefix=/tmp/jhammond-mpich-install-2 --with-device=ch4:ofi
MPICH CC: gcc -O2
MPICH CXX: g++ -O2
MPICH F77: gfortran -O2
MPICH FC: gfortran -O2
MPICH features: threadcomm
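Here is that one-line check, with the install prefix adjusted to yours; any hit for with-hip or with-cuda in the reported configure line means GPU support was requested at configure time:
$ ./jhammond-mpich-install/bin/mpichversion | grep -E 'with-(hip|cuda)'
MPICH configure: --prefix=/tmp/jhammond-mpich-install --with-hip=/opt/rocm/hip --with-hip-sm=auto --with-device=ch4:ofi
The same command against the second install prints nothing, since no GPU option was passed to its configure.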
It looks like you can already prune your build with --with-cuda-sm=60,70,80, for example, which would get you P100, V100 and A100, plus reasonable support for the related derivatives.
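Putting the pieces together, here is a hedged sketch of a trimmed CUDA configure line; the prefix, UCX and CUDA paths are placeholders, and it assumes the top-level configure accepts --with-cuda-sm the way it is used above:
$ ./configure --prefix=$HOME/mpich-cuda-install \
      --with-device=ch4:ucx --with-ucx=/path/to/ucx-1.14.1 \
      --with-cuda=/usr/local/cuda-11.2 --with-cuda-sm=60,70,80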