
libmpi size unusually large when compiling with CUDA support

Open · lzacchi opened this issue on Sep 15, 2023 · 5 comments

I work on a project that supports different architectures, so we've built three separate versions of MPICH 4.1.1: CUDA 11.2, ROCm 5.4.3, and CPU-only, all with UCX 1.14.1.

The CPU-only and ROCm 5 libraries are around 50 MB each. The CUDA version, on the other hand, is close to 1.8 GB! There doesn't seem to be anything out of the ordinary with the builds, and all versions work as expected. However, this really complicates packaging and distribution: since we ship both the static archive and the shared library, I am looking at more than 3 GB for MPICH alone.

This is the output of du -sh performed on the install directories:

First the rocm5 build:

$ du -sh LINUX_gcc9.3_glibc2.28_rocm5.4.3_ucx1.14.1/lib/*
(...)
53M	LINUX_gcc9.3_glibc2.28_rocm5.4.3_ucx1.14.1/lib/libmpi.a
40M	LINUX_gcc9.3_glibc2.28_rocm5.4.3_ucx1.14.1/lib/libmpi.so.12.3.0

And then the cuda build:

$ du -sh LINUX_gcc9.3_glibc2.17_cuda11.2_ucx1.14.1/lib/*
(...)
1.8G	LINUX_gcc9.3_glibc2.17_cuda11.2_ucx1.14.1/lib/libmpi.a
1.7G	LINUX_gcc9.3_glibc2.17_cuda11.2_ucx1.14.1/lib/libmpi.so.12.3.0

lzacchi commented Sep 15 '23 09:09

I have been investigating this. I see that PTX code generation happens for every compute capability from 5.2 through 8.6, which may be excessive for most deployments, particularly since MPICH is unlikely to be used on Tegra systems.

I don't know how to pass it directly from the MPICH configure invocation, but Yaksa's configure has a --with-cuda-sm= option that seems to default to all. It would probably be better off defaulting to the nvcc equivalent of -arch=all-major, unless someone has evidence that the minor-version architectural optimizations matter for Yaksa.
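For reference, the two policies in plain nvcc terms (kernel.cu is a placeholder file here; -arch=all-major requires CUDA 11.5 or newer):

# 'all-major' style: SASS for each supported major architecture, plus PTX for the newest
$ nvcc -arch=all-major -c kernel.cu -o kernel.o

# hand-rolled equivalent restricted to two majors
$ nvcc -gencode arch=compute_70,code=sm_70 \
       -gencode arch=compute_80,code=sm_80 \
       -c kernel.cu -o kernel.o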

Given that data center deployments of MPICH are always on GPUs with compute capability 6.0 (P100, rare nowadays), 7.0 (V100) or 8.0 (A100), with 9.0 (H100) coming online now, the motivation for including specializations for Tegra and GTX/RTX is not strong.

jeffhammond commented Sep 18 '23 14:09

./autogen.sh -yaksa-depth=N will solve this. Below are the default (N=3) along with N=1 and N=2. The size of the library is approximately 8e6 * 6.105**N bytes, so there is a pretty strong incentive to reduce the Yaksa depth.

$ ll mpich-cuda-install/lib/libmpi.a mpich-yaksa-1-install/lib/libmpi.a mpich-yaksa-2-install/lib/libmpi.a
-rw-r--r-- 1 jehammond domain-users 1.8G Sep 18 16:54 mpich-cuda-install/lib/libmpi.a
-rw-r--r-- 1 jehammond domain-users  48M Sep 18 16:41 mpich-yaksa-1-install/lib/libmpi.a
-rw-r--r-- 1 jehammond domain-users 283M Sep 18 15:59 mpich-yaksa-2-install/lib/libmpi.a
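A quick check of that model against the listing above (the constant is a rough fit, in bytes):

$ python3 -c 'for n in (1, 2, 3): print(n, round(8e6 * 6.105**n / 1e6), "MB")'
1 49 MB
2 298 MB
3 1820 MB

That lines up: 49 MB vs the observed 48M, 298 MB vs 283M, and about 1.8 GB vs 1.8G.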

jeffhammond commented Sep 18 '23 14:09

If you build with --with-cuda-sm=89, for example, it cuts the size down to 255M. I don't know what your deployment targets are, but you might be able to get away with --with-cuda-sm=70 if you don't target anything older than V100.
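For example, a sketch of the corresponding configure line (the prefix, device, and CUDA path here are placeholders; only --with-cuda-sm is the knob under discussion):

$ ./configure --prefix=/opt/mpich-cuda \
              --with-device=ch4:ucx \
              --with-cuda=/usr/local/cuda \
              --with-cuda-sm=70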

jeffhammond commented Sep 18 '23 14:09

I'm not convinced that your ROCm build included AMD GPU support. I built with AMD GPU support on LUMI (MI250X) and see a library of similar size to the specific-SM CUDA build noted above, and when I disable HIP/ROCm, I see a library size similar to your builds without GPU support.

jhammond@uan03:/tmp> ll jhammond-mpich-install*/lib/libmpi.a
-rw-r--r-- 1 jhammond  56M Sep 18 18:32 jhammond-mpich-install-2/lib/libmpi.a
-rw-r--r-- 1 jhammond 229M Sep 18 18:25 jhammond-mpich-install/lib/libmpi.a

jhammond@uan03:/tmp> ./jhammond-mpich-install/bin/mpichversion
MPICH Version:      4.2a1
MPICH Release date: unreleased development copy
MPICH ABI:          0:0:0
MPICH Device:       ch4:ofi
MPICH configure:    --prefix=/tmp/jhammond-mpich-install --with-hip=/opt/rocm/hip --with-hip-sm=auto --with-device=ch4:ofi
MPICH CC:           gcc    -O2
MPICH CXX:          g++   -O2
MPICH F77:          gfortran   -O2
MPICH FC:           gfortran   -O2
MPICH features:     threadcomm

jhammond@uan03:/tmp> ./jhammond-mpich-install-2/bin/mpichversion
MPICH Version:      4.2a1
MPICH Release date: unreleased development copy
MPICH ABI:          0:0:0
MPICH Device:       ch4:ofi
MPICH configure:    --prefix=/tmp/jhammond-mpich-install-2 --with-device=ch4:ofi
MPICH CC:           gcc    -O2
MPICH CXX:          g++   -O2
MPICH F77:          gfortran   -O2
MPICH FC:           gfortran   -O2
MPICH features:     threadcomm

jeffhammond commented Sep 18 '23 15:09

It looks like you can already prune your build with, for example, --with-cuda-sm=60,70,80, which would get you P100, V100, and A100, plus reasonable support for the related derivatives.
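If you want to verify which targets actually ended up in the library, cuobjdump from the CUDA toolkit can list the embedded cubins (the grep pattern assumes its usual sm_XX naming):

# expect one line per embedded target, e.g. sm_60, sm_70 and sm_80 for this build
$ cuobjdump --list-elf lib/libmpi.a | grep -o 'sm_[0-9]*' | sort -u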

jeffhammond commented Sep 18 '23 18:09