Grid
MPI2 romio321 library fails when reading >= 2GB per rank
Git commit
develop HEAD 135808dcfa767edf988976ae31d2876bb6389f8b
Target Platform
University of Edinburgh Extreme Scaling system “Tursa”
Each node: 2 x AMD ROME EPYC 32, Nvidia A100 (40GB), 1TB RAM
Linux tursa-login1 4.18.0-305.10.2.el8_4.x86_64 #1 SMP Mon Jul 12 04:43:18 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
Configure
../configure --enable-comms=mpi --enable-simd=GPU --enable-shm=nvlink --enable-gen-simd-width=64 --enable-accelerator=cuda --enable-accelerator-cshift --enable-unified \
--with-gmp=/mnt/lustre/tursafs1/home/dp207/dp207/shared/env/spack/opt/spack/linux-rhel8-zen2/gcc-9.4.0/gmp-6.2.1-4qzl4yfdllwmf42zewg44gb4y54bgy2d \
--with-mpfr=/mnt/lustre/tursafs1/home/dp207/dp207/shared/env/spack/opt/spack/linux-rhel8-zen2/gcc-9.4.0/mpfr-4.1.0-agsa52nljiqbbrzrpln5ebgclzxesm7a \
--with-fftw=/mnt/lustre/tursafs1/home/dp207/dp207/shared/env/spack/opt/spack/linux-rhel8-zen2/gcc-9.4.0/fftw-3.3.10-bdpumbnknoewgtzgirxrvy3weveminw3 \
--with-hdf5=/mnt/lustre/tursafs1/home/dp207/dp207/shared/env/spack/opt/spack/linux-rhel8-zen2/gcc-9.4.0/hdf5-1.10.7-qld75yuu7gpncparpqq46hvuqzz4s6zx \
--with-lime=/mnt/lustre/tursafs1/home/dp207/dp207/shared/env/spack/opt/spack/linux-rhel8-zen2/gcc-9.4.0/c-lime-2-3-9-ie76iwlrgadc24aniq57wz5rv7dmt4b4 \
CXX=nvcc \
CXXFLAGS='-ccbin mpicxx -gencode arch=compute_80,code=sm_80 -std=c++14 -cudart shared -I/mnt/lustre/tursafs1/apps/basestack/cuda-11.4/openmpi/4.1.1-cuda11.4/include' \
LDFLAGS='-cudart shared -L/mnt/lustre/tursafs1/apps/basestack/cuda-11.4/openmpi/4.1.1-cuda11.4/lib' \
LIBS='-lrt -lmpi' \
--prefix=/mnt/lustre/tursafs1/home/dp207/dp207/shared/runs/semilep/code/3/Prefix
Attachments
- config.log
- grid.configure.summary
- GridMakeV1.txt Output from make V=1
- MPIRead32.cpp Minimal reproducer available https://github.com/mmphys/MPIRead32
- Bad.log Minimal reproducer output showing issue
- Good.log Minimal reproducer output showing workaround
- GaugeLoad.cpp Reproducer using Grid to load gauge field
- GridBad.log Grid reproducer output showing issue
- GridGood.log Grid reproducer output showing workaround
Issue Description
When Open MPI is configured to use the romio321 component for MPI-IO, MPI_File_read_all() fails when reading >= 2GB into a single MPI rank.
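For orientation, here is a minimal sketch of the call pattern that triggers the failure: one collective MPI_File_read_all() asked to deliver more than 2GB (more than INT_MAX bytes) to a single rank through a derived datatype. This is not the attached MPIRead32.cpp; the file name, block size and block count are illustrative choices only.

// Sketch only: read ~2.2 GB per rank in a single collective call.
// Under romio321 the internal 32-bit size arithmetic overflows;
// under ompio the same call succeeds.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (argc < 2) {
        if (rank == 0) std::fprintf(stderr, "Usage: %s <file>\n", argv[0]);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    const MPI_Offset BlockBytes    = 1 << 20;   // 1 MiB per element
    const int        BlocksPerRank = 2150;      // ~2.2 GB per rank, i.e. > INT_MAX bytes
    const MPI_Offset BytesPerRank  = BlockBytes * BlocksPerRank;

    // A derived datatype keeps the read count itself within int range
    MPI_Datatype Block;
    MPI_Type_contiguous(static_cast<int>(BlockBytes), MPI_BYTE, &Block);
    MPI_Type_commit(&Block);

    std::vector<char> buf(static_cast<std::size_t>(BytesPerRank));

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, argv[1], MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    // Each rank reads its own contiguous chunk of the file
    MPI_File_set_view(fh, rank * BytesPerRank, MPI_BYTE, MPI_BYTE, "native", MPI_INFO_NULL);

    MPI_Status st;
    int rc = MPI_File_read_all(fh, buf.data(), BlocksPerRank, Block, &st);
    if (rc != MPI_SUCCESS)
        std::fprintf(stderr, "Rank %d: MPI_File_read_all failed\n", rank);

    MPI_File_close(&fh);
    MPI_Type_free(&Block);
    MPI_Finalize();
    return rc == MPI_SUCCESS ? 0 : 1;
}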
Issue Workaround
Other MPI-IO implementations do not have this limitation/bug; switching to ompio, for example, resolves the issue on Tursa.
Note: romio321 is currently the recommended MPI-IO library on Tursa, and the commissioning performance tests were carried out using romio321. I see a performance hit when using ompio (~5 GB/s) instead of romio321 (~10 GB/s) on a single node, but I have not tested how this scales.
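With Open MPI the same component selection can typically also be made through the environment rather than on every mpirun command line (assuming the usual OMPI_MCA_ prefix convention), for example:

export OMPI_MCA_io=ompio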
Minimal reproducer -- MPIRead32.cpp
MPIRead32.cpp (https://github.com/mmphys/MPIRead32) is the minimal code needed to reproduce the issue. Note that it is independent of Grid.
To demonstrate the issue we run the following command on Tursa:
mpirun --mca io romio321 -np 2 MPIRead32 a.out 0 2.1 2304.4608 &> Bad.log
Re-running the same command, but this time choosing the ompio I/O library, works around the issue:
mpirun --mca io ompio -np 2 MPIRead32 a.out 0 2.1 2304.4608 > Good.log
Grid reproducer -- GaugeLoad.cpp
The issue was first noticed on Tursa when using Grid to load a Gauge field.
To demonstrate the issue we run the following command on Tursa:
mpirun --mca io romio321 -np 2 GaugeLoad /mnt/lustre/tursafs1/home/dp207/dp207/shared/dwf_2+1f/F1M/ckpoint_EODWF_lat.200 --grid 48.48.48.96 --mpi 2.1.1.1 &> GridBad.log
Re-running the same command, but this time choosing the ompio I/O library, works around the issue:
mpirun --mca io ompio -np 2 GaugeLoad /mnt/lustre/tursafs1/home/dp207/dp207/shared/dwf_2+1f/F1M/ckpoint_EODWF_lat.200 --grid 48.48.48.96 --mpi 2.1.1.1 > GridGood.log
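For reference, a rough outline of what a loader along the lines of GaugeLoad.cpp might look like (the attached GaugeLoad.cpp is authoritative; this is an untested sketch using Grid's NERSC reader, as in Grid's own I/O tests):

// Outline only: load a NERSC-format gauge configuration through Grid.
// The collective read inside the NERSC reader is where MPI_File_read_all()
// fails under romio321 once a single rank has to read >= 2GB.
#include <Grid/Grid.h>

int main(int argc, char **argv)
{
    using namespace Grid;
    Grid_init(&argc, &argv);                 // parses --grid and --mpi

    GridCartesian *UGrid = SpaceTimeGrid::makeFourDimGrid(
        GridDefaultLatt(), GridDefaultSimd(Nd, vComplex::Nsimd()), GridDefaultMpi());

    LatticeGaugeField Umu(UGrid);
    FieldMetaData header;
    NerscIO::readConfiguration(Umu, header, std::string(argv[1]));

    Grid_finalize();
    return 0;
}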
Sorry to hear you are running into problems with ROMIO from MPICH-3.2.1.
The patch that promotes the offending datatype to a 64-bit value is this one: https://github.com/pmodels/mpich/commit/3a479ab0, though it might not be worth backporting to whichever version of Open MPI you are running: Open MPI has updated its ROMIO to 3.4.1, which should contain the fix.
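For anyone hitting this elsewhere, the 2GB boundary is simply INT_MAX bytes; a quick illustration (numbers chosen for illustration only) of why a 32-bit intermediate cannot hold a per-rank read of ~2.2 GB, which is what promoting the intermediate to 64 bits fixes:

#include <cstdint>
#include <cstdio>

int main()
{
    const int64_t blocks     = 2150;                 // e.g. 1 MiB blocks per rank
    const int64_t blockBytes = 1 << 20;
    const int64_t total      = blocks * blockBytes;  // 2,254,438,400 bytes
    std::printf("total = %lld bytes, fits in a 32-bit int? %s\n",
                static_cast<long long>(total),
                total <= INT32_MAX ? "yes" : "no");  // prints "no"
    return 0;
}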
Thanks for the pointer to the fix. Will ask whether we can update Tursa to Open MPI's ROMIO 3.4.1.