OMPI5 --timeout parameter not killing job after timeout gets exceeded
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
v5.0.0
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Pulled from AWS EFA Installer v1.30.0
# On an AWS instance
curl -O https://efa-installer.amazonaws.com/aws-efa-installer-1.30.0.tar.gz
tar -xf aws-efa-installer-1.30.0.tar.gz && cd aws-efa-installer
sudo ./efa_installer.sh -y
module load openmpi5
Please describe the system on which you are running
This issue has been seen across the following systems:
centos7-hpc6a.48xlarge: test_imb[openmpi5-MPI1-Reduce_local]
debian10-c6gn.16xlarge: test_omb_collective[openmpi5-osu_iallreduce-host]
rhel7-hpc6a.48xlarge: test_imb[openmpi5-MPI1-Reduce_local]
rhel8-c6gn.16xlarge: test_imb[openmpi5-MPI1-Bcast]
- Computer hardware: see AWS Instance Types
- Network Type: EFA
Details of the problem
My team's OMPI5 jobs are launched with --timeout 1800 or --timeout 3600, but we are seeing jobs hang for a day. An example run command:
export PATH=/opt/amazon/openmpi5/bin:$PATH;export FI_EFA_USE_DEVICE_RDMA=1;export LD_LIBRARY_PATH=/home/ec2-user/tmp/PortaFiducia/build/libraries/libfabric/main/install/libfabric/lib;export FI_PROVIDER=efa;/opt/amazon/openmpi5/bin/mpirun --wdir . -n 192 --hostfile /home/ec2-user/tmp/PortaFiducia/hostfile --map-by ppr:96:node --timeout 1800 -x FI_EFA_USE_DEVICE_RDMA=1 -x LD_LIBRARY_PATH=/home/ec2-user/tmp/PortaFiducia/build/libraries/libfabric/main/install/libfabric/lib -x FI_PROVIDER=efa -x PATH /home/ec2-user/tmp/PortaFiducia/build/workloads/imb/openmpi-v5.0.0-installer/source/mpi-benchmarks-IMB-v2021.7/IMB-MPI1 Reduce_local -npmin 192 -iter 200 -time 20 -mem 1 2>&1 | tee node2-ppn96.txt
We are seeing the MPI jobs run for much longer than that:
[2024-01-29 05:43:09] test_suites/libfabric/test_imb.py::test_imb[openmpi5-MPI1-Reduce_local] 2024-01-30 01:38:08,827 - WARNING - test_orchestrator - Test is being timed out...
2024-01-30 01:38:08,827 - INFO - test_orchestrator - Stopping timer...
The run logs currently don't get saved by our CI system in the event of a timeout (so I don't have better logs as of now), but I am working on that and will update the ticket when I get them.
We don't see this with OMPI4, though that might just mean OMPI4 isn't hanging in this way in the first place. The behavior is not consistent, which suggests some sort of race.
The symptom aligns with https://github.com/open-mpi/ompi/issues/12064
That issue has been fixed in 5.0.1.
My concern is that the timeout failed to kill the job; I understand the hang itself has been fixed in 5.0.1.
We will re-evaluate the issue after the 5.0.2 release. The timeout functionality is also implemented in PRRTE, so it is possible the hang fix also resolves this issue.
FWIW: working fine in PRRTE master
Issue not observed in 5.0.2. Resolving.
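For anyone who cannot move off 5.0.0 yet, one possible stopgap is to wrap the launch in an external watchdog so the job is bounded even if mpirun's own --timeout does not fire. The sketch below is an assumption on my part, not something tested in this thread: it relies on GNU coreutils `timeout` being installed, and `run_with_watchdog` is a hypothetical helper name. Whether remote ranks get cleaned up still depends on how mpirun handles the SIGTERM it receives.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Open MPI): run a command under an
# external watchdog using GNU coreutils `timeout`.
# Usage: run_with_watchdog <limit_seconds> <grace_seconds> <command> [args...]
run_with_watchdog() {
    local limit="$1" grace="$2" rc=0
    shift 2
    # Send SIGTERM once the limit is exceeded, then SIGKILL if the command
    # is still alive after the grace period.
    timeout --signal=TERM --kill-after="$grace" "$limit" "$@" || rc=$?
    # `timeout` exits with 124 when the limit was hit, or 137 (128 + SIGKILL)
    # when it had to escalate to SIGKILL.
    if [ "$rc" -eq 124 ] || [ "$rc" -eq 137 ]; then
        echo "watchdog: command exceeded ${limit}s and was killed" >&2
    fi
    return "$rc"
}
```

As an illustration, `run_with_watchdog 1900 60 mpirun --timeout 1800 ...` keeps the original --timeout 1800 and gives mpirun a 100-second head start to time out on its own before the watchdog steps in. The 1900 and 60 values are arbitrary. If the output is piped through tee as in the example command above, read the status from `${PIPESTATUS[0]}` rather than `$?`.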