OMPI5 --timeout parameter not killing job after timeout gets exceeded

Open a-szegel opened this issue 1 year ago • 4 comments

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v5.0.0

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Pulled from AWS EFA Installer v1.30.0

# On an AWS instance
curl -O https://efa-installer.amazonaws.com/aws-efa-installer-1.30.0.tar.gz
tar -xf aws-efa-installer-1.30.0.tar.gz && cd aws-efa-installer
sudo ./efa_installer.sh -y
module load openmpi5

Please describe the system on which you are running

This issue has been seen across the following systems:

centos7-hpc6a.48xlarge: test_imb[openmpi5-MPI1-Reduce_local]
debian10-c6gn.16xlarge: test_omb_collective[openmpi5-osu_iallreduce-host] 
rhel7-hpc6a.48xlarge: test_imb[openmpi5-MPI1-Reduce_local] 
rhel8-c6gn.16xlarge: test_imb[openmpi5-MPI1-Bcast] 

Details of the problem

My team's OMPI5 jobs are launched with a timeout (--timeout 1800 or --timeout 3600), but we are seeing jobs hang for a day. An example run command:

export PATH=/opt/amazon/openmpi5/bin:$PATH;export FI_EFA_USE_DEVICE_RDMA=1;export LD_LIBRARY_PATH=/home/ec2-user/tmp/PortaFiducia/build/libraries/libfabric/main/install/libfabric/lib;export FI_PROVIDER=efa;/opt/amazon/openmpi5/bin/mpirun --wdir . -n 192 --hostfile /home/ec2-user/tmp/PortaFiducia/hostfile --map-by ppr:96:node --timeout 1800 -x FI_EFA_USE_DEVICE_RDMA=1 -x LD_LIBRARY_PATH=/home/ec2-user/tmp/PortaFiducia/build/libraries/libfabric/main/install/libfabric/lib -x FI_PROVIDER=efa -x PATH  /home/ec2-user/tmp/PortaFiducia/build/workloads/imb/openmpi-v5.0.0-installer/source/mpi-benchmarks-IMB-v2021.7/IMB-MPI1 Reduce_local -npmin 192 -iter 200 -time 20 -mem 1 2>&1 | tee node2-ppn96.txt

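We are seeing the MPI jobs run for much longer than the configured timeout; in the log below, the test started at 05:43 and was still running almost 20 hours later, when our test orchestrator timed it out: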

2024-01-29 05:43:09] test_suites/libfabric/test_imb.py::test_imb[openmpi5-MPI1-Reduce_local] 2024-01-30 01:38:08,827 - WARNING - test_orchestrator - Test is being timed out...
2024-01-30 01:38:08,827 - INFO - test_orchestrator - Stopping timer...

The run logs currently don't get saved by our CI system in the event of a timeout (so I don't have better logs as of now), but I am working on that and will update the ticket when I get them.

We don't see this with OMPI4 (but that might just mean it isn't hanging in this way). The behavior is intermittent, which suggests some sort of race.
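
As a stopgap for cases where mpirun's own --timeout does not fire, an external watchdog can enforce the deadline from outside the MPI runtime. The following is a minimal sketch, assuming GNU coreutils `timeout` is available on the launch host. `run_with_deadline` is a hypothetical helper name, and `sleep 30` stands in for the mpirun command line. Note that this only signals processes on the launch host; cleanup of remote ranks still depends on mpirun reacting to the signal.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Open MPI): run a command under an
# external deadline.
# Usage: run_with_deadline <deadline_s> <grace_s> <command> [args...]
run_with_deadline() {
    local deadline_s=$1 grace_s=$2
    shift 2
    # GNU coreutils timeout: send SIGTERM at the deadline, then SIGKILL
    # if the command is still alive grace_s seconds after that.
    timeout --signal=TERM --kill-after="${grace_s}" "${deadline_s}" "$@"
}

# Stand-in command; in CI this would be the mpirun invocation, with the
# external deadline set somewhat above the value passed to --timeout.
rc=0
run_with_deadline 1 1 sleep 30 || rc=$?

# timeout exits with 124 when the deadline expired and the command was
# terminated, and with 137 when SIGKILL had to be sent.
if [ "$rc" -eq 124 ] || [ "$rc" -eq 137 ]; then
    echo "external deadline exceeded (exit status ${rc})"
fi
```

Keeping the external deadline above the mpirun --timeout value means the watchdog only acts when the built-in timeout has already failed.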

a-szegel avatar Feb 06 '24 03:02 a-szegel

The symptoms match https://github.com/open-mpi/ompi/issues/12064

That issue has been fixed in 5.0.1.

wenduwan avatar Feb 06 '24 04:02 wenduwan

My concern is that the timeout failed to kill the job; I understand that the hang itself has been fixed in 5.0.1.

a-szegel avatar Feb 06 '24 06:02 a-szegel

We will re-evaluate the issue after the 5.0.2 release. The timeout functionality is also implemented in PRRTE, so it is possible that the hang fix also resolves this issue.

wenduwan avatar Feb 06 '24 16:02 wenduwan

FWIW: working fine in PRRTE master

rhc54 avatar Feb 06 '24 18:02 rhc54

Issue not observed in 5.0.2. Resolving.
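
For CI setups that want to guard against running on an affected release, a version check before launching is one option. This is a minimal sketch, assuming GNU `sort -V` is available; `version_ge` is a hypothetical helper, and the version strings are passed in as literals here rather than parsed from the installed mpirun.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when version $1 is greater than or equal
# to version $2, using GNU sort's version ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Illustrative values: the version from the original report and the
# release in which the issue was no longer observed.
installed="5.0.0"
required="5.0.2"

if version_ge "$installed" "$required"; then
    echo "Open MPI ${installed} is at or above ${required}"
else
    echo "Open MPI ${installed} is below ${required}; --timeout may not kill a hung job"
fi
```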

wenduwan avatar Feb 22 '24 17:02 wenduwan