ompi
ucx: meet at the pmix fence before disconnecting to avoid an infinite loop
While moving a job from a small number of GPU nodes to a larger number of CPU nodes, I was able to reliably reproduce #11087 in my environment. In the debugger, I found that opal_common_ucx_mca_pmix_fence was spinning forever, waiting for the fence to complete. Calls into the UCX layer showed no pending operations, no active endpoints, and no outstanding flushes. Since UCX is the transport that lets the processes synchronize in this case, it makes no sense to fence after disconnecting. Reversing the order of operations, so that the processes meet at the fence before they disconnect, resolved the shutdown hang.
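To illustrate why the order matters, here is a minimal toy model in C. It is not the Open MPI or UCX code: every `toy_*` name is hypothetical, and the assumption that the fence can only make progress while endpoints are connected is a simplification of the behavior described above. The spin is bounded by `max_spins` so the sketch terminates instead of hanging.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for transport state; not an Open MPI or UCX type. */
typedef struct {
    bool connected;        /* endpoints are still up */
    bool fence_requested;  /* a fence has been started */
    bool fence_done;       /* the fence has completed */
} toy_transport_t;

/* One progress call. Assumption: the fence can only complete while the
 * transport is still connected. */
static void toy_progress(toy_transport_t *t)
{
    if (t->fence_requested && t->connected) {
        t->fence_done = true;
    }
}

/* Spin on progress until fenced, giving up after max_spins iterations.
 * Returns true if the fence completed. */
static bool toy_fence(toy_transport_t *t, int max_spins)
{
    t->fence_requested = true;
    t->fence_done = false;
    for (int i = 0; i < max_spins && !t->fence_done; i++) {
        toy_progress(t);
    }
    return t->fence_done;
}

static void toy_disconnect(toy_transport_t *t)
{
    t->connected = false;
}

/* Old order: disconnect, then fence. In this model the fence never completes. */
static bool toy_shutdown_disconnect_first(int max_spins)
{
    toy_transport_t t = { .connected = true };
    toy_disconnect(&t);
    return toy_fence(&t, max_spins);
}

/* New order: fence while still connected, then disconnect. */
static bool toy_shutdown_fence_first(int max_spins)
{
    toy_transport_t t = { .connected = true };
    bool fenced = toy_fence(&t, max_spins);
    toy_disconnect(&t);
    assert(!t.connected);
    return fenced;
}
```

In this model, `toy_shutdown_fence_first(1000)` returns true, while `toy_shutdown_disconnect_first(1000)` exhausts its spin budget and returns false. That exhausted budget is the bounded analogue of the unbounded spin seen in the debugger.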
This fixes #11087.
This might fix https://github.com/openucx/ucx/issues/8738