Can UCX handle SIGSTOP and SIGCONT?
I'm using Open MPI with UCX 1.10.1 and the ud TLS.
I want to suspend an MPI job by sending it SIGSTOP and resume it later with SIGCONT.
However, if the job stays suspended for a long time, UD times out after I resume the MPI program.
I want to ask:
- Does UCX support handling SIGSTOP and SIGCONT over a long suspension with UD?
- How about the RC TLS?
PS: I tried setting UCX_UD_MLX5_TIMEOUT to a large value, and that works.
The above behavior makes sense. Without progress in the process, UD will consider the packet lost and may eventually time out.
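To illustrate the reasoning, here is a minimal sketch of a silence-based timeout check. It is a toy model, not UCX code: `toy_ep_t`, `toy_ep_timed_out`, and the 300-second value are hypothetical names and numbers chosen only for illustration.

```c
#include <stdbool.h>

/* Hypothetical, simplified model of a per-endpoint timer (not UCX code). */
typedef struct {
    double last_ack_time; /* seconds: when the peer last acknowledged us */
    double timeout;       /* seconds of silence tolerated before giving up */
} toy_ep_t;

/* Returns true when the peer has been silent for longer than the timeout.
 * A stopped process cannot send acks, so from its peers' point of view the
 * silence keeps growing for as long as the process stays suspended. */
bool toy_ep_timed_out(const toy_ep_t *ep, double now)
{
    return (now - ep->last_ack_time) > ep->timeout;
}
```

With `timeout = 300.0` and `last_ack_time = 0.0`, a check at `now = 10.0` passes, while a check at `now = 600.0` (for example, after a long suspend) reports a timeout. Raising the timeout value only moves that threshold.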
Thanks for your reply. If I add a signal handler for UD in UCT that resets the timer to avoid the timeout, do you think that would be a reasonable design?
It would work if all processes in the job are stopped and continued simultaneously, since if a single process is stopped, the timeout will happen in the other processes that communicate with it.
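The handler idea discussed above could be sketched roughly as follows. This is only an illustration using standard POSIX signal calls; `toy_timer_t`, `toy_install_sigcont_handler`, and `toy_timer_check` are hypothetical names, and nothing here reflects how UCT actually tracks its UD timers. It also only protects the process that was resumed: a peer that kept running would still see the silence and time out on its own side.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdbool.h>
#include <string.h>

/* Set by the handler; sig_atomic_t is the only type safe to write there. */
static volatile sig_atomic_t resumed_flag = 0;

static void on_sigcont(int signo)
{
    (void)signo;
    resumed_flag = 1; /* do the real work later, outside signal context */
}

/* Install a SIGCONT handler. Returns 0 on success, -1 on failure. */
int toy_install_sigcont_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigcont;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(SIGCONT, &sa, NULL);
}

/* Hypothetical timer state (not a UCX structure). */
typedef struct {
    double last_ack_time; /* seconds */
    double timeout;       /* seconds */
} toy_timer_t;

/* Called from the progress loop. If the process was just resumed, restart
 * the timeout window from `now` instead of counting the suspended time.
 * Returns true if the timer has expired. */
bool toy_timer_check(toy_timer_t *t, double now)
{
    if (resumed_flag) {
        resumed_flag = 0;
        t->last_ack_time = now;
    }
    return (now - t->last_ack_time) > t->timeout;
}
```

Keeping the handler down to setting a flag and doing the reset in the progress loop avoids touching transport state from signal context.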