Can UCX handle SIGSTOP and SIGCONT?

Open razor1991 opened this issue 1 year ago • 3 comments

I'm using Open MPI with UCX 1.10.1 and the UD transport (TLS). I want to suspend an MPI program by sending it SIGSTOP and resume it with SIGCONT. However, if the program stays suspended for a long time, UD times out after I resume it.

image

I want to ask,

  1. Does UCX support being stopped with SIGSTOP for a long time and then resumed with SIGCONT when using UD?
  2. What about the RC transport?

PS: I tried setting UCX_UD_MLX5_TIMEOUT to a large value, and with that the program resumes without timing out.

razor1991, Nov 02 '23 05:11
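
To illustrate the workaround from the PS, here is a minimal sketch that builds an Open MPI launch command with a raised UD timeout. The variable name UCX_UD_MLX5_TIMEOUT comes from the post. The value `600s`, the helper name `build_launch`, and the application path are assumptions for illustration only. Pick a timeout longer than the longest expected suspension.

```python
import os


def build_launch(app_argv, ud_timeout="600s", nprocs=2, base_env=None):
    """Return (argv, env) for an mpirun launch with a raised UD timeout.

    build_launch is a hypothetical helper; "600s" is an assumed example value.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Variable name taken from the thread; the value is an assumption.
    env["UCX_UD_MLX5_TIMEOUT"] = ud_timeout
    argv = [
        "mpirun", "-np", str(nprocs),
        "-x", "UCX_UD_MLX5_TIMEOUT",  # forward the variable to every rank
        *app_argv,
    ]
    return argv, env


if __name__ == "__main__":
    argv, env = build_launch(["./my_mpi_app"])  # placeholder application
    print(" ".join(argv))
    print("UCX_UD_MLX5_TIMEOUT=" + env["UCX_UD_MLX5_TIMEOUT"])
```

The returned pair can be passed to `subprocess.run(argv, env=env)` to start the job.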

The above behavior makes sense. Without progress in the process, UD will consider the packet lost and may eventually time out.

yosefe, Nov 02 '23 09:11

> The above behavior makes sense. Without progress in the process, UD will consider the packet lost and may eventually time out.

Thanks for your reply. If I add a signal handler for UD in UCT that resets the timer to avoid the timeout, do you think that would be a reasonable design?

razor1991, Nov 07 '23 03:11
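
The idea in the question can be sketched conceptually as follows. This is not UCX or UCT code: `RetransmitTimer` and `install_sigcont_reset` are hypothetical names, and the timer is a toy stand-in for a transport timeout. The sketch relies on the fact that SIGSTOP cannot be caught but SIGCONT can, so the reset happens on resume. It only resets the timer of the local process and does nothing for the timers running in its peers.

```python
import signal
import time


class RetransmitTimer:
    """Toy stand-in for a transport timeout (hypothetical, not UCX code)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_activity = clock()

    def reset(self):
        # Treat "now" as the last time progress was made.
        self.last_activity = self.clock()

    def expired(self):
        # The monotonic clock keeps advancing while the process is stopped,
        # so a long suspension looks like a long period without progress.
        return self.clock() - self.last_activity > self.timeout_s


def install_sigcont_reset(timer):
    """Reset the timer whenever the process receives SIGCONT.

    Must be called from the main thread; returns the previous handler.
    """
    def on_sigcont(signum, frame):
        timer.reset()

    return signal.signal(signal.SIGCONT, on_sigcont)
```

Calling `install_sigcont_reset(timer)` once at start-up is enough. The handler then runs each time the process is continued.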

> > The above behavior makes sense. Without progress in the process, UD will consider the packet lost and may eventually time out.
>
> Thanks for your reply. If I add a signal handler for UD in UCT that resets the timer to avoid the timeout, do you think that would be a reasonable design?

It would work if all processes in the job are stopped and continued simultaneously. If only one process is stopped, the timeout will still happen in the other processes that communicate with it.

yosefe, Nov 07 '23 08:11
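
As a rough sketch of the "stop and continue all processes together" condition, the helper below sends the same signal to every rank on the local node back to back. The function names and the injectable `kill` parameter are assumptions for illustration. A multi-node job would need the same step on each node at about the same time, and the delivery is only approximately simultaneous.

```python
import os
import signal


def signal_all(pids, sig, kill=os.kill):
    """Send sig to every pid back to back; return the pids that no longer exist."""
    missing = []
    for pid in pids:
        try:
            kill(pid, sig)
        except ProcessLookupError:
            missing.append(pid)
    return missing


def suspend_job(pids, kill=os.kill):
    """Stop all local ranks of the job (hypothetical helper)."""
    return signal_all(pids, signal.SIGSTOP, kill)


def resume_job(pids, kill=os.kill):
    """Continue all local ranks of the job (hypothetical helper)."""
    return signal_all(pids, signal.SIGCONT, kill)
```

The shorter the gap between the first and last signal, the less time any running rank spends waiting on a stopped peer.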