flux-core
flux-core copied to clipboard
libsubprocess: stopped state change not handled correctly
I noticed in the following test in libsubprocess
:
ok 16 - subprocess state == RUNNING after flux_local_exec
ok 17 - flux_subprocess_kill SIGSTOP
# state_change_stopped: state = Stopped
ok 18 - subprocess state == STOPPED in state change handler
# state_change_stopped: state = Running
# state_change_stopped: state = Exited
ok 19 - flux_reactor_run returned zero status
Which obviously looks weird, that the state change callback got the "Stopped" state before "Running".
I think it's because in the child watcher we have:
if (WIFSTOPPED (p->status)) {
if (p->ops.on_state_change)
(*p->ops.on_state_change) (p, FLUX_SUBPROCESS_STOPPED);
}
on_state_change()
is called immediately rather than going through the state change prep/check functions, e.g. right below this:
if (WIFEXITED (p->status) || WIFSIGNALED (p->status)) {
/* remote/server code may have set FAILED on fatal errors.
*/
if (p->state == FLUX_SUBPROCESS_RUNNING) {
p->state = FLUX_SUBPROCESS_EXITED;
state_change_start (p);
}
So we need to redo things to go through the state change prep/check functions. I am not quite sure how at the moment and it may take some thought. See the state_change_next()
function:
static flux_subprocess_state_t state_change_next (flux_subprocess_t *p)
{
assert (p->state != FLUX_SUBPROCESS_FAILED);
switch (p->state_reported) {
case FLUX_SUBPROCESS_INIT:
/* next state must be RUNNING */
return FLUX_SUBPROCESS_RUNNING;
case FLUX_SUBPROCESS_RUNNING:
/* next state is EXITED */
return FLUX_SUBPROCESS_EXITED;
case FLUX_SUBPROCESS_EXITED:
case FLUX_SUBPROCESS_FAILED:
case FLUX_SUBPROCESS_STOPPED:
break;
}
/* shouldn't be possible to reach here */
assert (0);
}
simply assumes the next state after RUNNING
is EXITED
, but it could be STOPPED
instead. I suspect some type of "state change queue" of sorts needs to be implemented to get all the states transmitted in the correct order.
Note that I created issue #5082, not so much because #5082 is that important, but if this issue is corrected, it might be wise to just handle that at the same time as STOPPED can transition to RUNNING (or a new as of yet named state).
Note to Self: We should also document all of the legal state transitions that can occur too.