flux-core icon indicating copy to clipboard operation
flux-core copied to clipboard

libsubprocess: stopped state change not handled correctly

Open chu11 opened this issue 1 year ago • 5 comments

I noticed in the following test in libsubprocess:

ok 16 - subprocess state == RUNNING after flux_local_exec
ok 17 - flux_subprocess_kill SIGSTOP
# state_change_stopped: state = Stopped
ok 18 - subprocess state == STOPPED in state change handler
# state_change_stopped: state = Running
# state_change_stopped: state = Exited
ok 19 - flux_reactor_run returned zero status

Which obviously looks weird, that the state change callback got the "Stopped" state before "Running".

I think it's because in the child watcher we have:

    if (WIFSTOPPED (p->status)) {                                                                                                           
        if (p->ops.on_state_change)                                                                                                         
            (*p->ops.on_state_change) (p, FLUX_SUBPROCESS_STOPPED);                                                                         
    }    

on_state_change() is called immediately rather than going through the state change prep/check functions, e.g. right below this:

    if (WIFEXITED (p->status) || WIFSIGNALED (p->status)) {                                                                                 
                                                                                                                                            
        /* remote/server code may have set FAILED on fatal errors.                                                                          
         */                                                                                                                                 
        if (p->state == FLUX_SUBPROCESS_RUNNING) {                                                                                          
            p->state = FLUX_SUBPROCESS_EXITED;                                                                                              
            state_change_start (p);                                                                                                         
        }  

So we need to redo things to go through the state change prep/check functions. I am not quite sure how at the moment and it may take some thought. See the state_change_next() function:

static flux_subprocess_state_t state_change_next (flux_subprocess_t *p)                                                                     
{                                                                                                                                           
    assert (p->state != FLUX_SUBPROCESS_FAILED);                                                                                            
                                                                                                                                            
    switch (p->state_reported) {                                                                                                            
    case FLUX_SUBPROCESS_INIT:                                                                                                              
        /* next state must be RUNNING */                                                                                                    
        return FLUX_SUBPROCESS_RUNNING;                                                                                                     
    case FLUX_SUBPROCESS_RUNNING:                                                                                                           
        /* next state is EXITED */                                                                                                          
        return FLUX_SUBPROCESS_EXITED;                                                                                                      
    case FLUX_SUBPROCESS_EXITED:                                                                                                            
    case FLUX_SUBPROCESS_FAILED:                                                                                                            
    case FLUX_SUBPROCESS_STOPPED:                                                                                                           
        break;                                                                                                                              
    }                                                                                                                                       
                                                                                                                                            
    /* shouldn't be possible to reach here */                                                                                               
    assert (0);                                                                                                                             
}

simply assumes the next state after RUNNING is EXITED, but it could be STOPPED instead. I suspect some type of "state change queue" of sorts needs to be implemented to get all the states transmitted in the correct order.

Note that I created issue #5082, not so much because #5082 is that important, but if this issue is corrected, it might be wise to just handle that at the same time as STOPPED can transition to RUNNING (or a new as of yet named state).

Note to Self: We should also document all of the legal state transitions that can occur too.

chu11 avatar Apr 10 '23 18:04 chu11