
broker: subprocess.c:249: subprocess_check_completed: Assertion `p->state == FLUX_SUBPROCESS_EXITED' failed.

Open grondo opened this issue 9 months ago • 21 comments

We just hit this assertion failure on elcap:

 broker: subprocess.c:249: subprocess_check_completed: Assertion `p->state == FLUX_SUBPROCESS_EXITED' failed.

A core was captured. Here's the backtrace:

#0  0x00007ffff61eeacf in raise () from /lib64/libc.so.6
#1  0x00007ffff61c1ea5 in abort () from /lib64/libc.so.6
#2  0x00007ffff61c1d79 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
#3  0x00007ffff61e7426 in __assert_fail () from /lib64/libc.so.6
#4  0x00007fff70ec0fdf in subprocess_check_completed (p=<optimized out>) at subprocess.c:249
#5  subprocess_check_completed (p=<optimized out>) at subprocess.c:247
#6  0x00007fff70ec03ff in remote_completion (p=0x7ffcf8dbea80) at remote.c:605
#7  rexec_continuation (f=0x7ffcfad50030, arg=0x7ffcf8dbea80) at remote.c:618
#8  0x00007ffff7b958e3 in ev_invoke_pending (loop=0x7fff14002450) at ev.c:3770
#9  0x00007ffff7b99988 in ev_run (flags=0, loop=0x7fff14002450) at ev.c:4190
#10 ev_run (loop=0x7fff14002450, flags=0) at ev.c:4021
#11 0x00007ffff7b6b7ff in flux_reactor_run (r=0x7fff14002430, flags=flags@entry=0) at reactor.c:124
#12 0x00007fff70eb3214 in mod_main (h=0x7fff14000be0, argc=<optimized out>, argv=0x7fff14003260) at job-exec.c:1565
#13 0x0000000000411ecf in module_thread (arg=0x136dd30) at module.c:221
#14 0x00007ffff79301ca in start_thread () from /lib64/libpthread.so.0
#15 0x00007ffff61d9e73 in clone () from /lib64/libc.so.6

and the subprocess structure in question:

{h = 0x7fff14000be0, reactor = 0x7fff14002430, rank = xxxx, flags = 0, local = false, llog = 0x7ffff7b67420 <flux_llog>, llog_data = 0x7fff14000be0, refcount = 1, pid = 0, pid_set = false, ops = {on_completion = 0x7fff70eb6d60 <exec_complete_cb>, 
    on_state_change = 0x7fff70eb6c10 <exec_state_cb>, on_channel_out = 0x7fff70eb6810 <exec_output_cb>, on_stdout = 0x7fff70eb6810 <exec_output_cb>, on_stderr = 0x7fff70eb6810 <exec_output_cb>}, cmd = 0x7ffcf8dbe480, aux = 0x7ffcfad50310, channels = 0x7ffcf8dbec50, 
  channels_eof_expected = 2, channels_eof_sent = 1, status = 0, state = FLUX_SUBPROCESS_INIT, state_reported = FLUX_SUBPROCESS_INIT, state_prep_w = 0x7ffcfad4e230, state_idle_w = 0x7ffcfad4e280, state_check_w = 0x7ffcfad4e2d0, completed = false, 
  completed_prep_w = 0x7ffcfad4e320, completed_idle_w = 0x7ffcfad4e370, completed_check_w = 0x7ffcfad4e3c0, sync_fds = {12061, 12062}, in_hook = false, child_w = 0x0, hooks = {pre_exec = 0x0, pre_exec_arg = 0x0, post_fork = 0x0, post_fork_arg = 0x0}, 
  service_name = 0x7ffcfad4e210 "sdexec", f = 0x7ffcfad50030, remote_completed = true, failed_errno = 0, failed_error = {text = '\000' <repeats 159 times>}, signal_pending = 0}

grondo avatar May 10 '24 17:05 grondo

Looking at the code, this assertion failure can trigger if the subprocess server terminates the streaming subprocess response before the subprocess itself has been marked EXITED.

Since it appears from above that the PID for this remote process has not yet even been set, I wonder if there is some case where an early error can cause this condition.

We did have a couple nodes that were OOMing around the time of this failure. In that case, maybe fork(2) or posix_spawn(3) failed in a way that we haven't tested, and the subprocess server side terminates the RPC before a PID is even available? All speculation at this point, but perhaps I can try to recreate this condition.

I did think that we tested many failure modes in the subprocess protocol, so I could be off base here.

grondo avatar May 10 '24 17:05 grondo

Unfortunately it isn't as simple as the above. If there's an error in fork/spawn on the remote side then that error is propagated back in a response to rexec.exec which properly "completes" the subprocess without calling remote_completion(). Somehow we must be getting farther along than that, then explicitly sending ENODATA.

grondo avatar May 10 '24 18:05 grondo

We did have a couple nodes that were OOMing around the time of this failure. In that case, maybe fork(2) or posix_spawn(3) failed in a way that we haven't tested, and the subprocess server side terminates the RPC before a PID is even available? All speculation at this point, but perhaps I can try to recreate this condition.

I think it's a good speculation. My first guess was that the subprocess state was FAILED and we had an error in the recent libsubprocess performance improvement. But state = FLUX_SUBPROCESS_INIT suggests something failed quite early on.

    if (create_process (p) < 0)                                                                          
        return -1;                                                                                       
                                                                                                         
    p->state = FLUX_SUBPROCESS_RUNNING;                                                                  

we set the state to RUNNING right after we create the process, so there's not a lot of window for some other error to happen. That, or it was some error in the normal subprocess setup (signal handling, fd setup, etc.)

I notice that remote_completed is true, so

    if (subprocess_rexec_get (f) < 0) {                                                                  
        if (errno == ENODATA) {                                                                          
            remote_completion (p);                                                                       
            return;                                                                                      
        }                                                                                                
        set_failed (p, "%s", future_strerror (f, errno));                                                
        goto error;                                                                                      
    }   

and

static void remote_completion (flux_subprocess_t *p)                                                     
{                                                                                                        
    p->remote_completed = true;                                                                          
    /* TBON inorder delivery of messages should guarantee we received                                    
     * FLUX_SUBPROCESS_EXITED before this.                                                               
     */                                                                                                  
    subprocess_check_completed (p);                                                                      
}           

is the call path.

Simply assuming ENODATA means the process exited is probably just a bad assumption. We could

chu11 avatar May 10 '24 18:05 chu11

Also, crashing here is probably not the best solution. Perhaps we could turn the assertion into a warning and do proper cleanup to avoid the crash but log an error. If we could get a temporary patch and spin a v0.62.1, that would be appreciated on the elcap side.

We can also do a quick audit of other assertions in the code and demote them to warnings where it makes sense.

grondo avatar May 10 '24 18:05 grondo

@chu11, if you want to take point on the fix, would it be possible to demote this assertion to a logged error by today? If we can get that patch ready and somewhat tested, I could throw it into a flux-core-0.62.0-2 RPM so if it happens again we're ready with a fix on elcap.

Meanwhile, we can try to get to the bottom of the actual issue. FYI, for testing I was using the following patch:

diff --git a/src/common/libsubprocess/local.c b/src/common/libsubprocess/local.c
index 40a484284..8ee1ea1e8 100644
--- a/src/common/libsubprocess/local.c
+++ b/src/common/libsubprocess/local.c
@@ -452,6 +452,13 @@ int subprocess_local_setup (flux_subprocess_t *p)
         return -1;
     if (local_setup_channels (p) < 0)
         return -1;
+
+    if (flux_cmd_getenv (flux_subprocess_get_cmd (p),
+                         "SUBPROCESS_LOCAL_FAIL")) {
+        errno = ENOMEM;
+        return -1;
+    }
+
     if (create_process (p) < 0)
         return -1;
 

which allows an ENOMEM to be simulated by setting an environment variable in the subprocess you're running. We could move that around and see if we could trigger the bug...
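
For example, a small driver along these lines could exercise it over the remote path (a sketch only: the target rank, command, and callback wiring are arbitrary choices for illustration, and error handling is mostly elided):

/* Sketch of a driver for the test patch above: send a command to a
 * remote rank with SUBPROCESS_LOCAL_FAIL set so that the server-side
 * subprocess_local_setup() fails with the simulated ENOMEM.
 */
#include <stdio.h>
#include <flux/core.h>

extern char **environ;

static void state_cb (flux_subprocess_t *p, flux_subprocess_state_t state)
{
    fprintf (stderr, "state: %s\n", flux_subprocess_state_string (state));
}

static void completion_cb (flux_subprocess_t *p)
{
    fprintf (stderr, "completed\n");
}

int main (void)
{
    flux_t *h = flux_open (NULL, 0);
    char *argv[] = { "true", NULL };
    flux_cmd_t *cmd = flux_cmd_create (1, argv, environ);
    flux_subprocess_ops_t ops = {
        .on_state_change = state_cb,
        .on_completion = completion_cb,
    };

    /* The env var rides along with the command and is checked by
     * subprocess_local_setup() on the server side. */
    flux_cmd_setenvf (cmd, 1, "SUBPROCESS_LOCAL_FAIL", "1");

    flux_subprocess_t *p = flux_rexec (h, 1, 0, cmd, &ops);
    if (!p)
        return 1;
    flux_reactor_run (flux_get_reactor (h), 0);
    flux_subprocess_destroy (p);
    flux_cmd_destroy (cmd);
    flux_close (h);
    return 0;
}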

grondo avatar May 10 '24 18:05 grondo

Hmm an extra note for later

static void proc_completion_cb (flux_subprocess_t *p)                                                    
{                                                                                                        
    subprocess_server_t *s = flux_subprocess_aux_get (p, srvkey);                                        
    const flux_msg_t *request = flux_subprocess_aux_get (p, msgkey);                                     
                                                                                                         
    if (p->state != FLUX_SUBPROCESS_FAILED) {                                                            
        /* no fallback if this fails */                                                                  
        if (flux_respond_error (s->h, request, ENODATA, NULL) < 0) {                                     
            llog_error (s,                                                                               
                        "error responding to rexec.exec request: %s",                                    
                        strerror (errno));                                                               
        }                                                                                                
    }                                                                                                    
                                                                                                         
    proc_delete (s, p);                                                                                  
}           

A) I think state != FAILED is assumed to mean state == EXITED, which may not be a good assumption

B) BUT why is the completion callback being called at all? I'm wondering whether, during process setup, there is a tiny window where the callbacks have been set up and an error occurs before we return to the user.

chu11 avatar May 10 '24 18:05 chu11

Unfortunately it isn't as simple as the above. If there's an error in fork/spawn on the remote side then that error is propagated back in a response to rexec.exec which properly "completes" the subprocess without calling remote_completion(). Somehow we must be getting farther along than that, then explicitly sending ENODATA.

Good point. Perhaps we should instead say that a failure occurs early enough that the subprocess never transitions to RUNNING.

chu11 avatar May 10 '24 18:05 chu11

@chu11, if you want to take point on the fix, would it be possible to demote this assertion to a logged error by today? If we can get that patch ready and somewhat tested, I could throw it into a flux-core-0.62.0-2 RPM so if it happens again we're ready with a fix on elcap.

lemme work on an assertion demotion patch
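
Roughly what I have in mind (a sketch only; whether llog_error() is the right logging call here, and what cleanup should follow, still needs to be worked out against the actual subprocess.c):

/* Sketch: instead of asserting, log the unexpected state and bail out
 * of the completion check so the broker doesn't crash.  The real patch
 * will need to decide how to clean up / report the failure to the user.
 */
static void subprocess_check_completed (flux_subprocess_t *p)
{
    if (p->state != FLUX_SUBPROCESS_EXITED) {
        llog_error (p,
                    "subprocess_check_completed: unexpected state %s",
                    flux_subprocess_state_string (p->state));
        return;
    }
    /* ... existing completion logic ... */
}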

chu11 avatar May 10 '24 18:05 chu11

Perhaps we should instead say that a failure occurs early enough that the subprocess never transitions to RUNNING.

So I've been trying to see if there is a possible race where the state change callback somehow does not send states RUNNING and EXITED to the client before ENODATA is sent, but I have been unable to find one yet. Will have to look at this further later.

chu11 avatar May 10 '24 19:05 chu11

Per a comment in #5959:


FWIW I contemplated adding a state check here instead, like:

static void remote_completion (flux_subprocess_t *p)
{
    p->remote_completed = true;
    /* TBON inorder delivery of messages should guarantee we received
     * FLUX_SUBPROCESS_EXITED before this.
     */
    if (p->state == FLUX_SUBPROCESS_EXITED)
        subprocess_check_completed (p);
}

I think this may be the more "correct" fix, but it assumes that code is racy and subprocess_check_completed() would be called elsewhere at appropriate points in time. Since I didn't know for sure what was happening, the proposed fix is the current "safer" one.


So we may want to revert #5959 (or adjust it in whatever way we want) and propose the above once we're confident about what caused this crash.

chu11 avatar May 10 '24 23:05 chu11

After a lot of code review, I just cannot see a path in which the on_completion callback can be called before the on_state_change callback. And this check for state EXITED being reported in completed_check_cb() should ensure that:

static void completed_check_cb (flux_reactor_t *r,                                                                                          
                                flux_watcher_t *w,                                                                                          
                                int revents,                                                                                                
                                void *arg)                                                                                                  
{                                                                                                                                           
    flux_subprocess_t *p = arg;                                                                                                             
                                                                                                                                            
    assert (p->completed);                                                                                                                  
                                                                                                                                            
    flux_watcher_stop (p->completed_idle_w);                                                                                                
                                                                                                                                            
    /* always a chance caller may destroy subprocess in callback */                                                                         
    subprocess_incref (p);                                                                                                                  
                                                                                                                                            
    /* There is a small "racy" component, where the state we're at may                                                                      
     * not yet align with the state that has been reported to the                                                                           
     * user.  We would like to report state EXITED to the user before                                                                       
     * calling the completion callback.                                                                                                     
     *                                                                                                                                      
     * If no state change callback was specified, we must have reached                                                                      
     * state FLUX_SUBPROCESS_EXITED to have reached this point.                                                                             
     */                                                                                                                                     
    if (!p->ops.on_state_change                                                                                                             
        || p->state_reported == FLUX_SUBPROCESS_EXITED) {                                                                                   
        if (p->ops.on_completion)                                                                                                           
            (*p->ops.on_completion) (p);                                                                                                    
                                                                                                                                            
        flux_watcher_stop (p->completed_prep_w);                                                                                            
        flux_watcher_stop (p->completed_check_w);                                                                                           
    }                                                                                                                                       
                                                                                                                                            
    subprocess_decref (p);                                                                                                                  
}                         

My alternate theory now is that some error occurs, the code does not handle it correctly and never sets state FAILED, so the state stays at INIT, leading to the error path we've seen here (proc_completion_cb() sends ENODATA, etc.).

chu11 avatar May 10 '24 23:05 chu11

Further code review has not uncovered a path by which an unexpected error could cause this.

A few notes/thoughts

A) it appears the remote broker restarted around the time of the OOMs as well; that could be related

B) I cannot see it, but is it possible the ENODATA comes through some other means than the subprocess server? For example, if the broker were to die/exit/oom suddenly, could streamed RPCs be sent ENODATA?

I began looking for some clue that this could be true, because at some point it just seems amazingly unlikely that the stream from the subprocess server could send ENODATA before sending the pid / state change to RUNNING.

C) if this occurs more often, I think some type of "catastrophic" handling of this case in libsubprocess might be best: basically just clean itself up and note that the remote process is toast. This needs more thought.

D) If there were errors, I suspect they'd be collected by the failed broker and the imp in some fashion (some stuff possibly logged to stderr) and possibly stored in systemd since it was a failed unit.

I'm not sure how we can get the logs from the previously departed flux broker, though.

I was not able to find any evidence of the failed imp units, but 1) I'm not sure if we save failed units (need to double check) and 2) it's possible I am not using the right systemctl runes; I think systemctl --user --failed should work to list all failed user units (after setting DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/$(id -u flux)/bus).

chu11 avatar May 11 '24 05:05 chu11

Just wanted to be sure we're all on the same page since some comments indicate otherwise: the subprocess server in play here is the one in the sdexec module (indicated as the service in the struct dump), not the one in libsubprocess. I think we've only recently switched over to that one on the big system, so it seems plausible that it's not behaving correctly under oom conditions or whatever and we're only just now encountering it.

if the broker were to die/exit/oom suddenly, could streamed RPCs be sent ENODATA?

It shouldn't. If the connection to the broker running the subprocess server is lost, code in the broker on the client side generates EHOSTUNREACH:

https://github.com/flux-framework/flux-core/blob/master/src/broker/overlay.c#L720

If the sdexec module gets unloaded while it still has clients, e.g. if the broker were to try to shut down for some reason, any pending requests should get ENOSYS:

https://github.com/flux-framework/flux-core/blob/master/src/modules/sdexec/sdexec.c#L935

I think systemctl --user --failed should work to list all failed user units (after setting DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/$(id -u flux)/bus)

With 0.62.0 you can just do sudo flux exec -r RANK systemctl --user --failed. The env is set up as of #5903.

FYI sdexec tries to clean up failed units

garlick avatar May 11 '24 13:05 garlick

the subprocess server in play here is the one in the sdexec module (indicated as the service in the struct dump), not the one in libsubprocess

Ah, yes. Good catch. (At least I had lost sight of that when I was quickly trying to reproduce, and probably led @chu11 in the wrong direction, sorry!)

grondo avatar May 11 '24 14:05 grondo

the subprocess server in play here is the one in the sdexec module (indicated as the service in the struct dump), not the one in libsubprocess.

Ohh crud, I completely forgot!

FYI sdexec tries to clean up failed units

Ahhh ok. I remember with my prototype I left failed units running for later analysis. Perhaps we should have an option for that in the future for situations such as this? I suppose it could be system wide or a per job option?

chu11 avatar May 11 '24 14:05 chu11

Ahhh ok. I remember with my prototype I left failed units running for later analysis. Perhaps we should have an option for that in the future for situations such as this? I suppose it could be system wide or a per job option?

I think the problem is that any non "exit 0" behavior on the part of the job results in a nonzero shell+imp exit and a failed unit, so they could potentially pile up quickly. But maybe it's still something we should do or could do with cleanup of old jobs after some high water mark. If you think that's a good idea let's get an issue open.

garlick avatar May 11 '24 14:05 garlick

Looking through the code in sdexec, I'm wondering if going through systemd is inherently racy. So much of the sdexec code operates under the assumption that certain statuses/states/etc. will be set and recorded by systemd and reported to sdexec in a certain order that things can fall apart if that doesn't happen. (Edit: or, to think about it a slightly different way, we expect things to behave a bit like fork() + exec(), but that probably isn't a perfect assumption.)

For example, if the job is OOMed before systemd has internally decided to report the PID to "listeners", perhaps it'll just not report the PID. And now the sdexec code goes south as a result.

Example

static void finalize_exec_request_if_done (struct sdproc *proc)                                                                          
{                                                                                                                                        
    if (sdexec_unit_state (proc->unit) == STATE_INACTIVE                                                                                 
        && sdexec_unit_substate (proc->unit) == SUBSTATE_DEAD                                                                            
        && (!proc->out || proc->out_eof_sent)                                                                                            
        && (!proc->err || proc->err_eof_sent)) {                                                                                         
                                                                                                                                         
        /* If there was an exec error, fail with ENOENT.                                                                                 
         * N.B. we have no way of discerning which exec(2) error occurred,                                                               
         * so guess ENOENT.  It could actually be EPERM, for example.                                                                    
         */                                                                                                                              
        if (sdexec_unit_has_failed (proc->unit)) {                                                                                       
            flux_error_t error;                                                                                                          
            errprintf (&error,                                                                                                           
                       "unit process could not be started (systemd error %d)",                                                           
                       sdexec_unit_systemd_error (proc->unit));                                                                          
            exec_respond_error (proc, ENOENT, error.text);                                                                               
        }                                                                                                                                
        else                                                                                                                             
            exec_respond_error (proc, ENODATA, NULL);                                                                                    
    }                                                                                                                                    
}  

Some unexpected special error code / status / something that we don't yet understand would lead to sdexec_unit_has_failed() returning false and would explain the unexpected ENODATA.

I can give it the good ole college try to see if I can OOM something and get an idea of what behavior may result. However, I suspect this is insanely racy. Perhaps we just have to modify the libsubprocess client side (i.e. the remote.c code) to handle these exceptional cases in a special way.
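
For example, the ENODATA branch on the sdexec side could get one more guard (purely a sketch; proc->started_sent is a hypothetical flag recording whether the response reporting the PID / start of the unit was ever delivered to the client):

/* Sketch: don't send a "clean" ENODATA if the unit went inactive/dead
 * without the start of the process ever being reported.
 * N.B. proc->started_sent is hypothetical; the surrounding checks are
 * the ones already in finalize_exec_request_if_done() above.
 */
if (sdexec_unit_has_failed (proc->unit)) {
    /* existing ENOENT path */
}
else if (!proc->started_sent) {
    exec_respond_error (proc,
                        EPROTO,
                        "unit finished before start could be reported");
}
else
    exec_respond_error (proc, ENODATA, NULL);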

chu11 avatar May 12 '24 17:05 chu11

Ah yeah, it does seem like there should be more criteria checked before ENODATA is returned in that function.

Perhaps we just have to modify the libsubprocess client side (i.e. remote.c code) to just handle these exceptional cases in a special way.

I'll just note that things changed a bit when a new server implementation was added, so assertions that assured self-consistency before, when the client and server were one code base, are no longer appropriate. My apologies for perturbing the landscape without auditing the code for this sort of situation. (Well, I did try but obviously didn't quite get there.)

IMHO this situation should now be handled by the client as an EPROTO error rather than an assertion.

Also it may be wise to add protocol invariants like the one violated here to RFC 42, which is currently just an outline that defines payloads. For this one, something could be added to the exec error response section.

garlick avatar May 12 '24 17:05 garlick

IMHO this situation should now be handled by the client as an EPROTO error rather than an assertion.

Agreed. My gut feeling is that because we're launching processes with a third-party service, "fork() + exec()"-like behavior just can't be assumed. Take my example above: can we be guaranteed to get a PID for the launched process as long as the process is forked? Under normal fork(), yes, but with systemd maybe not; it may simply choose not to send it under some catastrophic error scenarios.

I'll review the sdexec code some more, but I'm thinking we should add some comments indicating the assumptions and their potential pitfalls.

On the client side, unexpected ENODATA and possibly some other scenarios should be treated as an "internal fatal error" and handled accordingly.

chu11 avatar May 12 '24 20:05 chu11

On the client side, unexpected ENODATA and possibly some other scenarios should be treated as an "internal fatal error" and handled accordingly.

I assume that would mean on_state_change() is called with a state of FLUX_SUBPROCESS_FAILED and that we could provide some error detail with flux_subprocess_fail_error(), like

protocol error: server returned success without returning 'started' response

garlick avatar May 13 '24 14:05 garlick

I assume that would mean on_state_change() is called with a state of FLUX_SUBPROCESS_FAILED and that we could provide some error detail with flux_subprocess_fail_error(), like

Skimming the code, yes, on_state_change() will be called with FLUX_SUBPROCESS_FAILED (I'm thinking via the "goto error" path in rexec_continuation()), and we should be able to provide a user-friendly error.
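
Something along these lines in rexec_continuation() (a sketch, assuming set_failed() and the error label behave as they do in remote.c today; the error wording is just an example):

/* Sketch: treat ENODATA before the EXITED state as a protocol error
 * instead of asserting in subprocess_check_completed(). */
if (subprocess_rexec_get (f) < 0) {
    if (errno == ENODATA) {
        if (p->state != FLUX_SUBPROCESS_EXITED) {
            errno = EPROTO;
            set_failed (p,
                        "protocol error: remote server ended the exec "
                        "stream before reporting process exit");
            goto error;
        }
        remote_completion (p);
        return;
    }
    set_failed (p, "%s", future_strerror (f, errno));
    goto error;
}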

chu11 avatar May 13 '24 15:05 chu11