flux-core icon indicating copy to clipboard operation
flux-core copied to clipboard

sdexec: do not clean up failed units

Open chu11 opened this issue 1 year ago • 2 comments

Per discussion in #5956, it may be useful at times to not cleanup failed systemd units for post analysis later.

The way this was handled in the old libsdprocess was through:

    /* In sdprocess, we require the systemd unit to exist until the                                                                            
     * user cleans it up with sdprocess_systemd_cleanup().  This                                                                               
     * ensures consistent behavior in a number of functions.  For                                                                              
     * example, a function like sdprocess_wait() can be called                                                                                 
     * multiple times.  Therefore, we set RemainAfterExit to true                                                                              
     * for every process we start.                                                                                                             
     */                                                                                                                                        
    if ((ret = sd_bus_message_append (m,                                                                                                       
                                      "(sv)",                                                                                                  
                                      "RemainAfterExit",                                                                                       
                                      "b",                                                                                                     
                                      true)) < 0) {                                                                                            
        set_errno_log (sdp->h, ret, "sd_bus_message_append");                                                                                  
        return -1;                                                                                                                             
    }      

This apparently keeps all units around. This would pile up old units quickly and not be good, so perhaps we could make this a per-job configuration, in case we need to debug a very specific job that is failing in a unique way. Then after that job was done we'd clean up the successful units. Or somewhere in the sdexec module it would do the cleanup of successful units.

chu11 avatar May 12 '24 16:05 chu11

We do set RemainAfterExit on all units now:

https://github.com/flux-framework/flux-core/blob/master/src/common/libsdexec/start.c#L302

As I recall, successful units do not remain but failed ones do, until ResetFailedUnit is called:

https://github.com/flux-framework/flux-core/blob/master/src/modules/sdexec/sdexec.c#L283

garlick avatar May 12 '24 18:05 garlick

~~Ahh, hmmm then why the failed one from #5956 didn't stick around is a mystery (although it's possible I did not use quite the right systemctl runes ...)~~ neverind, I misread your prior comment.

So perhaps we shouldn't called ResetFailedUnit on some special config.

chu11 avatar May 12 '24 23:05 chu11