flux-core
                                
                                 flux-core copied to clipboard
                                
                                    flux-core copied to clipboard
                            
                            
                            
                        sdexec: do not clean up failed units
Per discussion in #5956, it may be useful at times to not cleanup failed systemd units for post analysis later.
The way this was handled in the old libsdprocess was through:
    /* In sdprocess, we require the systemd unit to exist until the                                                                            
     * user cleans it up with sdprocess_systemd_cleanup().  This                                                                               
     * ensures consistent behavior in a number of functions.  For                                                                              
     * example, a function like sdprocess_wait() can be called                                                                                 
     * multiple times.  Therefore, we set RemainAfterExit to true                                                                              
     * for every process we start.                                                                                                             
     */                                                                                                                                        
    if ((ret = sd_bus_message_append (m,                                                                                                       
                                      "(sv)",                                                                                                  
                                      "RemainAfterExit",                                                                                       
                                      "b",                                                                                                     
                                      true)) < 0) {                                                                                            
        set_errno_log (sdp->h, ret, "sd_bus_message_append");                                                                                  
        return -1;                                                                                                                             
    }      
This apparently keeps all units around.  This would pile up old units quickly and not be good, so perhaps we could make this a per-job configuration, in case we need to debug a very specific job that is failing in a unique way.  Then after that job was done we'd clean up the successful units.  Or somewhere in the sdexec module it would do the cleanup of successful units.
We do set RemainAfterExit on all units now:
https://github.com/flux-framework/flux-core/blob/master/src/common/libsdexec/start.c#L302
As I recall, successful units do not remain but failed ones do, until ResetFailedUnit is called:
https://github.com/flux-framework/flux-core/blob/master/src/modules/sdexec/sdexec.c#L283
~~Ahh, hmmm then why the failed one from #5956 didn't stick around is a mystery (although it's possible I did not use quite the right systemctl runes ...)~~ neverind, I misread your prior comment.
So perhaps we shouldn't called ResetFailedUnit on some special config.