flux-core
                                
                        flux jobs does not display CLEANUP state by default
Problem: if a job is stuck completing for whatever reason (such as the painfully slow epilog on LC systems, but another one would be a hung node), flux jobs shows the job as R (running).  This abbreviated state is obscuring what is really going on.
Just now I tried to run a program with the wrong path on corona and got the appropriate error from flux mini run.   It's now about 10 minutes later and it's still showing as R.  If it showed as completing, it would be more obvious what is happening.
Most of the job states exist because they convey information about the job that could be useful to users. We may want to re-evaluate the decision to only show the "high level" states by default.
I assume you mean "COMPLETED" or are you referring to adding a new state called "COMPLETING"?
I'm wondering if the "virtual states" we have listed in https://github.com/flux-framework/rfc/blob/master/spec_21.rst should perhaps split out RUNNING state into a RUNNING and CLEANUP.  That would make it more clear that the job is almost done and we're just awaiting cleanup.
In addition, @garlick made a case that displaying the virtual state in flux jobs is also obscuring information about the various "PENDING" states. I.e. it would be nice to know if the job is stuck in PRIORITY or DEPEND by default. The virtual states are most useful for querying jobs, perhaps not as useful as a default in flux-jobs? So, maybe we should just display the actual job state by default.
I guess I'm arguing that we just display the actual states not the "virtual" states. The actual states are
NEW, DEPEND, PRIORITY, SCHED, RUN, CLEANUP, INACTIVE.
https://flux-framework.readthedocs.io/projects/flux-rfc/en/latest/spec_21.html
(Edit: sorry I was sloppy in my comment above - I meant "CLEANUP" not "completing")
So the initial reason for the status output is that we felt PENDING, RUNNING, COMPLETED, CANCELLED, FAILED was the subset of information users really wanted, and that "depend", "priority", "sched" have little meaning to them, not to mention that just listing the job state wouldn't show whether a job was successful, failed, or cancelled. EDIT: I guess we could combine job state with the other information in some way??
Here's a random idea, I opened this eons ago:
https://github.com/flux-framework/flux-core/issues/2627
Perhaps a verbose option with extra info that developers/admins would be interested in would be a good idea?
My job that failed to start was listed as R in the ST column. That's just weird.
I vote we expose all of the primary states in the default output and skip the virtual states. Since inactive jobs are not shown by default, whether we show a virtual state for them or not is maybe a different question (F for failed is genuinely useful there instead of INACTIVE IMHO).
Seeing the current view in action and expecting to have to explain it to people (although so far everyone's been far too nice) makes me think our original assessment about "what users want" was likely incorrect.
well, job status is:
PENDING, RUNNING, COMPLETED, FAILED, or CANCELED
COMPLETED, FAILED, or CANCELED are just replacements for INACTIVE given what happened with a job.  Perhaps we can just axe PENDING/RUNNING and replace them with DEPEND, PRIORITY, SCHED, or RUN.  So status is really a combination of job_state and result.
(and don't forget CLEANUP)
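To make the "status is a combination of job_state and result" idea concrete, here is a minimal Python sketch. It is not the flux-core implementation: `display_status`, `PRIMARY_STATES`, and `RESULTS` are hypothetical names made up for illustration, and the set of results (including TIMEOUT, which comes up later in this thread) is taken from this discussion rather than from the code.

```python
# Hypothetical sketch of the proposal in this thread, not flux-core code.
# Primary job states as listed earlier in the discussion (RFC 21).
PRIMARY_STATES = ("NEW", "DEPEND", "PRIORITY", "SCHED", "RUN", "CLEANUP", "INACTIVE")

# Results that would replace INACTIVE in the displayed status (assumed set).
RESULTS = ("COMPLETED", "FAILED", "CANCELED", "TIMEOUT")


def display_status(state, result=None):
    """Return the status to show for a job.

    Active jobs show their primary state unchanged, so a job stuck in
    CLEANUP is shown as CLEANUP rather than RUNNING. Inactive jobs show
    their result instead of INACTIVE.
    """
    if state not in PRIMARY_STATES:
        raise ValueError(f"unknown job state: {state!r}")
    if state != "INACTIVE":
        return state
    if result not in RESULTS:
        raise ValueError(f"inactive job needs a known result, got {result!r}")
    return result


if __name__ == "__main__":
    print(display_status("CLEANUP"))             # job waiting on the epilog
    print(display_status("INACTIVE", "FAILED"))  # job that failed to start
```

With this shape, the job from the original report would be listed as CLEANUP while the epilog runs and as FAILED once it goes inactive.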
I think this is what we're arguing for. For INACTIVE jobs we can split out into COMPLETED, FAILED, CANCELED and TIMEOUT since that is useful. It just turned out that combining all the PENDING/RUNNING states was probably not the right approach (However, those are still useful for filtering jobs, so I say we keep the virtual PENDING and RUNNING for that purpose)
Yup, that's what I'm thinking too now. I'll create a new issue for TIMEOUT though, so that is a separate issue.
TIMEOUT is already displayed. No need for another issue.
Seeing the current view in action and expecting to have to explain it to people (although so far everyone's been far too nice) makes me think our original assessment about "what users want" was likely incorrect.
I'm not sure we ever said that what users want is just PENDING vs RUNNING. I think we were more focused on splitting inactive into its virtual status. However, what they do want is to not have to list all the "pending" job states to get a list of jobs that are truly pending, so we should definitely keep that functionality.
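Keeping the virtual states for filtering only could look roughly like the sketch below. This is an illustration under assumptions, not how flux-jobs is implemented: `VIRTUAL_STATES`, `expand_filter`, and `filter_jobs` are hypothetical names, jobs are modeled as plain dicts, and the PENDING/RUNNING groupings are the ones described in this thread.

```python
# Hypothetical sketch, not flux-core code: virtual states are accepted as
# filter names and expanded to primary states, which are what gets displayed.
VIRTUAL_STATES = {
    "PENDING": {"DEPEND", "PRIORITY", "SCHED"},
    "RUNNING": {"RUN", "CLEANUP"},
}


def expand_filter(names):
    """Expand a list of filter names into a set of primary state names."""
    expanded = set()
    for name in names:
        key = name.upper()
        # A virtual name expands to its members; anything else is kept as-is.
        expanded |= VIRTUAL_STATES.get(key, {key})
    return expanded


def filter_jobs(jobs, names):
    """Return the jobs whose primary state matches any of the filter names."""
    wanted = expand_filter(names)
    return [job for job in jobs if job["state"] in wanted]


if __name__ == "__main__":
    jobs = [
        {"id": 1, "state": "DEPEND"},
        {"id": 2, "state": "SCHED"},
        {"id": 3, "state": "CLEANUP"},
    ]
    # One filter name selects every truly pending job, and the listing
    # still shows which pending state each job is in.
    for job in filter_jobs(jobs, ["pending"]):
        print(job["id"], job["state"])
```

The point of the design is that users type one name to select all pending jobs, while the output column keeps the more informative primary state.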