flux-core

flux jobs does not display CLEANUP state by default

Open garlick opened this issue 3 years ago • 10 comments

Problem: if a job is stuck completing for whatever reason (for example the painfully slow epilog on LC systems, or a hung node), flux jobs shows the job as R (running). This abbreviated state obscures what is really going on.

Just now I tried to run a program with the wrong path on corona and got the appropriate error from flux mini run. It's now about 10 minutes later and it's still showing as R. If it showed as completing, it would be more obvious what is happening.

Most of the job states exist because they convey information about the job that could be useful to users. We may want to re-evaluate the decision to only show the "high level" states by default.

garlick avatar Aug 16 '22 16:08 garlick

I assume you mean "COMPLETED", or are you referring to adding a new state called "COMPLETING"?

I'm wondering if the "virtual states" we have listed in https://github.com/flux-framework/rfc/blob/master/spec_21.rst should perhaps split the RUNNING state into RUNNING and CLEANUP. That would make it clearer that the job is almost done and we're just awaiting cleanup.

chu11 avatar Aug 16 '22 17:08 chu11

In addition, @garlick made a case that displaying the virtual state in flux jobs is also obscuring information about the various "PENDING" states. I.e. it would be nice to know if the job is stuck in PRIORITY or DEPEND by default. The virtual states are most useful for querying jobs, perhaps not as useful as a default in flux-jobs? So, maybe we should just display the actual job state by default.

grondo avatar Aug 16 '22 17:08 grondo

I guess I'm arguing that we just display the actual states, not the "virtual" states. The actual states are:

NEW, DEPEND, PRIORITY, SCHED, RUN, CLEANUP, INACTIVE.

https://flux-framework.readthedocs.io/projects/flux-rfc/en/latest/spec_21.html
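
The relationship between the primary states and the virtual states can be sketched in Python as follows. This is a minimal illustration, not flux-core code: the `JobState` class, its flag values, and `virtual_state_name` are hypothetical names chosen for this example. Only the state names and the groupings discussed in this thread (PENDING covering DEPEND, PRIORITY and SCHED; RUNNING covering RUN and CLEANUP) come from the discussion.

```python
from enum import Flag, auto


class JobState(Flag):
    """Primary job states listed above (flag values are illustrative only)."""
    NEW = auto()
    DEPEND = auto()
    PRIORITY = auto()
    SCHED = auto()
    RUN = auto()
    CLEANUP = auto()
    INACTIVE = auto()


# Virtual states group several primary states together (assumed grouping).
PENDING = JobState.DEPEND | JobState.PRIORITY | JobState.SCHED
RUNNING = JobState.RUN | JobState.CLEANUP


def virtual_state_name(state: JobState) -> str:
    """Collapse a primary state into the coarser virtual state name."""
    if state in PENDING:
        return "PENDING"
    if state in RUNNING:
        return "RUNNING"
    return state.name  # NEW and INACTIVE have no virtual grouping here


# A job stuck in CLEANUP is reported under the RUNNING virtual state,
# which is the loss of information this issue is about.
print(virtual_state_name(JobState.CLEANUP))
```

Collapsing CLEANUP into RUNNING is what makes a job that is waiting on an epilog look the same as a job that is still executing.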

(Edit: sorry I was sloppy in my comment above - I meant "CLEANUP" not "completing")

garlick avatar Aug 16 '22 17:08 garlick

So the initial reason for the status output is that we felt PENDING, RUNNING, COMPLETED, CANCELLED, FAILED was the subset of information users really wanted, and that "depend", "priority", "sched" have little meaning to them. Not to mention that just listing the job state wouldn't show whether a job was successful, failed, or cancelled. EDIT: I guess we could combine job state with the other information in some way??

Here's a random idea; I opened this eons ago:

https://github.com/flux-framework/flux-core/issues/2627

Perhaps a verbose option with extra info that developers/admins would be interested in would be a good idea?

chu11 avatar Aug 16 '22 17:08 chu11

My job that failed to start was listed as R in the ST column. That's just weird.

I vote we expose all of the primary states in the default output and skip the virtual states. Since inactive jobs are not shown by default, whether we show a virtual state for them or not is maybe a different question (F for failed is genuinely useful there instead of INACTIVE IMHO).

Seeing the current view in action and expecting to have to explain it to people (although so far everyone's been far too nice) makes me think our original assessment about "what users want" was likely incorrect.

garlick avatar Aug 16 '22 18:08 garlick

well, job status is:

PENDING, RUNNING, COMPLETED, FAILED, or CANCELED

COMPLETED, FAILED, or CANCELED are just replacements for INACTIVE given what happened with a job. Perhaps we can just axe PENDING/RUNNING and replace them with DEPEND, PRIORITY, SCHED, or RUN. So status is really a combination of job_state and result.

chu11 avatar Aug 16 '22 18:08 chu11

> Perhaps we can just axe PENDING/RUNNING and replace them with DEPEND, PRIORITY, SCHED, or RUN. So status is really a combination of job_state and result.

(and don't forget CLEANUP)

I think this is what we're arguing for. For INACTIVE jobs we can split out into COMPLETED, FAILED, CANCELED and TIMEOUT since that is useful. It just turned out that combining all the PENDING/RUNNING states was probably not the right approach (However, those are still useful for filtering jobs, so I say we keep the virtual PENDING and RUNNING for that purpose)
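
A minimal Python sketch of the "status is a combination of job_state and result" idea, assuming plain string names for states and results. The function `display_status` and the two tuples are hypothetical and are not part of flux-core or its Python bindings; they only illustrate the rule proposed here: show the primary state for active jobs, and the result for INACTIVE jobs.

```python
from typing import Optional

# Primary states shown as-is while the job is not yet inactive.
NON_INACTIVE_STATES = ("NEW", "DEPEND", "PRIORITY", "SCHED", "RUN", "CLEANUP")
# Results that replace INACTIVE in the displayed status.
RESULTS = ("COMPLETED", "FAILED", "CANCELED", "TIMEOUT")


def display_status(state: str, result: Optional[str] = None) -> str:
    """Return the status string to show for one job (hypothetical helper)."""
    if state == "INACTIVE":
        # An inactive job is described by how it ended, not by its state.
        if result not in RESULTS:
            raise ValueError(f"inactive job needs a known result, got {result!r}")
        return result
    if state in NON_INACTIVE_STATES:
        # No collapsing into PENDING/RUNNING: CLEANUP stays visible.
        return state
    raise ValueError(f"unknown job state {state!r}")


print(display_status("CLEANUP"))
print(display_status("INACTIVE", "FAILED"))
```

With this rule a job stuck in its epilog would display CLEANUP rather than R, while a finished job would still display how it ended.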

grondo avatar Aug 16 '22 18:08 grondo

> I think this is what we're arguing for. For INACTIVE jobs we can split out into COMPLETED, FAILED, CANCELED and TIMEOUT since that is useful. It just turned out that combining all the PENDING/RUNNING states was probably not the right approach (However, those are still useful for filtering jobs, so I say we keep the virtual PENDING and RUNNING for that purpose)

Yup, that's what I'm thinking too now. I'll create a new issue for TIMEOUT though, so that is a separate issue.

chu11 avatar Aug 16 '22 18:08 chu11

TIMEOUT is already displayed. No need for another issue.

grondo avatar Aug 16 '22 18:08 grondo

> Seeing the current view in action and expecting to have to explain it to people (although so far everyone's been far too nice) makes me think our original assessment about "what users want" was likely incorrect.

I'm not sure we ever said that what users want is just PENDING vs RUNNING. I think we were more focused on splitting inactive into its virtual status. However, what they do want is to not have to list all the "pending" job states to get a list of jobs that are truly pending, so we should definitely keep that functionality.
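
Keeping the virtual states for filtering while displaying primary states could look roughly like the sketch below. Everything here is an assumption made for illustration: the `VIRTUAL_STATES` table, `expand_filter`, `filter_jobs`, and the sample job records are hypothetical and do not reflect the actual flux-jobs implementation or its options.

```python
# Virtual state names expand to the primary states they cover (assumed grouping).
VIRTUAL_STATES = {
    "PENDING": {"DEPEND", "PRIORITY", "SCHED"},
    "RUNNING": {"RUN", "CLEANUP"},
}


def expand_filter(names):
    """Expand virtual state names into the set of primary states to match."""
    wanted = set()
    for name in names:
        key = name.upper()
        # A primary state name simply matches itself.
        wanted |= VIRTUAL_STATES.get(key, {key})
    return wanted


def filter_jobs(jobs, names):
    """Select jobs whose primary state matches any requested filter name."""
    wanted = expand_filter(names)
    return [job for job in jobs if job["state"] in wanted]


# Hypothetical job records; the display still uses the primary state.
jobs = [
    {"id": "job-a", "state": "DEPEND"},
    {"id": "job-b", "state": "SCHED"},
    {"id": "job-c", "state": "CLEANUP"},
]
for job in filter_jobs(jobs, ["pending"]):
    print(job["id"], job["state"])
```

The user asks for "pending" once, and the listing still shows whether each matching job is in DEPEND, PRIORITY, or SCHED.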

grondo avatar Aug 16 '22 19:08 grondo