flux-core icon indicating copy to clipboard operation
flux-core copied to clipboard

job-info: return job state as number encourages hardcoding values in shell scripts

Open garlick opened this issue 5 years ago • 11 comments

Problem: job-info.list returns job states as numerical values. This encourages people parsing the output in a bash script with jq to hard code the enum values.

It would be better IMHO to return the state names (strings) from RFC 21.

That might work out better in the python bindings too...

garlick avatar Jan 15 '20 15:01 garlick

Seems like a good idea, instead of the user converting numeral to string.

chu11 avatar Jan 15 '20 15:01 chu11

Another approach would be to add flux job statestr N utility that wraps flux_job_statetostr() and allow scripts to use that. I'm ambivalent either way. (though if states ever start getting other flags masked in, sending an integer would start to be much more compact, instead of just a little more compact)

grondo avatar Jan 15 '20 17:01 grondo

That's maybe a better idea, thanks!

garlick avatar Jan 15 '20 19:01 garlick

Granted flux job is a plumbing tool, but I just have trouble seeing the use case where someone needs the number of a job state returned from job-info.list, so I like the idea of returning the string.

If we really need a numeral someday, we could just add a new attribute in the job-info.list service, like state-number or something.

chu11 avatar Jan 15 '20 19:01 chu11

I wasn't considering a requirement for a state number instead of string, but instead was considering the size of underlying json strings. Probably not an issue at this point since it would just add a few characters per job.

grondo avatar Jan 15 '20 21:01 grondo

It occurred to me this morning that switching to a state string may be a little more inconvenient in C/C++ code. E.g. @dongahn's MPIR PR we have:

    if (state > FLUX_JOB_RUN)

We'd likely at least want to add flux_job_stringtostate (s) to the API if state was returned as a string, though the state as integer seems much more convenient here.

grondo avatar Jan 18 '20 15:01 grondo

OK, I'm thinking now it may be less bad to keep the numbers. It will be irritating to have to convert in order to use it it in masks and switch statements in C.

if (state > FLUX_JOB_RUN)

This is probably a bit unsafe though, in case we ever add a state and tack it on the end numerically to avoid breaking ABI (and it doesn't "come after" RUN).

Would it be appropriate to formalize the numbers and maybe the single character abbreviations in RFC 21?

garlick avatar Jan 18 '20 17:01 garlick

This is probably a bit unsafe though, in case we ever add a state and tack it on the end numerically to avoid breaking ABI (and it doesn't "come after" RUN).

Good point!

Would it be appropriate to formalize the numbers and maybe the single character abbreviations in RFC 21?

Good idea. I wonder if we'll also want to explicitly state that the order of the state numbers is meaningless, and provide some convenience functions to satisfy use cases like this. Actually maybe in the case pasted above (where flux job attach wants to detect if a job has not proceeded past the RUN state), the code should use the virtual states (FLUX_JOB_PENDING | FLUX_JOB_RUNNING).

I'll comment in #2654 to that effect.

Note that I'm still open to switching job states away from state numbers. As long as we've got the use cases easily covered. (e.g. even a python script may want to test for a virtual state like "pending" . This could be done with state strings just as easily as state numbers, just a lot more string comparisons)

grondo avatar Jan 18 '20 18:01 grondo

Actually maybe in the case pasted above (where flux job attach wants to detect if a job has not proceeded past the RUN state), the code should use the virtual states (FLUX_JOB_PENDING | FLUX_JOB_RUNNING).

Uh I guess that should have been FLUX_JOB_PENDING | FLUX_JOB_RUN, because the CLEANUP state would also be invalid for the debugger attach case.

grondo avatar Jan 18 '20 18:01 grondo

This issue has been automatically marked as stale because it has not had activity for 365 days. It will be closed if no further activity occurs within 14 days. Thank you for your contributions.

stale[bot] avatar Mar 05 '21 07:03 stale[bot]

i'm wondering if this issue should be resurrected given discussion in #4273, #4313, and #4234

  • we may want to dump large amounts of job data into the archive in 1 big json blob, so that's a lot of hard coded values going into the archive
  • i have a WIP branch that adds a new "states_mask" (bitmask of states that have been seen by a job) that would probably be needed for archiving, so thats more hardcoded stuff going into the archive.
  • additional job states may be coming: https://github.com/flux-framework/rfc/pull/306

chu11 avatar May 05 '22 14:05 chu11