jobs may have an R in KVS with no `alloc` event and therefore no `t_run` timestamp
While working with @cmoussa1 on flux-framework/flux-accounting#774, we were trying to understand how the issue was even reproducible.
To summarize, the accounting scripts were hitting an unexpected case where a job has an R (as returned from the job-info service), but no t_run timestamp (as returned from the job-list service). I'm not even sure this is a bug, but since it is somewhat unexpected behavior, it seemed like a good idea to document it in an issue.
I wrote a quick script to find specific cases of jobs that meet these criteria:
import flux
import flux.job

h = flux.Flux()
attrs = ["userid", "t_submit", "t_run", "t_inactive", "ranks"]
jobs = flux.job.job_list_inactive(h, attrs=attrs, max_entries=0).get_jobs()
print(f"checking {len(jobs)} inactive jobs")
for job in jobs:
    jobid = flux.job.JobID(job["id"])
    data = flux.job.job_kvs_lookup(h, jobid, keys=["R", "jobspec"], decode=False)
    if data is not None and "t_run" not in job:
        print(f"{jobid} has R, jobspec but no t_run")
and got quite a few hits.
Spot checking the results, these jobs were canceled before the alloc event, e.g.:
# flux job eventlog -H f2RWxG9jtekw
[Oct15 19:18] submit userid=65930 urgency=16 flags=0 version=1
[ +0.026443] jobspec-update attributes.system.project="*"
[ +0.026503] validate
[ +0.046737] depend
[ +0.046781] memo fairshare=0.80496699999999999
[ +0.046789] priority priority=80497
[Oct15 19:20] exception type="cancel" severity=0 note="" userid=65930
[ +0.000078] clean
but an R is in the KVS:
# flux job info f2RWxG9jtekw R
{"version": 1, "execution": {"R_lite": [{"rank": "242", "children": {"core": "0-95", "gpu": "0-3"}}], "nodelist": ["tuolumne1242"], "properties": {"pall": "242", "pbatch": "242"}, "starttime": 1760581205, "expiration": 1760667605}}
My guess is that this situation occurs when a job is canceled after the KVS commit request to place R in the KVS has been made, but before that commit is fulfilled, which is likely the point at which the alloc event is posted to the eventlog. This could perhaps be more likely on a very busy system.
I'm not sure if there's anything to do here. The job-manager could delete any R in the KVS when this situation occurs, but I'm not sure it is worth the effort, and may just be something to be aware of...
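In the meantime, a consumer such as the accounting scripts could decide whether a job actually ran from the eventlog rather than from the presence of R. Below is a minimal sketch of that check, assuming the raw eventlog text is available in its usual form of one JSON object per line with a `name` field. `parse_eventlog` and `job_was_allocated` are hypothetical helper names, not part of flux-core, and the sample timestamps are made up for illustration:

```python
import json


def parse_eventlog(text):
    """Parse raw eventlog text (one JSON object per line) into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def job_was_allocated(text):
    """Return True only if an alloc event was posted to the eventlog."""
    return any(entry.get("name") == "alloc" for entry in parse_eventlog(text))


# Abbreviated eventlog modeled on the canceled job above.
# Timestamps are illustrative, not the real values.
canceled = "\n".join(
    [
        '{"timestamp": 1760581100.0, "name": "submit", "context": {"userid": 65930, "urgency": 16, "flags": 0, "version": 1}}',
        '{"timestamp": 1760581100.1, "name": "validate"}',
        '{"timestamp": 1760581100.2, "name": "depend"}',
        '{"timestamp": 1760581100.3, "name": "priority", "context": {"priority": 80497}}',
        '{"timestamp": 1760581205.0, "name": "exception", "context": {"type": "cancel", "severity": 0, "note": "", "userid": 65930}}',
        '{"timestamp": 1760581205.1, "name": "clean"}',
    ]
)

# No alloc event, so the job is treated as never having run, even if R exists.
print(job_was_allocated(canceled))  # False
```

The raw eventlog could presumably be fetched the same way the script above fetches R, by passing `keys=["eventlog"]` with `decode=False` to `flux.job.job_kvs_lookup`; I'm assuming here that the value under `"eventlog"` comes back as the raw text.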