
jobs may have an R in KVS with no `alloc` event and therefore no `t_run` timestamp

Open grondo opened this issue 2 months ago • 0 comments

Working with @cmoussa1 on flux-framework/flux-accounting#774, we were trying to understand how the issue was even reproducible. To summarize, the accounting scripts were hitting an unexpected case where a job has an R (as returned from the job-info service) but no t_run timestamp (as returned from the job-list service). I'm not even sure this is a bug, but since it is somewhat unexpected behavior, it seemed like a good idea to document it in an issue.

I wrote a quick script to find jobs that meet these criteria:

import flux
import flux.job

h = flux.Flux()

attrs = ["userid", "t_submit", "t_run", "t_inactive", "ranks"]
jobs = flux.job.job_list_inactive(h, attrs=attrs, max_entries=0).get_jobs()
print(f"checking {len(jobs)} inactive jobs")
for job in jobs:
    jobid = flux.job.JobID(job["id"])
    data = flux.job.job_kvs_lookup(h, jobid, keys=["R", "jobspec"], decode=False)
    if data is not None and "t_run" not in job:
        print(f"{jobid} has R, jobspec but no t_run")

and got quite a few hits.

Spot-checking the results shows that these jobs were canceled before the alloc event, e.g.:

# flux job eventlog -H f2RWxG9jtekw
[Oct15 19:18] submit userid=65930 urgency=16 flags=0 version=1
[  +0.026443] jobspec-update attributes.system.project="*"
[  +0.026503] validate
[  +0.046737] depend
[  +0.046781] memo fairshare=0.80496699999999999
[  +0.046789] priority priority=80497
[Oct15 19:20] exception type="cancel" severity=0 note="" userid=65930
[  +0.000078] clean

but an R is in the KVS:

# flux job info f2RWxG9jtekw R
{"version": 1, "execution": {"R_lite": [{"rank": "242", "children": {"core": "0-95", "gpu": "0-3"}}], "nodelist": ["tuolumne1242"], "properties": {"pall": "242", "pbatch": "242"}, "starttime": 1760581205, "expiration": 1760667605}}
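The same condition can be checked against the eventlog itself instead of `t_run`. A minimal sketch, assuming the raw eventlog text has already been fetched (e.g. from `flux job info <jobid> eventlog`) and is one JSON object per line with a `name` field and an optional `context`; the helper names `parse_eventlog` and `canceled_before_alloc` and the sample data are made up for illustration:

```python
import json


def parse_eventlog(raw):
    """Decode a raw eventlog (one JSON object per line) into a list of dicts."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def canceled_before_alloc(raw):
    """True if the job has a cancel exception and never posted an alloc event."""
    events = parse_eventlog(raw)
    names = [event["name"] for event in events]
    canceled = any(
        event["name"] == "exception"
        and event.get("context", {}).get("type") == "cancel"
        for event in events
    )
    return canceled and "alloc" not in names


# Hand-written sample mirroring the eventlog above (timestamps are made up).
SAMPLE = "\n".join(
    json.dumps(entry)
    for entry in [
        {"timestamp": 0.0, "name": "submit", "context": {"userid": 65930}},
        {"timestamp": 0.1, "name": "validate"},
        {"timestamp": 0.2, "name": "depend"},
        {"timestamp": 0.3, "name": "priority", "context": {"priority": 80497}},
        {"timestamp": 120.0, "name": "exception",
         "context": {"type": "cancel", "severity": 0}},
        {"timestamp": 120.1, "name": "clean"},
    ]
)

print(canceled_before_alloc(SAMPLE))
```

A job for which this returns true while the R lookup still succeeds is in the state described in this issue.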

My guess is that this occurs when a job is canceled after the request to commit R to the KVS has been made but before that request is fulfilled, which is likely the point at which the alloc event would be posted to the eventlog. This may be more likely on a very busy system.
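To make the guessed ordering concrete, here is a toy replay of it. This only models the guess above and does not reflect how flux-core is actually implemented; the step names (`commit_request`, `cancel`, `commit_done`) and the `replay` function are invented for illustration:

```python
def replay(steps):
    """Replay an ordered list of steps; return (R_in_kvs, eventlog names)."""
    kvs_has_R = False
    commit_pending = False
    canceled = False
    eventlog = ["submit", "validate", "depend", "priority"]

    for step in steps:
        if step == "commit_request":
            # Assumption: the request to write R is issued asynchronously.
            commit_pending = True
        elif step == "cancel":
            canceled = True
            eventlog.append("exception")
        elif step == "commit_done" and commit_pending:
            # Assumption: the write lands even if the job was canceled
            # in the meantime, but alloc is only posted for live jobs.
            commit_pending = False
            kvs_has_R = True
            if not canceled:
                eventlog.append("alloc")

    if canceled:
        eventlog.append("clean")
    return kvs_has_R, eventlog


normal = replay(["commit_request", "commit_done"])
raced = replay(["commit_request", "cancel", "commit_done"])
early = replay(["cancel"])

for label, (has_R, events) in [("normal", normal), ("raced", raced), ("early", early)]:
    print(label, has_R, events)
```

Under these assumptions only the `raced` ordering ends with R in the KVS and no alloc event, which matches the jobs found by the script.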

I'm not sure if there's anything to do here. The job-manager could delete any R in the KVS when this situation occurs, but I'm not sure it is worth the effort, and this may just be something to be aware of...
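If nothing changes in the job-manager, a consumer such as an accounting script could guard against this case by trusting R only when `t_run` is present. A minimal sketch, assuming `job` is a dict of job-list attributes like those in the script above and `R` is whatever the KVS lookup returned (or `None`); the function name `allocation_for` is hypothetical:

```python
def allocation_for(job, R):
    """Return R only if the job-list record shows the job actually ran.

    A job with an R but no t_run never posted an alloc event, so its R
    should not be counted as a real allocation.
    """
    if R is None or "t_run" not in job:
        return None
    return R


# Made-up records for illustration.
ran = {"id": 1, "t_submit": 10.0, "t_run": 12.0, "t_inactive": 20.0}
never_ran = {"id": 2, "t_submit": 10.0, "t_inactive": 11.0}
stale_R = {"version": 1}

print(allocation_for(ran, stale_R))
print(allocation_for(never_ran, stale_R))
```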

grondo, Oct 27 '25 17:10