submitit
Access some information about the job when reloading it
Hi!
Would it be possible to have access to some information about a job when reloading a Job with its job_id?
My use case is the following: I launched a lot of jobs, and I want to plot some metrics I logged. Most of the time, I only care about the jobs I just launched or the ones I launched the day before, so I need to filter my jobs by launch time. If I understand correctly, this is not currently possible.
Other information might be useful too, for instance knowing whether a job has been preempted, since preemption is a common source of bugs.
I tag @jrapin here because I talked with him about this feature.
The way I see it, we can easily get the start time and the preemption times from the logs. Submission time is harder: either we append it to the log manually, or we add it to the DelayedFunction object (although accessing it would require loading the pickle, which may be heavy, and it would not be preemption-proof, so I'm not sure). Also, I have no clear idea of an API for that. Any thoughts, @gwenzek?
If we are talking about SLURM, then sacct already knows all the information we want (and more) about the job: start time, end time, CPU utilization, disk reads and writes, ...
Maybe we could add a Python API to expose this, but that may be over-engineering and would be pretty Slurm-specific.
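To make the sacct idea concrete, here is a minimal sketch of what querying it from Python could look like. It is not part of submitit: the function names are made up for illustration, the field list is just one possible choice of standard sacct fields, and the query only works on a Slurm cluster with accounting enabled.

```python
import subprocess
from typing import Dict, List, Sequence

# One possible choice of standard sacct fields (an assumption, adjust as needed).
FIELDS = ("JobID", "Submit", "Start", "End", "State")


def parse_sacct(output: str, fields: Sequence[str]) -> List[Dict[str, str]]:
    """Parse `sacct --parsable2 --noheader` output into one dict per row."""
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # --parsable2 separates columns with "|" and has no trailing delimiter
        rows.append(dict(zip(fields, line.split("|"))))
    return rows


def query_sacct(job_id: str, fields: Sequence[str] = FIELDS) -> List[Dict[str, str]]:
    """Run sacct for one job (hypothetical helper, needs a Slurm cluster)."""
    cmd = ["sacct", "-j", job_id, "--format=" + ",".join(fields), "--parsable2", "--noheader"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_sacct(result.stdout, fields)
```

Note that sacct usually returns several rows per job (the job itself plus steps such as `<job_id>.batch`), so a caller would still have to pick the row it cares about.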
@leonardblier what are you doing with your jobs? And how do you find the list of past jobs? To get the time, you can just look at the timestamp of job.paths.submission_file.
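As a rough sketch of the timestamp approach, the helper below keeps only the files modified within a given number of hours. The helper name and its parameters are hypothetical, and it uses the file's modification time as a proxy for submission time, which assumes the file is not rewritten later.

```python
import time
from pathlib import Path
from typing import Iterable, List


def recent_files(paths: Iterable[Path], max_age_hours: float) -> List[Path]:
    """Keep the paths whose last modification is at most max_age_hours old."""
    cutoff = time.time() - max_age_hours * 3600
    # Missing files are skipped rather than raising.
    return [p for p in paths if p.exists() and p.stat().st_mtime >= cutoff]
```

With reloaded jobs, one could pass in `job.paths.submission_file` for each job and keep the jobs whose file survives the filter, for example with `max_age_hours=48` to cover yesterday's launches.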
I would be careful to avoid extra calls to the cluster, unless everything goes through the watcher.
Adding a Python API would be as easy as adding a self.sacct_fields attribute to the SlurmInfoWatcher and using it instead of the hard-coded "JobID,State,NodeList" field list.
Then one could modify the list of fields through job.watcher.sacct_fields.extend(["TresUsageInMax", "TresUsageInAve"]) and read it through job.get_info()["TresUsageInMax"].
See the following commit that added NodeList: https://github.com/facebookincubator/submitit/pull/1615/commits/19b3487384b333c0653566db6ebd3da9d9af65ec#diff-1d3775b96c8f577427238099cf12f582b460fac12b9c2bb7c7f66abdceb6db49R50
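For illustration only, the proposal could look roughly like the sketch below. The class is a hypothetical stand-in, not submitit's actual SlurmInfoWatcher, and the method names are assumptions. It only shows how a mutable field list would feed both the sacct command and the parsed result.

```python
from typing import Dict, List


class SacctFieldsSketch:
    """Hypothetical stand-in for a watcher with a configurable field list."""

    def __init__(self) -> None:
        # Default fields, matching the "JobID,State,NodeList" string quoted above.
        self.sacct_fields: List[str] = ["JobID", "State", "NodeList"]

    def make_command(self, job_ids: List[str]) -> List[str]:
        """Build one sacct call covering all the given job ids."""
        return ["sacct", "-o", ",".join(self.sacct_fields), "--parsable2", "-j", ",".join(job_ids)]

    def read_info(self, output: str) -> Dict[str, Dict[str, str]]:
        """Parse --parsable2 output (header row first) into {job_id: {field: value}}."""
        lines = [line for line in output.splitlines() if line.strip()]
        if not lines:
            return {}
        header = lines[0].split("|")
        info: Dict[str, Dict[str, str]] = {}
        for line in lines[1:]:
            row = dict(zip(header, line.split("|")))
            info[row["JobID"]] = row
        return info
```

Because the extra fields ride along on the single command the watcher already issues, this design would not add calls to the cluster, which addresses the concern raised above.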