submitit

Access some information about the job when reloading it

Open leonardblier opened this issue 4 years ago • 4 comments

Hi!

Would it be possible to have access to some information about a job when reloading a Job with its job_id?

My use case is the following: I launched a lot of jobs, and I want to plot some metrics I logged. Most of the time, I only care about the jobs I just launched, or the jobs I launched the day before. Therefore, I would need to filter my jobs according to their launching time. If I'm correct, this is not currently possible.

Other information might be interesting, for instance knowing whether a job has been preempted, since preemption is a common source of bugs.

I tag @jrapin here because I talked with him about this feature.

leonardblier • Nov 13 '20 12:11

The way I see it, we can easily get the start time and the preemption times from the logs. Submission time is harder: either we append it to the log manually, or we add it to the DelayedFunction object (although accessing it would require loading the pickle, which may be heavy, and it would not be preemption-proof, so I'm not sure). Also, I have no clear idea of an API for that. Any thoughts, @gwenzek?

jrapin • Nov 16 '20 10:11

If we are talking about SLURM, then sacct already knows all the information we want (and more) about the job: start time, end time, CPU utilization, disk reads and writes, ... Maybe we could add a Python API to expose this. But that may be over-engineering, and it would be pretty SLURM-specific.

@leonardblier what are you doing with your jobs? And how do you find the list of past jobs? To get the launch time, you can just look at the timestamp of job.paths.submission_file.
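A minimal sketch of that timestamp idea, assuming the file's modification time is an acceptable stand-in for the launch time (it will be wrong if the folder was copied or the file rewritten). The helper name launched_within and its parameters are hypothetical, not part of submitit:

```python
import time
from pathlib import Path
from typing import Dict, Optional, Union


def launched_within(
    submission_files: Dict[str, Union[str, Path]],
    max_age_s: float,
    now: Optional[float] = None,
) -> Dict[str, float]:
    """Keep only the jobs whose submission file was modified recently.

    submission_files maps a job id to the path of its submission file.
    Returns {job_id: modification time in seconds since the epoch}.
    """
    now = time.time() if now is None else now
    recent: Dict[str, float] = {}
    for job_id, path in submission_files.items():
        # Assumption: the file's mtime approximates the submission time.
        mtime = Path(path).stat().st_mtime
        if now - mtime <= max_age_s:
            recent[job_id] = mtime
    return recent
```

With a list of reloaded jobs, this could be called as launched_within({job.job_id: job.paths.submission_file for job in jobs}, max_age_s=24 * 3600) to keep the jobs from the last day. It only touches the local filesystem, so it makes no extra calls to the cluster.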

gwenzek • Nov 16 '20 11:11

I would be careful to avoid extra calls to the cluster, unless everything goes through the watcher.

jrapin • Nov 16 '20 11:11

Adding a Python API would be as easy as reading a self.sacct_fields attribute in the SlurmInfoWatcher and using it here instead of "JobID,State,NodeList". Then one could modify the list of fields through job.watcher.sacct_fields.extend(["TresUsageInMax", "TresUsageInAve"]) and read them through job.get_info()["TresUsageInMax"].

See the following commit that added NodeList: https://github.com/facebookincubator/submitit/pull/1615/commits/19b3487384b333c0653566db6ebd3da9d9af65ec#diff-1d3775b96c8f577427238099cf12f582b460fac12b9c2bb7c7f66abdceb6db49R50
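To make the idea concrete, here is a standalone sketch of a configurable field list, assuming sacct is on the PATH and that JobID is kept as the first field. It is not submitit's code: build_sacct_command, parse_sacct_output and query_job are hypothetical helpers, and the real SlurmInfoWatcher is organized differently.

```python
import subprocess
from typing import Dict, List

# The three fields mentioned above; extra ones get appended by the caller.
DEFAULT_FIELDS = ["JobID", "State", "NodeList"]


def build_sacct_command(job_id: str, fields: List[str]) -> List[str]:
    """Build a sacct call that prints one '|'-separated row per job step."""
    return [
        "sacct",
        "-j", job_id,
        "--format=" + ",".join(fields),
        "--parsable2",  # '|' delimiter, no trailing delimiter
        "--noheader",   # data rows only
    ]


def parse_sacct_output(output: str, fields: List[str]) -> Dict[str, Dict[str, str]]:
    """Map the first column (assumed to be JobID) to a {field: value} dict."""
    info: Dict[str, Dict[str, str]] = {}
    for line in output.splitlines():
        values = line.split("|")
        if len(values) != len(fields):
            continue  # skip blank or malformed rows
        info[values[0]] = dict(zip(fields, values))
    return info


def query_job(job_id: str, fields: List[str]) -> Dict[str, Dict[str, str]]:
    """Run sacct once for the given job and parse the result."""
    result = subprocess.run(
        build_sacct_command(job_id, fields),
        check=True, capture_output=True, text=True,
    )
    return parse_sacct_output(result.stdout, fields)
```

For example, query_job("1234", DEFAULT_FIELDS + ["Submit", "Start", "End"]) would ask for the submission, start and end times in the same call. Note that this issues one sacct call per job, which is exactly the kind of extra cluster traffic that routing everything through the watcher is meant to avoid.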

gwenzek • Mar 18 '21 11:03