pyiron_base
Clean up the job table
Points discussed in today's pyiron meeting:
- Rename hamilton -> job_type
- Fix CPU_time -> currently it holds only the walltime
- Remove chemical composition
I have a more fundamental question about the job table: What is it made for? If it is all about checking the status of the jobs based on their job names, we would in principle need only the job name and the current status (and maybe when it was accepted).
Now, if we want to go beyond just the job name, maybe because we cannot remember the job names, or because we move towards hashing, then we need traits or context. A simple example of traits would be chemical composition, or types of defects included, and a simple example of context would be the name of the master job. Still, it wouldn't require the total CPU time for example, or the version number.
Finally, there's the possibility that we would like to select jobs based on certain characteristics, such as total CPU time. This, however, starts to overlap with pyiron table, but it might still be useful to have it inside the job table for very practical reasons.
I opened this issue because "cleaning the job table" is a bit vague to me right now, for the reasons I stated above. Let me know what your ideas are.
I have a more practical view of the jobtable: it should serve purely as a (somewhat) lean convenience function for grabbing job metadata.
In my own (custom) jobtable that I use in my work, I provide the following columns:
- job_name
- job status
- queue_id on system
- absolute path to file (currently broken into project_path, and whatever_path, very unwieldy, and unintuitive)
- CPU time consumed
- RAM utilised (not implemented, but desirable)
- cores used
- immediate parents of this job
- Jobtype (I don't include this, but I can see how it would be useful)
The idea is to provide metadata on the job only, and nothing else. This serves as a gateway to troubleshooting jobs (e.g. copy-pasting the absolute path into the terminal), cancelling jobs without deleting them (scancel queue_id), etc. To me, the pyiron database id is almost never used, so it could go away or stay, but I have a hard time imagining what the precise use case for pyiron database ids would be. I can also trace the "family tree" of a job purely by following the parent_id column (or at least I should be able to; I haven't actually done this in practice).
The time of submission and the time of conclusion seem somewhat unnecessary, but I guess you could keep them if people really wanted them. I have just never had a use for those columns, so as far as I am concerned they can go away.
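As a rough illustration of tracing a job's "family tree" through parent_id, here is a minimal sketch. It does not use pyiron's API: the `jobs` dictionary is a made-up stand-in for job table rows, and `trace_ancestors` is a hypothetical helper name.

```python
# Hypothetical stand-in for job table rows, keyed by job id.
# The job names and ids are invented for illustration.
jobs = {
    1: {"job_name": "structure", "parent_id": None},
    2: {"job_name": "relax", "parent_id": 1},
    3: {"job_name": "md_run", "parent_id": 2},
}


def trace_ancestors(job_id, jobs):
    """Return the ids of all ancestors of job_id, nearest parent first."""
    ancestors = []
    seen = {job_id}  # guards against accidental cycles in the data
    current = jobs[job_id]["parent_id"]
    while current is not None and current not in seen:
        ancestors.append(current)
        seen.add(current)
        parent_row = jobs.get(current)  # parent may be missing from the table
        current = parent_row["parent_id"] if parent_row else None
    return ancestors


lineage = trace_ancestors(3, jobs)
```

Here `lineage` lists the parent of job 3 followed by its grandparent, which is the kind of trace the parent_id column should allow.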
A very lean job_table that I would be happy with (and fits inside a jupyter output nicely) would just be the first four columns (job_name, job_status, absolute_path, queue_id). Anything extra is just a bonus.
Ideally, I can also specify:
pr.job_table(status = "unfinished")
to get all jobs with status ["non_converged", "aborted", "running", "collected"]
pr.job_table(status = "finished")
to get all jobs with status ["finished"]
pr.job_table(status = "submitted")
to get all jobs with status ["queued", "initialised"]
This is for practical reasons: currently, filtering expects users to be familiar with the huge zoo of statuses that pyiron assigns to jobs. For example, as a new user, I would have missed jobs that have "non_converged" as a potential status, when I would intuitively just expect "aborted".
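The grouping proposed above could look roughly like the following sketch. It is not pyiron code: `STATUS_GROUPS` and `filter_by_status` are hypothetical names, the rows are invented, and the status strings are copied from the lists above, so they may not match pyiron's actual status names exactly.

```python
# Hypothetical mapping from a coarse group name to the statuses it covers,
# taken verbatim from the proposal above.
STATUS_GROUPS = {
    "unfinished": ["non_converged", "aborted", "running", "collected"],
    "finished": ["finished"],
    "submitted": ["queued", "initialised"],
}


def filter_by_status(rows, status):
    """Keep rows whose status belongs to a group, or equals a literal status."""
    # Unknown names fall back to an exact match, so fine-grained filters still work.
    allowed = set(STATUS_GROUPS.get(status, [status]))
    return [row for row in rows if row["status"] in allowed]


# Invented example rows standing in for the job table.
rows = [
    {"job_name": "a", "status": "finished"},
    {"job_name": "b", "status": "non_converged"},
    {"job_name": "c", "status": "aborted"},
    {"job_name": "d", "status": "queued"},
]

unfinished = [row["job_name"] for row in filter_by_status(rows, "unfinished")]
```

With this, a new user asking for "unfinished" jobs would get both the "non_converged" and the "aborted" job without having to know either status name.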
For pyiron's internal workings, the fields
- job, subjob, projectpath, project
- id
- status
- parentid and masterid
are critical and need to be kept. Reorganizing the table beyond that has been discussed in detail with @tnecnivkcots, and we more or less settled on pulling chemicalformula into an auxiliary metadata table, *time* and computer into a runtime table, and parentid and masterid into a table which links jobs more generally and would allow multiple links as well. From these smaller tables, the output of job_table could be assembled efficiently with joins.
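To make the split concrete, here is a minimal sketch using an in-memory SQLite database. The table names (`jobs`, `metadata`, `runtime`, `links`), the `job_id`, `linked_id` and `kind` columns, and all row values are assumptions for illustration only; this is not pyiron's actual schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE jobs (id INTEGER PRIMARY KEY, job TEXT, subjob TEXT,
                       projectpath TEXT, project TEXT, status TEXT);
    CREATE TABLE metadata (job_id INTEGER REFERENCES jobs(id), chemicalformula TEXT);
    CREATE TABLE runtime (job_id INTEGER REFERENCES jobs(id), timestart TEXT,
                          timestop TEXT, computer TEXT);
    -- One row per link, so a job can have several parents or masters.
    CREATE TABLE links (job_id INTEGER REFERENCES jobs(id),
                        linked_id INTEGER REFERENCES jobs(id), kind TEXT);
    """
)
# Invented sample rows.
con.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "relax", "/relax", "/home/user/", "demo/", "finished"),
        (2, "md_run", "/md_run", "/home/user/", "demo/", "running"),
    ],
)
con.execute("INSERT INTO metadata VALUES (1, 'Fe2')")
con.execute(
    "INSERT INTO runtime VALUES (1, '2021-01-01 10:00:00', '2021-01-01 10:05:00', 'cluster')"
)
con.execute("INSERT INTO links VALUES (2, 1, 'parent')")

# Assemble a job_table-like view; LEFT JOIN keeps jobs without auxiliary rows.
rows = con.execute(
    """
    SELECT j.id, j.job, j.status, m.chemicalformula, r.computer,
           l.linked_id AS parentid
    FROM jobs AS j
    LEFT JOIN metadata AS m ON m.job_id = j.id
    LEFT JOIN runtime AS r ON r.job_id = j.id
    LEFT JOIN links AS l ON l.job_id = j.id AND l.kind = 'parent'
    ORDER BY j.id
    """
).fetchall()
```

Note that with several links of the same kind per job, this query would return one row per link, so a real implementation would have to decide whether to aggregate them.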
From my own work, I find id, hamilton (which should be renamed as discussed) and timestart/timestop extremely useful to filter and select jobs for various manual tasks. For example, when I realize I made a mistake yesterday, I can quickly re-run only the affected jobs, or when I need a single job from one piece of work somewhere else, I can quickly copy it by id.
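The "re-run yesterday's jobs" case amounts to a filter on hamilton and a timestart window. A minimal sketch, assuming plain dictionaries in place of job table rows; `select_job_ids`, the job type names and the dates are all illustrative, not pyiron's API:

```python
from datetime import datetime

# Invented rows standing in for the job table.
rows = [
    {"id": 10, "hamilton": "Lammps", "timestart": datetime(2021, 3, 1, 9, 0)},
    {"id": 11, "hamilton": "Lammps", "timestart": datetime(2021, 3, 2, 14, 30)},
    {"id": 12, "hamilton": "Vasp", "timestart": datetime(2021, 3, 2, 16, 0)},
]


def select_job_ids(rows, hamilton, start, stop):
    """Return ids of jobs of one type started in the half-open window [start, stop)."""
    return [
        row["id"]
        for row in rows
        if row["hamilton"] == hamilton and start <= row["timestart"] < stop
    ]


# All Lammps jobs started on 2 March 2021, e.g. the ones affected by a mistake.
ids_to_rerun = select_job_ids(
    rows, "Lammps", datetime(2021, 3, 2), datetime(2021, 3, 3)
)
```

The resulting ids can then be used to load, re-run or copy the individual jobs.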
For the jupyter output, job_table allows selecting which columns to show via the columns argument. I think it's actually a good idea to make the default configurable via the config file to suit different needs.
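One way such a configurable default might work is sketched below. The option name `JOB_TABLE_COLUMNS`, the fallback list and both helper functions are hypothetical; pyiron's real configuration handling is not shown here, and the config text is passed in as a string to keep the example self-contained.

```python
import configparser

# Hypothetical config content; the option name is an assumption.
CONFIG_TEXT = """
[DEFAULT]
JOB_TABLE_COLUMNS = job, status, projectpath, id
"""

FALLBACK_COLUMNS = ["id", "job", "status"]


def default_columns(config_text):
    """Read the default column list from the config, or use the fallback."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    raw = parser["DEFAULT"].get("JOB_TABLE_COLUMNS", "")
    columns = [name.strip() for name in raw.split(",") if name.strip()]
    return columns or FALLBACK_COLUMNS


def select_columns(rows, columns=None, config_text=CONFIG_TEXT):
    """Project rows onto the requested columns; an explicit argument wins."""
    columns = columns or default_columns(config_text)
    return [{name: row.get(name) for name in columns} for row in rows]


# Invented row standing in for one job table entry.
row = {"id": 1, "job": "relax", "status": "finished",
       "projectpath": "/home/user/", "hamilton": "Lammps"}
configured_view = select_columns([row])
explicit_view = select_columns([row], columns=["job"])
```

An explicit columns argument still overrides the configured default, so existing calls would keep working.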
Ah, and grouping statuses into larger categories to filter more easily is a good idea as well, but we would need to think about which groups would be useful and where to define them, so that it's not too ad hoc.
From here, I think it might also be good to introduce a timecreated or so, as currently timestart is set twice, once on saving and once on entering running, iirc.