pyiron_base Clean up the job table

What was discussed in today's pyiron meeting.

Rename hamilton -> job_type
Fix CPU_time -> currently it is only walltime
Remove chemical composition

Jan 16 '23 17:01 samwaseda

I have a more fundamental question about the job table: What is it made for? If it is all about checking the status of the jobs based on their job names, we would in principle need only the job name and the current status (and maybe when it was accepted).

Now, if we want to go beyond just the job name, maybe because we cannot remember the job names, or because we move towards hashing, then we need traits or context. A simple example of traits would be chemical composition, or types of defects included, and a simple example of context would be the name of the master job. Still, it wouldn't require the total CPU time for example, or the version number.

Finally, there's the possibility that we would like to view jobs based on certain characteristics, such as total CPU time. This, however, starts to overlap with the pyiron table, but it might still be useful to have it inside the job table for very practical reasons.

I opened this issue because "cleaning the job table" is a bit vague to me right now, for the reasons I stated above. Let me know what your ideas are.

Jan 16 '23 17:01 samwaseda

I have a more practical view of the jobtable - it should serve purely as a (somewhat) lean convenience function for grabbing job metadata:

In my own (custom) jobtable that I use in my work, I provide the following columns:

job_name
job status
queue_id on system
absolute path to file (currently broken into project_path, and whatever_path, very unwieldy, and unintuitive)
CPU time consumed
RAM utilised (not implemented, but desirable)
cores used
immediate parents of this job
Jobtype (I don't include this, but I can see how it would be useful)

The idea is to provide metadata on the job only, and nothing else. This serves as a gateway to troubleshooting jobs (i.e. copy-pasting the absolute path into the terminal), cancelling jobs without deleting them (scancel queue_id), etc. I almost never use the pyiron database id, so it could go away or stay, but I have a hard time imagining what its precise use case would be. I can also trace the "family tree" of a job purely by following the parent_id column (or at least should be able to; I haven't actually done this in practice).
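As a rough illustration of tracing a job's family tree through the parent column, here is a minimal sketch in plain Python. The rows, the key names (`id`, `job`, `parentid`) and the helper `trace_parents` are hypothetical and only mimic the fields named in this thread; this is not pyiron code.

```python
# Hypothetical rows; the keys mimic the id/job/parentid fields mentioned in this thread.
rows = [
    {"id": 1, "job": "relax", "parentid": None},
    {"id": 2, "job": "strain", "parentid": 1},
    {"id": 3, "job": "murnaghan", "parentid": 2},
    {"id": 4, "job": "phonons", "parentid": 2},
]


def trace_parents(rows, job_id):
    """Return the ids of all ancestors of job_id, nearest parent first."""
    parent_of = {row["id"]: row["parentid"] for row in rows}
    chain = []
    current = parent_of.get(job_id)
    # Stop at a root job (no parent) or if a cycle would revisit a job.
    while current is not None and current not in chain and current != job_id:
        chain.append(current)
        current = parent_of.get(current)
    return chain


print(trace_parents(rows, 3))  # [2, 1]
```

A root job simply returns an empty list, so the same helper works for jobs without parents.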

The time of submission and the time of conclusion seem somewhat unnecessary, but I guess you could keep them if people really wanted them. I have just never had a use for those columns, so as far as I'm concerned they can go away.

A very lean job_table that I would be happy with (and fits inside a jupyter output nicely) would just be the first four columns (job_name, job_status, absolute_path, queue_id). Anything extra is just a bonus.

Ideally, I can also specify:

pr.job_table(status = "unfinished")

to get all jobs with status ["non_converged", "aborted", "running", "collected"]

pr.job_table(status = "finished")

to get all jobs with status ["finished"]

pr.job_table(status = "submitted")

to get all jobs with status ["queued", "initialised"]

This is for practical reasons: filtering currently expects users to be familiar with the huge zoo of statuses that pyiron assigns to jobs. For example, as a new user, I would have missed jobs that have "non_converged" as a potential status, when I would intuitively just expect "aborted".
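The grouping proposed above could look roughly like the following sketch. The mapping `STATUS_GROUPS` and the helper `filter_by_status` are hypothetical names, and the rows are made-up data; the group contents are copied from the lists in this comment and are not an existing pyiron feature.

```python
# Hypothetical grouping, copied from the lists proposed above; not part of pyiron.
STATUS_GROUPS = {
    "unfinished": ["non_converged", "aborted", "running", "collected"],
    "finished": ["finished"],
    "submitted": ["queued", "initialised"],
}


def filter_by_status(rows, status):
    """Keep rows whose status belongs to a group name or equals a single raw status."""
    # Unknown names fall back to being treated as a raw status, e.g. "aborted".
    allowed = set(STATUS_GROUPS.get(status, [status]))
    return [row for row in rows if row["status"] in allowed]


# Made-up job rows for illustration only.
rows = [
    {"job": "a", "status": "finished"},
    {"job": "b", "status": "non_converged"},
    {"job": "c", "status": "aborted"},
    {"job": "d", "status": "queued"},
]

print([row["job"] for row in filter_by_status(rows, "unfinished")])  # ['b', 'c']
```

Falling back to the raw status keeps existing filters such as `status="aborted"` working alongside the coarser groups.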

Jan 16 '23 18:01 ligerzero-ai

For pyiron's internal workings, the fields

job, subjob, projectpath, project
id
status
parentid and masterid

are critical and need to be kept. Reorganizing the table beyond that has been discussed in detail with @tnecnivkcots, and we more or less settled on pulling chemicalformula into an auxiliary metadata table; *time* and computer into a runtime table; and parentid and masterid into a table that links jobs more generally and would allow multiple links as well. The output of job_table could then be assembled efficiently from these smaller tables with joins.
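To make the split-and-join idea concrete, here is a small sketch using Python's built-in sqlite3 module. All table names, column names and rows are assumptions chosen for illustration; the actual schema was not settled in this thread.

```python
import sqlite3

# In-memory database with a hypothetical split of the job table.
con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE jobs (id INTEGER PRIMARY KEY, job TEXT, status TEXT);
    CREATE TABLE metadata (job_id INTEGER REFERENCES jobs(id), chemicalformula TEXT);
    CREATE TABLE runtime (job_id INTEGER REFERENCES jobs(id), timestart TEXT, computer TEXT);
    CREATE TABLE links (job_id INTEGER REFERENCES jobs(id),
                        linked_id INTEGER REFERENCES jobs(id), kind TEXT);

    INSERT INTO jobs VALUES (1, 'relax', 'finished'), (2, 'murnaghan', 'running');
    INSERT INTO metadata VALUES (1, 'Fe2'), (2, 'Fe2');
    INSERT INTO runtime VALUES (1, '2023-01-16 09:00', 'cluster');
    INSERT INTO links VALUES (2, 1, 'parent');
    """
)

# Assemble a job_table-like view; LEFT JOINs keep jobs that lack auxiliary rows.
query = """
    SELECT j.id, j.job, j.status, m.chemicalformula, r.timestart,
           l.linked_id AS parentid
    FROM jobs AS j
    LEFT JOIN metadata AS m ON m.job_id = j.id
    LEFT JOIN runtime AS r ON r.job_id = j.id
    LEFT JOIN links AS l ON l.job_id = j.id AND l.kind = 'parent'
    ORDER BY j.id
"""
job_table = con.execute(query).fetchall()
for row in job_table:
    print(row)
```

Because the links table carries a `kind` column in this sketch, a job could have several links (parent, master, or others) without widening the core table.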

From my own work, I find id, hamilton (should be renamed as discussed) and timestart/timestop extremely useful for filtering and selecting jobs for various manual tasks. For example, when I realize I made a mistake yesterday, I can quickly re-run only those jobs; or when I need a single job from one piece of work somewhere else, I can quickly copy it by id.
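The kind of selection described here, picking out the jobs of one type that were started in a given time window, might be sketched as follows. The rows and the helper `started_between` are hypothetical; only the field names `id`, `hamilton` and `timestart` come from this thread.

```python
from datetime import datetime

# Made-up rows using the id/hamilton/timestart fields named above.
rows = [
    {"id": 10, "hamilton": "Lammps", "timestart": datetime(2023, 1, 15, 14, 0)},
    {"id": 11, "hamilton": "Lammps", "timestart": datetime(2023, 1, 16, 9, 30)},
    {"id": 12, "hamilton": "Vasp", "timestart": datetime(2023, 1, 16, 11, 0)},
]


def started_between(rows, start, stop, hamilton=None):
    """Return ids of jobs started in [start, stop), optionally of one job type."""
    return [
        row["id"]
        for row in rows
        if start <= row["timestart"] < stop
        and (hamilton is None or row["hamilton"] == hamilton)
    ]


# All Lammps jobs started on 16 January, e.g. to re-run them after a mistake.
ids = started_between(
    rows, datetime(2023, 1, 16), datetime(2023, 1, 17), hamilton="Lammps"
)
print(ids)  # [11]
```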

For the jupyter output, job_table lets you select which columns to show with the columns argument. I think it's actually a good idea to make that configurable via the config file to suit different needs.

Jan 17 '23 09:01 pmrv

Ah, and grouping statuses into larger categories to make filtering easier is a good idea as well, but we would need to think about which groups would be useful and where to define them so that it's not too ad hoc.

Jan 17 '23 09:01 pmrv

Following on from this, I think it might also be good to introduce a timecreated field or similar, as currently timestart is set twice: once on saving and once on entering running, iirc.

Jan 17 '23 10:01 pmrv
