signac-flow

Mapping of scheduler jobs to project

Open csadorf opened this issue 8 years ago • 4 comments

Original report by Carl Simon Adorf (Bitbucket: csadorf, GitHub: csadorf).


Problem

The FlowProject currently provides the scheduler_jobs() and map_scheduler_jobs() methods. These can be used to identify scheduler jobs that belong to the project within the current environment, but they are still awkward to use. For example, it should be simple to iterate over all scheduler jobs associated with the current project, e.g. to change their status.

Current Solution

This is the code currently required to do so:

#!python
import flow

project = flow.FlowProject()
env = flow.get_environment()

sjobs = project.scheduler_jobs(env.scheduler_type())
sjobs_map = project.map_scheduler_jobs(sjobs)

for job in project:
    # Use a distinct loop variable so the sjobs list above is not shadowed.
    for job_sjobs in sjobs_map[job.get_id()].values():
        for sjob in job_sjobs:
            pass  # do something with sjob

This rather convoluted approach ensures that the environment scheduler is queried only once, rather than repeatedly (for example, once per job).

Proposed Enhancement

I propose protecting the environment scheduler resource with the following API:

#!python
import flow

project = flow.FlowProject()
env = flow.get_environment()

result = project.query_scheduler(env)
for job in project:
    for op_name, sjob in result(job):
        pass  # do something with sjob
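
To make the proposal concrete, here is a minimal sketch of how such a result object could behave. Everything in it is an assumption for illustration: `SchedulerQueryResult`, `CountingScheduler`, and the free function `query_scheduler` are hypothetical names, the scheduler is a stand-in that only counts queries, and jobs are looked up by id string rather than by job object.

```python
from collections import defaultdict


class SchedulerQueryResult:
    """Snapshot of scheduler jobs grouped by job id (hypothetical class)."""

    def __init__(self, records):
        # records: iterable of (job_id, op_name, sjob) tuples that came
        # from a single scheduler query.
        self._by_job = defaultdict(list)
        for job_id, op_name, sjob in records:
            self._by_job[job_id].append((op_name, sjob))

    def __call__(self, job_id):
        # Yield (op_name, sjob) pairs for one job without re-querying.
        yield from self._by_job.get(job_id, [])


class CountingScheduler:
    """Stand-in scheduler that counts how often it is queried."""

    def __init__(self, records):
        self._records = records
        self.num_queries = 0

    def jobs(self):
        self.num_queries += 1
        return list(self._records)


def query_scheduler(scheduler):
    # The only scheduler query happens here.
    return SchedulerQueryResult(scheduler.jobs())


scheduler = CountingScheduler([
    ("6d6df7ab", "run", "slurm-1"),
    ("6d6df7ab", "analyze", "slurm-2"),
    ("c8359756", "run", "slurm-3"),
])
result = query_scheduler(scheduler)
for job_id in ("6d6df7ab", "c8359756", "53e44e09"):
    for op_name, sjob in result(job_id):
        print(job_id, op_name, sjob)
```

The point of the design is that the scheduler is contacted once, when the result is created, and every later per-job lookup is served from the cached snapshot.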

csadorf, Mar 07 '17 15:03

This issue, or #146, would solve a problem raised by @ramanishsingh and @rsdefever at the @mosdef-hub all-hands meeting. They want to be able to put log files generated by PBS/SLURM into the corresponding job directory. Of course there isn't a one-to-one mapping between scheduler jobs and signac jobs, but we could probably find a way to do better than the current behavior.

bdice, Mar 04 '20 16:03

A related use is tracking scheduler job IDs to track down errors more easily. It might be a separate issue. There are two steps in this translation: scheduler ID --> "flow submission ID" --> job ID

My current solution involves:

(1) using a custom template that emails me the status, so I get an email with a subject like:

SLURM Job_id=37428317 Name=project_name/6d6df7ab/run/0000/22da1a8a1dc67ca8783a9d3d9db5c598 Began, Queued time 00:00:23

I get one of these when the job queues, completes, or fails. Here 22da1a... is the "bundle ID" and 6d6d... is the job.id. I use this to associate the scheduler ID with flow's submission ID.

In this case, flow prints:

 - Group: run(6d6df7ab68f4591d5e1a05065683e78b)
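
The lookup in step (1) could be automated by parsing the email subject. The following is only a sketch: `parse_subject` and `SubmissionRecord` are hypothetical names, and the field positions in the job name (job id prefix second, bundle ID last) are assumed from the single example above. The meaning of the other fields is not assumed.

```python
import re
from typing import NamedTuple, Optional

# Matches subjects like:
#   SLURM Job_id=<digits> Name=<job name> <state>, ...
SUBJECT_PATTERN = re.compile(
    r"SLURM Job_id=(?P<scheduler_id>\d+) "
    r"Name=(?P<name>\S+) "
    r"(?P<state>\w+)"
)


class SubmissionRecord(NamedTuple):
    scheduler_id: str
    job_id_prefix: str
    bundle_id: str
    state: str


def parse_subject(subject: str) -> Optional[SubmissionRecord]:
    match = SUBJECT_PATTERN.search(subject)
    if match is None:
        return None
    parts = match["name"].split("/")
    if len(parts) != 5:
        # Layout differs from the one example we have; do not guess.
        return None
    # Assumed layout: parts[1] is the job id prefix, parts[4] the bundle ID.
    return SubmissionRecord(
        match["scheduler_id"], parts[1], parts[4], match["state"]
    )


subject = (
    "SLURM Job_id=37428317 "
    "Name=project_name/6d6df7ab/run/0000/22da1a8a1dc67ca8783a9d3d9db5c598 "
    "Began, Queued time 00:00:23"
)
record = parse_subject(subject)
print(record)
```

The job id prefix can then be matched against the full id that flow prints in the `Group:` line, for example with `str.startswith`.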

(2) saving what flow prints out when I submit jobs (currently by copy-paste, though I know I could dump the output to a file). This is more of a problem when bundling. Say I submit a bundle of 100 jobs. While submitting, flow prints out

 - Group: run(c835975646cd37e561b6cbf8e7d2facd)
 - Group: run(6cf6a57ce13eb422a2306bc40142a49c)
 - Group: run(53e44e09a38c4d818afaf9221cb57d69)
   [truncated]

This is the only way I know to associate the submission ID (now the ID of the bundle) with the job.id. I look up the submission ID and find jobs in this list or find the job.id and go to its parent submission ID.
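
Instead of copy-pasting, the saved submission output could be turned into a lookup table. This is a rough sketch under assumptions: `parse_submission_output` is a hypothetical helper, the line format is taken only from the excerpt above, and `example-bundle-id` is a placeholder, not a real bundle ID.

```python
import re

# Matches lines such as " - Group: run(<32 hex characters>)".
GROUP_LINE = re.compile(
    r"^\s*- Group: (?P<group>\w+)\((?P<job_id>[0-9a-f]{32})\)\s*$"
)


def parse_submission_output(text):
    """Return (group_name, job_id) pairs found in flow's submission output."""
    pairs = []
    for line in text.splitlines():
        match = GROUP_LINE.match(line)
        if match:
            pairs.append((match["group"], match["job_id"]))
    return pairs


output = """\
 - Group: run(c835975646cd37e561b6cbf8e7d2facd)
 - Group: run(6cf6a57ce13eb422a2306bc40142a49c)
 - Group: run(53e44e09a38c4d818afaf9221cb57d69)
"""

bundle_id = "example-bundle-id"  # placeholder; take the real one from the job name
pairs = parse_submission_output(output)

# Index in both directions: job id -> bundle and bundle -> job ids.
bundle_of_job = {job_id: bundle_id for _, job_id in pairs}
jobs_of_bundle = {bundle_id: [job_id for _, job_id in pairs]}
print(bundle_of_job["6cf6a57ce13eb422a2306bc40142a49c"])
```

Lines that do not match the pattern (such as the truncation marker) are skipped, so the parser can be fed the raw terminal output.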

A possible solution: by default, set the scheduler job name to the full job id. This could make it easier to search scheduler submissions, e.g. --job-name="project_name/6d6df7ab/run_big/0000/22da1a8a1dc67ca8783a9d3d9db5c598"

cbkerr, Dec 18 '20 16:12

Do schedulers return the job ID? For SLURM, officially no: see https://slurm.schedmd.com/sbatch.html (find the "RETURN VALUE" section). We can definitely get the job ID from squeue after the job is created, but I don't know whether that number is assigned right away.

However (!), I found references to scripts that return the ID when submitting:

  1. https://ubccr.freshdesk.com/support/solutions/articles/5000688140-submitting-a-slurm-job-script (search for "step 4" to see output and also look at the scripts in step 1--3)
  2. https://kb.iu.edu/d/awrz (search for "submit your job script")

On Torque, I think the answer is yes from discussion below.
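
For SLURM, the scripts linked above rely on the text that sbatch writes to stdout rather than on its return value: on success it prints a line of the form "Submitted batch job <id>", and with the --parsable option it prints just the id, optionally followed by a semicolon and the cluster name. A minimal sketch of extracting the id from that text follows; `parse_sbatch_stdout` is a hypothetical helper and is not part of flow.

```python
import re
from typing import Optional


def parse_sbatch_stdout(stdout: str) -> Optional[str]:
    """Extract the job ID from sbatch's stdout, or None if none is found."""
    # Default form: "Submitted batch job 37428317"
    match = re.search(r"Submitted batch job (\d+)", stdout)
    if match:
        return match.group(1)
    # --parsable form: "37428317" or "37428317;clustername"
    match = re.fullmatch(r"(\d+)(?:;\S+)?", stdout.strip())
    return match.group(1) if match else None


print(parse_sbatch_stdout("Submitted batch job 37428317\n"))
```

In practice the input string would come from capturing the stdout of the submission command, e.g. via `subprocess.run(..., capture_output=True, text=True)`.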

cbkerr, Dec 18 '20 16:12

Some notes from discussion with @csadorf and @bdice yesterday afternoon:

  • Not all schedulers return a scheduler ID; some report only a "success" status
  • I learned that we store bundle information in project_root/.bundles/project_name/bundle/[bundle_IDs] (which is what gets printed out when you submit a bundle)
  • Relevant function FlowProject._expand_bundled_jobs()
  • Relevant function FlowProject._fetch_scheduler_status()
  • The "flow submission ID" is deterministically created
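
The last point matters because a deterministic ID can be recomputed later from the same inputs, so no extra bookkeeping is needed to find a submission again. The sketch below only illustrates that property. It is not flow's actual algorithm: the function name, the choice of inputs, and the hash are all assumptions.

```python
import hashlib


def example_submission_id(project_name, job_id, operation_name):
    """Illustrative deterministic ID; NOT how flow computes its IDs."""
    key = "/".join((project_name, job_id, operation_name))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


first = example_submission_id(
    "project_name", "6d6df7ab68f4591d5e1a05065683e78b", "run"
)
second = example_submission_id(
    "project_name", "6d6df7ab68f4591d5e1a05065683e78b", "run"
)
other = example_submission_id(
    "project_name", "c835975646cd37e561b6cbf8e7d2facd", "run"
)
print(first == second, first == other)
```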

cbkerr, Dec 18 '20 16:12