gridmap Queue jobs in internal queue instead of dumping all jobs on cluster at once

Open sandeepklr opened this issue 9 years ago • 3 comments

Hi Dan,

I have modified the code to add an internal queue that caps the number of jobs running on the cluster at any one time.

Please find below a summary of the changes:

- Use the max_processes parameter as the maximum number of cluster jobs to run at once.
- Create a Session object before starting the JobMonitor and embed the Session object in the job monitor.
  - Everywhere that used a session_id now uses the Session object embedded in the JobMonitor.
- The function _submit_jobs() is no longer used. All jobs are submitted from the JobMonitor using _append_job_to_session().
- check_alive() has been refactored into two functions, check_alive() and check_job_status():
  - check_alive() is still called every time the local heartbeat is received.
  - check_alive() goes through the queue and removes jobs that have either finished or hit the maximum number of resubmits after errors. Depending on the number of empty slots, new jobs are then started.
- all_jobs_done() is now simplified to just check that all jobs have been processed on the cluster.
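
The queueing behaviour summarised above could look roughly like the sketch below. This is not the actual patch and does not use gridmap's or DRMAA's API: the class BoundedJobQueue, the submit and get_status callables, the status strings and the max_resubmits default are all hypothetical stand-ins, chosen only to illustrate how a heartbeat-driven check_alive() can free slots and refill them up to max_processes.

```python
from collections import deque

# Hypothetical status values; the real code would use its own constants.
RUNNING, DONE, FAILED = "running", "done", "failed"


class BoundedJobQueue:
    """Keep at most max_processes jobs on the cluster at any time."""

    def __init__(self, jobs, max_processes, submit, get_status, max_resubmits=3):
        self.pending = deque(jobs)      # jobs not yet submitted (must be hashable)
        self.running = {}               # job -> number of resubmits so far
        self.processed = []             # jobs that finished or were given up on
        self.total = len(self.pending)
        self.max_processes = max_processes
        self.submit = submit            # assumed callable(job) that sends a job to the cluster
        self.get_status = get_status    # assumed callable(job) -> RUNNING, DONE or FAILED
        self.max_resubmits = max_resubmits

    def check_alive(self):
        """Called on every heartbeat: free finished slots, then refill them."""
        for job in list(self.running):  # copy keys, the dict is modified below
            status = self.get_status(job)
            if status == DONE:
                del self.running[job]
                self.processed.append(job)
            elif status == FAILED:
                if self.running[job] >= self.max_resubmits:
                    # Give up on this job so it stops blocking a slot.
                    del self.running[job]
                    self.processed.append(job)
                else:
                    self.running[job] += 1
                    self.submit(job)
        # Start new jobs while there are empty slots.
        while self.pending and len(self.running) < self.max_processes:
            job = self.pending.popleft()
            self.running[job] = 0
            self.submit(job)

    def all_jobs_done(self):
        """True once every job has been processed, successfully or not."""
        return len(self.processed) == self.total
```

With three jobs and max_processes=2, the first call to check_alive() submits only two jobs; the third is submitted on a later heartbeat, once one of the first two has finished or been dropped.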