Queue jobs in internal queue instead of dumping all jobs on cluster at once

Open sandeepklr opened this issue 9 years ago • 3 comments

Hi Dan,

I have modified the code to add an internal queue that caps the number of jobs running on the cluster at any one time.

Please find below a summary of the changes:

  • Use the max_processes parameter as the maximum number of cluster jobs to run at once.
  • Create a Session object before starting JobMonitor and embed the Session object in the job monitor.
      - Everywhere that used a session_id now uses the Session object embedded in the JobMonitor.
  • Function _submit_jobs() is no longer used. All jobs are submitted from the JobMonitor using _append_job_to_session().
  • The check_alive() function has been refactored into two functions, check_alive() and check_job_status():
      - check_alive() is still called every time the local heartbeat is received.
      - check_alive() goes through the queue and looks for jobs to remove, either because they have finished or because they have hit the maximum number of resubmits after errors. Depending on the number of empty slots, new jobs are spun up.
  • all_jobs_done() is now simplified to just check that ALL jobs have been processed on the cluster.
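
To make the queueing behaviour above concrete, here is a minimal, self-contained sketch. It is not the actual diff and does not use DRMAA or the real JobMonitor: the Job and JobQueue classes, the MAX_RESUBMITS constant, the status strings, and the submit callback are all hypothetical stand-ins. Putting the queue scan in a method called check_job_status() is also an assumption made for illustration.

```python
from collections import deque

MAX_RESUBMITS = 3  # hypothetical limit, not taken from the PR


class Job:
    """Hypothetical stand-in for a cluster job."""

    def __init__(self, name):
        self.name = name
        self.status = "pending"  # later: "running", "done", "error", "failed"
        self.num_resubmits = 0


class JobQueue:
    """Keeps at most max_processes jobs submitted at any one time."""

    def __init__(self, jobs, max_processes, submit):
        self.waiting = deque(jobs)   # jobs not yet sent to the cluster
        self.running = []            # jobs currently occupying a slot
        self.processed = []          # jobs that finished or gave up
        self.max_processes = max_processes
        self.submit = submit         # callback that sends one job to the cluster
        self.total = len(self.waiting)

    def check_job_status(self):
        """Remove finished/failed jobs, resubmit errors, then fill empty slots."""
        still_running = []
        for job in self.running:
            if job.status == "done":
                self.processed.append(job)
            elif job.status == "error":
                if job.num_resubmits >= MAX_RESUBMITS:
                    # Out of retries: drop it from the queue for good.
                    job.status = "failed"
                    self.processed.append(job)
                else:
                    # Retry in the same slot.
                    job.num_resubmits += 1
                    job.status = "running"
                    self.submit(job)
                    still_running.append(job)
            else:
                still_running.append(job)
        self.running = still_running

        # Spin up new jobs for however many slots are now free.
        while self.waiting and len(self.running) < self.max_processes:
            job = self.waiting.popleft()
            job.status = "running"
            self.submit(job)
            self.running.append(job)

    def all_jobs_done(self):
        """True once every job has been processed (finished or given up)."""
        return len(self.processed) == self.total
```

In this sketch, the heartbeat handler would call check_job_status() on each tick, so the cluster never sees more than max_processes jobs from one run, and all_jobs_done() reduces to a single count comparison.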

sandeepklr, Jun 21 '15 14:06

Code Health Repository health increased by 25% when pulling 174ab9b on sandeepklr:master into c291881 on pygridtools:master.

landscape-bot, Jun 21 '15 14:06

Code Health Repository health increased by 26% when pulling a3a0b7b on sandeepklr:master into c291881 on pygridtools:master.

landscape-bot, Jun 23 '15 05:06

Hi @sandeepklr, can you please refresh this PR if you are still interested in merging it? Thanks!

desilinguist, Apr 26 '21 22:04