gridmap
Queue jobs in internal queue instead of dumping all jobs on cluster at once
Hi Dan,
I have modified the code to include an internal queue that caps the number of jobs running on the cluster at any time.
Please find below a summary of the changes:
- Use max_processes parameter for maximum # of cluster jobs to run at once.
- Create a Session object before starting JobMonitor and embed the Session object in the job monitor.
- Everywhere that used a session_id now uses the embedded session object in the JobMonitor.
- Function _submit_jobs() is no longer used. All jobs are submitted from the JobMonitor using _append_job_to_session()
- check_alive() function has been refactored into two functions: check_alive() and check_job_status():
  - check_alive() is still called every time the local heartbeat is received.
  - check_alive() goes through the queue and looks for jobs to remove, either because they have finished or because they have hit the maximum number of resubmits after errors. Depending on the number of empty slots, new jobs are spun up.
- all_jobs_done() is now simplified to just check that ALL jobs have been processed on the cluster.
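
For readers skimming the change list, here is a rough, standalone sketch of the queueing idea described above. It is not the code in this PR: the `Job` and `JobQueueSketch` classes, the `MAX_RESUBMITS` constant, and the `submit` callback are hypothetical stand-ins, and no DRMAA session is involved. It only illustrates capping running jobs at `max_processes`, dropping finished or exhausted jobs on each heartbeat, and refilling empty slots.

```python
from collections import deque

MAX_RESUBMITS = 3  # hypothetical cap on resubmissions after errors


class Job:
    """Hypothetical stand-in for a cluster job."""

    def __init__(self, name):
        self.name = name
        self.status = "pending"  # pending, running, done, error, or failed
        self.num_resubmits = 0


class JobQueueSketch:
    """Keeps at most max_processes jobs submitted at any time."""

    def __init__(self, jobs, max_processes, submit):
        self.pending = deque(jobs)
        self.running = []
        self.finished = []
        self.max_processes = max_processes
        self.submit = submit  # assumed callback that sends a job to the cluster
        self._total = len(self.pending)

    def check_alive(self):
        """Called on every heartbeat: prune the running list, then refill slots."""
        still_running = []
        for job in self.running:
            if job.status == "done":
                self.finished.append(job)
            elif job.status == "error":
                if job.num_resubmits >= MAX_RESUBMITS:
                    # Give up on this job and free its slot.
                    job.status = "failed"
                    self.finished.append(job)
                else:
                    # Resubmit in place; the job keeps its slot.
                    job.num_resubmits += 1
                    job.status = "running"
                    self.submit(job)
                    still_running.append(job)
            else:
                still_running.append(job)
        self.running = still_running

        # Spin up new jobs for however many slots are now empty.
        while self.pending and len(self.running) < self.max_processes:
            job = self.pending.popleft()
            job.status = "running"
            self.submit(job)
            self.running.append(job)

    def all_jobs_done(self):
        """True once every job has been processed, successfully or not."""
        return len(self.finished) == self._total
```

In this sketch a failed job counts as processed, which matches the description of all_jobs_done() only loosely; the real implementation may track failures differently.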
Repository health increased by 25% when pulling 174ab9b on sandeepklr:master into c291881 on pygridtools:master.
- 12 new problems were found (including 4 errors and 6 code smells).
- 67 problems were fixed (including 65 errors and 1 code smell).
Repository health increased by 26% when pulling a3a0b7b on sandeepklr:master into c291881 on pygridtools:master.
- 12 new problems were found (including 3 errors and 7 code smells).
- 67 problems were fixed (including 65 errors and 1 code smell).
Hi @sandeepklr, can you please refresh this PR if you are still interested in merging it? Thanks!