Submitty
Parallelize worker software/system updates
What problem are you trying to solve with Submitty

When INSTALL_SUBMITTY.sh is run on the primary machine, it kicks off updates on each of the worker machines. These updates may include costly/slow recompilation of software and downloads of new packages, docker images, etc. Currently all of these updates are done serially, one machine at a time. The system is partially offline until all updates are completed, so the downtime is the sum of the update times of all the machines.
Describe the way you'd like to solve this problem

Instead, we would like to do the worker updates in parallel, so that the offline time is equal to the update time of the slowest machine rather than the sum over all machines.
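A minimal sketch of the idea, assuming the per-worker update can be wrapped in a Python function. `update_worker` and `update_all_parallel` are hypothetical names, and the update is simulated with a sleep; nothing here is taken from the existing Submitty scripts.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def update_worker(name, seconds):
    """Hypothetical stand-in for one worker's update; it only sleeps."""
    time.sleep(seconds)
    return name


def update_all_parallel(durations):
    """Start every update at once and return the elapsed wall-clock time."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=max(1, len(durations))) as pool:
        futures = [pool.submit(update_worker, n, s) for n, s in durations.items()]
        for future in futures:
            future.result()  # wait for completion and re-raise any exception
    return time.monotonic() - start
```

With simulated durations of 0.1, 0.2, and 0.3 seconds, the elapsed time should be close to 0.3 seconds (the slowest machine) rather than 0.6 seconds (the sum). Threads seem sufficient here because each update would mostly wait on a remote machine rather than use local CPU.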
OTHER NOTES:
This should be tested with a variety of worker machine states:
- A machine that is intentionally disabled (the machine is set to "enabled" : false in the /usr/local/submitty/config/autograding_workers.json file).
- A machine that is very temporarily unreachable due to a very short network outage (< 1 minute). In this case we should retry the connection a short time later. If the machine is still not reachable after multiple attempts over a modest amount of time (more than a few minutes), then the machine should be skipped and the end result of the command should be a warning/error.
- A machine that was reached initially in the update, but has crashed or otherwise become unresponsive. This machine should be skipped and the end result of the update command should be a warning/error.
- A machine that is marked "enabled" : true but cannot be reached at all. There was/is code in the update scripts to automatically mark this machine as "enabled" : false, but I'm not sure that feature is functional, and I'm not sure if that's the design we want to use. This machine should be skipped and the end result of the update command should be a warning/error.
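The cases above could be handled roughly as follows. This is a sketch under stated assumptions, not the existing update code: `update_with_retry`, `update_workers`, and `exit_code` are hypothetical helpers, `workers` is assumed to be a dict of per-machine settings with an "enabled" key, a failed connection is assumed to surface as `ConnectionError`, and the retry count and delay are placeholder values.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def update_with_retry(name, update_fn, attempts=5, delay=45.0, sleep=time.sleep):
    """Run update_fn(name), retrying only when the machine cannot be reached."""
    for attempt in range(1, attempts + 1):
        try:
            update_fn(name)
            return "updated"
        except ConnectionError:
            # Assumed signal for "could not connect"; wait, then try again.
            if attempt < attempts:
                sleep(delay)
        except Exception:
            # Any other error is treated as a crash partway through the update.
            return "failed"
    return "unreachable"


def update_workers(workers, update_fn, **retry_opts):
    """Update all enabled workers in parallel and return {name: status}."""
    results = {n: "disabled" for n, w in workers.items() if not w.get("enabled", True)}
    enabled = [n for n in workers if n not in results]
    with ThreadPoolExecutor(max_workers=max(1, len(enabled))) as pool:
        futures = {
            n: pool.submit(update_with_retry, n, update_fn, **retry_opts)
            for n in enabled
        }
        for name, future in futures.items():
            results[name] = future.result()
    return results


def exit_code(results):
    """Return nonzero if any enabled worker was skipped or failed."""
    return 0 if all(s in ("updated", "disabled") for s in results.values()) else 1
```

This sketch only reports an unreachable machine and leaves its "enabled" setting untouched, since whether the config file should be rewritten automatically is still an open design question. A machine that hangs without raising an error would additionally need a timeout around `update_fn`, which is not shown.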