Toil-Internal Scheduler to run multiple Toil jobs per Batch System job, to make Toil jobs lightweight
General note: Toil is a pipeline manager that sits on top of cluster or cloud schedulers. This is the same niche filled by Cromwell and Nextflow, so it may be worth looking at how those systems interface with cloud and batch systems.
JobTree was designed around the assumption that the scheduler could handle a large number of jobs very quickly, which its target scheduler, Parasol, could. This design specifically supported the desired Cactus programming model. When Toil was created from JobTree, this assumption was never revisited.
This continues to be a problem for Cactus in all environments. It is especially problematic on HPC batch systems, where scheduling jobs and checking their status is expensive. It is also related to the problems with container systems, where starting a new container per job is expensive.
The proposal is to separate the concept of Toil jobs (t-jobs) from batch system jobs (bs-jobs). Each bs-job would be long-lived and execute one t-job at a time. Each bs-job would run a new Toil process (the bs-runner) to run and monitor t-jobs on that bs-job. The bs-runner communicates with the Toil leader: the leader sends one t-job at a time to the bs-runner, which runs it to completion and returns the t-job status to the leader; the leader then dispatches another t-job to that bs-runner.
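To make that protocol concrete, here is a minimal, runnable Python sketch. The TJob class, run_bs_runner function, and the in-process queue standing in for the real leader/runner transport are all illustrative assumptions, not existing Toil APIs.

```python
# Sketch of the proposed leader <-> bs-runner protocol. An in-process queue
# stands in for whatever real transport is eventually chosen.
import queue
import subprocess
import threading
from dataclasses import dataclass
from typing import Optional

@dataclass
class TJob:
    job_id: str
    command: list  # argv to execute for this t-job

def run_bs_runner(work_q: "queue.Queue[Optional[TJob]]",
                  result_q: "queue.Queue[tuple]") -> None:
    """One bs-job: run t-jobs handed out by the leader, one at a time."""
    while True:
        t_job = work_q.get()
        if t_job is None:          # Leader signals this bs-job to exit.
            break
        exit_code = subprocess.call(t_job.command)
        result_q.put((t_job.job_id, exit_code))  # Report status to the leader.

if __name__ == "__main__":
    work_q: "queue.Queue[Optional[TJob]]" = queue.Queue()
    result_q: "queue.Queue[tuple]" = queue.Queue()
    runner = threading.Thread(target=run_bs_runner, args=(work_q, result_q))
    runner.start()

    # Leader side: dispatch one t-job at a time, wait for it, then send the next.
    for t_job in [TJob("hello", ["echo", "hello"]), TJob("world", ["echo", "world"])]:
        work_q.put(t_job)
        job_id, exit_code = result_q.get()
        print(f"t-job {job_id} finished with exit code {exit_code}")

    work_q.put(None)   # Tell the bs-runner to shut down.
    runner.join()
```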
A cluster then consists of a fixed number of bs-jobs (this is --maxCores). When a bs-job has lived for a specified time, it exits to allow other cluster users in. The Toil leader can then create new bs-jobs.
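A rough sketch of how a fixed-size, lifetime-limited pool might be maintained follows; the pool size, lifetime, and the submit_bs_job/poll_finished_bs_jobs hooks are placeholders for the real batch-system calls, not existing Toil code.

```python
# Hypothetical sketch: bs-jobs exit after a maximum wall-clock time so other
# cluster users get a turn, and the leader resubmits replacements to keep the
# pool at its fixed size.
import time

MAX_BS_JOBS = 32                # e.g. derived from --maxCores
BS_JOB_MAX_LIFETIME = 4 * 3600  # seconds a bs-job may live before exiting

def bs_runner_should_exit(start_time: float) -> bool:
    """Checked between t-jobs; a running t-job is never interrupted."""
    return time.time() - start_time > BS_JOB_MAX_LIFETIME

def maintain_bs_job_pool(live_bs_jobs: set, submit_bs_job, poll_finished_bs_jobs):
    """Leader-side loop body: replace bs-jobs that exited at end of life."""
    for finished in poll_finished_bs_jobs():
        live_bs_jobs.discard(finished)
    while len(live_bs_jobs) < MAX_BS_JOBS:
        live_bs_jobs.add(submit_bs_job())

if __name__ == "__main__":
    live = set()
    new_ids = iter(range(1000))
    maintain_bs_job_pool(live, lambda: next(new_ids), lambda: [])
    print(len(live), "bs-jobs live")  # -> MAX_BS_JOBS
```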
While this is very straightforward scheduling from a CS perspective, the implementation within the Toil framework may be a fair bit of work, although I suspect it may simplify things in the long run by better partitioning functionality.
Toil would be responsible for queuing jobs. It already has all the job information in memory, and this would give it explicit control over service job queuing. It might also make deadlock detection easier.
Toil would need to manage pools of bs-jobs with different attributes, such as memory and maximum run time. Since it isn't trying to pack multiple t-jobs into a bs-job, the memory request can be driven by what the t-jobs require. It can also be fairly simplistic about stopping bs-jobs to allow starting larger-memory ones.
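As an illustration of one possible pooling policy (not existing Toil code), the sketch below groups queued t-jobs into coarse memory buckets; because each bs-job runs only one t-job at a time, a pool's memory request only needs to cover the largest t-job routed to it. The bucket sizes and function names are assumptions.

```python
# Illustrative policy: assign each t-job to the smallest bs-job pool that can
# hold it, and count demand per pool to drive bs-job submission.
from collections import defaultdict

POOL_MEMORY_GIB = [4, 16, 64]  # example bs-job pool sizes, in GiB

def pool_for(t_job_memory_gib: float) -> int:
    """Pick the smallest pool whose bs-jobs can hold this t-job."""
    for size in POOL_MEMORY_GIB:
        if t_job_memory_gib <= size:
            return size
    raise ValueError(f"No pool large enough for {t_job_memory_gib} GiB")

def plan_pools(queued_t_job_memories):
    """Count how many queued t-jobs want each pool."""
    demand = defaultdict(int)
    for mem in queued_t_job_memories:
        demand[pool_for(mem)] += 1
    return dict(demand)

if __name__ == "__main__":
    print(plan_pools([2, 3.5, 12, 48]))  # -> {4: 2, 16: 1, 64: 1}
```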
The Toil leader's threading model might become a bottleneck for scheduling jobs, but at least that process would be the only one overwhelmed.
Toil currently has two methods of communicating with nodes: the file system or object store (the job store) and the commands used to talk to the batch system. It is unclear whether using the job store as the communication mechanism would work under this model.
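To make that open question concrete, here is one possible shape job-store-based communication could take: the leader writes a t-job assignment file into a per-bs-job directory in the shared job store, and the bs-runner polls for it. The paths, file names, and functions are hypothetical; whether polling a shared file system or object store is cheap and responsive enough is exactly the question above.

```python
# Hypothetical file-based message passing through a shared job-store directory.
# leader_assign() runs in the leader process; runner_wait_for_assignment() runs
# in the bs-runner on the worker node.
import json
import os
import time

def leader_assign(jobstore_dir: str, bs_job_id: str, t_job: dict) -> None:
    path = os.path.join(jobstore_dir, bs_job_id, "assigned.json")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(t_job, f)
    os.rename(tmp, path)  # Atomic rename so the runner never sees a partial file.

def runner_wait_for_assignment(jobstore_dir: str, bs_job_id: str,
                               poll_s: float = 1.0) -> dict:
    path = os.path.join(jobstore_dir, bs_job_id, "assigned.json")
    while not os.path.exists(path):
        time.sleep(poll_s)  # Polling cost/latency is the concern with this approach.
    with open(path) as f:
        t_job = json.load(f)
    os.remove(path)  # Consume the assignment.
    return t_job
```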
Issue is synchronized with Jira Story TOIL-553.
➤ Adam Novak commented:
This is a lot of work and not on the path from where we are to having a nice Slurm-backed WDL solution, and Cactus is getting by with its hacks for now.
➤ Adam Novak commented:
We now have a good Slurm-backed WDL solution, and I think heavyweight Toil jobs are slowing it down, so we could think about revisiting this.
Part of the workaround we have now is running lots of local jobs to do interpreter work on the leader machine without scheduling them out.