signac-flow icon indicating copy to clipboard operation
signac-flow copied to clipboard

Outsource cluster submission

Open vyasr opened this issue 5 years ago • 1 comments
trafficstars

Problem description

One of the hardest parts of flow to maintain and test is the submission of jobs to clusters. The definition and execution of a workflow graph are well-defined and well-scoped problems that signac-flow solves well, but cluster submission is a somewhat separate problem, especially once more dynamic queuing behavior is desired. The various layers of sophistication that we've built over the years (environment classes, Jinja templates, flow groups in #114, etc) have helped us enable most of the features that we would like, but adding new environments and testing them sufficiently remains a cumbersome and somewhat error-prone task given the various possible submission configurations and resource requests.

Proposed solution

It would be very helpful if signac-flow could do all the work up to defining the command to be executed, and then hand off this command to another utility responsible for generating job scripts, submitting to the cluster, tracking jobs, etc. Multiple other tools exist for this purpose, but not all of them are suitable for usage with signac-flow. We should evaluate and test these to see if any of the options could potentially replace the submission components of signac-flow.

Desired features

Feel free to add anything that's missing to this list.

  • Serverless
  • Supports Slurm, Torque, LSF
  • Can be deployed on all the clusters we support
  • Accepts arbitrary commands for execution
  • Supports bundling of multiple job-operations (essentially, any pilot job system)

Options

  • MyQueue (https://gitlab.com/myqueue/myqueue): MyQueue is a lightweight pure Python library for submission to Slurm, LSF, and Torque clusters (which covers our needs). It has both a CLI and a Python API that we could leverage in signac. It can accept both Python scripts/modules and arbitrary strings for execution, so it seems similar to how we currently structure our submission model with FlowOperations. It's unclear if it has the same granularity when it comes to resource requests, but if not I suspect that it could be added with limited extra work. MyQueue has the benefit of being very lightweight, but the downside of probably not being extensively used or well-tested yet.
  • Parsl (https://parsl-project.org/): The Parsl project is supported out of Argonne National Labs and is designed to support concurrent operations on arbitrary platforms. The focus is on achieving concurrency via the extensive usage of futures, i.e. writing functions whose results are treated as promises that can be passed around and evaluated lazily when needed. Parsl has extensive support for configuring submission to essentially all of the machines we use, and even has explicit examples for pretty much everything other than Bridges here. Using Parsl would also allow us to use e.g. AWS or Kubernetes clusters. The primary downside I see with Parsl (assuming that it works) is that it is a larger dependency than anything we've been comfortable introducing behavior; conversely, it is widely used and probably can be depended on.
  • RADICAL-Pilot (https://radicalpilot.readthedocs.io/en/latest/): RADICAL-Pilot is developed by a research group at Rutgers as part of the RADICAL Cybertools project. Its primary stated focus is acquiring large resource sets and then distributing ComputeUnits across these resource sets. The project seems very powerful, but I'm more hesitant about it because configuration seems far more verbose than is required by the other tools. Also, and I think this is probably a dealbreaker, it requires a MongoDB backend. @bdice may have looked into this more at some point and have more information.
  • DASK (https://docs.dask.org/en/latest/): Dask is a Python package that serves a variety of parallel computing. The relevant parts are probably dask.jobqueue, which controls the interaction with specific types of queues, and dask.distributed, which I imagine is how execution would be configured and "submitted" to a dask jobqueue. I think @csadorf tried integrating flow with dask at some point and may have more of an idea on how feasible an option this is.

If anyone knows of or finds any other alternatives, feel free to add to this list. We should evaluate which, if any of these, would provide a suitable replacement for our own internal management.

vyasr avatar Jan 21 '20 17:01 vyasr

I haven't investigated RADICAL-Pilot very deeply, so I don't have much info to add there.

  • QCFractal (http://docs.qcarchive.molssi.org/projects/qcfractal/en/latest/) has many similar features, though it is currently built for mostly quantum chemistry use cases. I filed an issue a while back to propose generalizing its job/submission handling: https://github.com/MolSSI/QCFractal/issues/367

bdice avatar Jan 21 '20 19:01 bdice