[Req] LSF scheduler support
Description
Does the TorchX team have plans to support the LSF scheduler? If not, is there a guide for adding a new scheduler? I would be happy to make a PR.
Motivation/Background
Thanks for the TorchX utilities. We can target various schedulers by configuring .torchxconfig.
Detailed Proposal
It would be great if TorchX supported the LSF scheduler.
Hi there, adding a new scheduler to TorchX is quite straightforward. Here are the basic steps:
- Subclass the torchx.schedulers.Scheduler interface. There are a few methods you need to implement; the docstrings on the interface describe what each method should do and any assumptions.
- (Optional) Register the new scheduler implementation in the list of default schedulers. You only need to do this if you want everyone else to have access to the LSF scheduler. Otherwise you can register it only for yourself via Python entry points as described here.
- The unit tests for each file/function you add should go in the **/test directory as {file_name}_test.py. Our CI picks up all the *_test.py files automatically from **/test directories.
You can check out the AWS Batch and Slurm scheduler implementations for reference.
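To make the first step more concrete, here is a rough, self-contained sketch of the shape such a scheduler might take. It deliberately does not import TorchX: JobSpec, SchedulerSketch, and LsfSchedulerSketch are hypothetical stand-ins, not the real torchx.schedulers.Scheduler interface, and the "queue" run option is an assumed name. Only the bsub flags used (-J job name, -n number of tasks, -q queue) are standard LSF options. The idea shown is the dry-run step: translate a job description plus run config into the command that would be submitted.

```python
import shlex
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class JobSpec:
    """Hypothetical stand-in for an application definition."""

    name: str
    entrypoint: str
    args: List[str] = field(default_factory=list)
    num_tasks: int = 1


class SchedulerSketch(ABC):
    """Hypothetical stand-in for a scheduler interface (not the TorchX one)."""

    @abstractmethod
    def submit_dryrun(self, job: JobSpec, cfg: Dict[str, str]) -> List[str]:
        """Return the submission command without running it."""


class LsfSchedulerSketch(SchedulerSketch):
    def submit_dryrun(self, job: JobSpec, cfg: Dict[str, str]) -> List[str]:
        # bsub -J sets the job name and -n the number of tasks.
        cmd = ["bsub", "-J", job.name, "-n", str(job.num_tasks)]
        # "queue" is an assumed run option name; -q selects the LSF queue.
        queue = cfg.get("queue")
        if queue:
            cmd += ["-q", queue]
        # The command to run follows the bsub options.
        cmd += [job.entrypoint, *job.args]
        return cmd


if __name__ == "__main__":
    job = JobSpec(name="trainer", entrypoint="python", args=["train.py"], num_tasks=4)
    print(shlex.join(LsfSchedulerSketch().submit_dryrun(job, {"queue": "batch"})))
```

Returning the command as a list rather than executing it keeps the sketch easy to unit test, which fits the *_test.py convention mentioned above. A real implementation would also need to cover submission, status, cancellation, and logs as required by the actual interface.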
No need to do anything special for .torchxconfig to pick up the settings for the new scheduler. You can add a section like
# .torchxconfig
[lsf]
runcfg1 = value1
runcfg2 = value2
...
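As an illustration of how a section like this maps to per-scheduler settings, the sketch below parses the same INI-style text with Python's standard configparser and returns the keys under a given scheduler name. This is not TorchX's own loading code; load_scheduler_cfg is a hypothetical helper, and runcfg1/runcfg2 are the placeholder names from the example above.

```python
import configparser
from typing import Dict

# Same placeholder section as in the .torchxconfig example above.
TORCHXCONFIG_TEXT = """
[lsf]
runcfg1 = value1
runcfg2 = value2
"""


def load_scheduler_cfg(text: str, scheduler: str) -> Dict[str, str]:
    """Return the key/value pairs in the section named after the scheduler."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if not parser.has_section(scheduler):
        # No section for this scheduler means no configured defaults.
        return {}
    return dict(parser.items(scheduler))


if __name__ == "__main__":
    print(load_scheduler_cfg(TORCHXCONFIG_TEXT, "lsf"))
```

All values come back as strings, so a scheduler that expects integers or booleans would have to convert them itself.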
@ckddls1321 where are you running LSF? Are you trying to use this for work purposes?
@ckddls1321 How are you packaging code for running on LSF? Is it using Podman/Singularity, or just a shared NFS mount like Slurm?
A friend from Oak Ridge National Laboratory pointed me to https://code.ornl.gov/olcf-analytics/summit/distributed-deep-learning-examples/-/tree/master/examples/pytorch/BERT which is an example of how to run BERT on the Summit supercomputer via LSF.
Summit supports Podman, so it maps well to our Docker usage: https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=containers-lsf-podman
@ckddls1321 where are you running LSF? Are you trying to use this for work purposes?
I am considering TorchX both for work and for my personal research. Thanks for the suggestion; I will take a look at Podman. We use the same strategy as Summit, except that we use MPI to launch the distributed processes.
Hi @ckddls1321 @d4l3k @kiukchung, I created an LSF scheduler with support for Docker/Singularity and NFS. Please check my PR #588.
Landed as part of https://github.com/pytorch/torchx/commit/6360df39dc465a9e254045febaa0e9a04ae553de