
[Req] LSF scheduler support

Open ckddls1321 opened this issue 3 years ago • 6 comments

Description

LSF scheduler support: does the TorchX team plan to support the LSF scheduler? If not, is there a guide for writing a scheduler extension? I would be happy to make a PR.

Motivation/Background

Thanks for the TorchX utilities. We can target various schedulers by configuring .torchxconfig.

Detailed Proposal

It would be great if TorchX supported the LSF scheduler.

ckddls1321 avatar Mar 29 '22 04:03 ckddls1321

Hi there, adding a new scheduler to TorchX is quite straightforward. Here are the basic steps:

  1. Subclass the torchx.schedulers.Scheduler interface. There are a few methods you need to implement; the API documentation on the interface describes what each method should do and any assumptions it makes.
  2. (Optional) Register the new scheduler implementation in the list of default schedulers. You only need to do this if you want everyone else to have access to the LSF scheduler. Otherwise, you can register it only for yourself via Python entry points as described here.
  3. The unit tests for each file/function you add should go in the **/test directory as {file_name}_test.py. Our CI automatically picks up all *_test.py files from **/test directories.

You can check out the AWS Batch and Slurm scheduler implementations for reference.
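As a rough illustration of step 1, here is a minimal, self-contained sketch of what the core of an LSF scheduler might do: translate a job description into a `bsub` command line. It does not import TorchX. `JobSpec`, `SchedulerSketch`, `LsfSchedulerSketch`, `submit_dryrun`, and the `queue` config key are all hypothetical names made up for this example, not the real `torchx.schedulers.Scheduler` API, so check the actual interface for the methods you need to implement.

```python
import shlex
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class JobSpec:
    """Hypothetical job description (not a TorchX type)."""
    name: str
    command: List[str]
    num_slots: int = 1


class SchedulerSketch(ABC):
    """Hypothetical stand-in for a scheduler interface (not TorchX's real API)."""

    @abstractmethod
    def submit_dryrun(self, job: JobSpec, cfg: Dict[str, str]) -> List[str]:
        """Return the request that would be submitted, without submitting it."""


class LsfSchedulerSketch(SchedulerSketch):
    def submit_dryrun(self, job: JobSpec, cfg: Dict[str, str]) -> List[str]:
        # bsub flags used here: -J (job name), -n (number of slots), -q (queue).
        cmd = ["bsub", "-J", job.name, "-n", str(job.num_slots)]
        if "queue" in cfg:  # "queue" is an assumed run config key
            cmd += ["-q", cfg["queue"]]
        # The command to run goes last on the bsub command line.
        return cmd + job.command


job = JobSpec(name="trainer", command=["python", "train.py"], num_slots=4)
cmd = LsfSchedulerSketch().submit_dryrun(job, {"queue": "gpu"})
print(shlex.join(cmd))  # shell-quoted form, handy for logging a dry run
```

Keeping the command construction separate from the actual submission makes the logic easy to cover in a *_test.py file without access to an LSF cluster.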

No need to do anything special for .torchxconfig to pick up the settings for the new scheduler. You can add a section like this:

# .torchxconfig
[lsf]
runcfg1 = value1
runcfg2 = value2
...
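To see how such a section maps to key/value settings, here is a small sketch that parses the same INI-style text with Python's standard-library `configparser`. This is only an illustration of the file format, not TorchX's own config loader; `runcfg1` and `runcfg2` are the placeholder keys from the snippet above, and `TORCHXCONFIG` and `lsf_cfg` are names chosen for this example.

```python
import configparser

# Inline copy of the example section; in practice this would live in .torchxconfig.
TORCHXCONFIG = """
[lsf]
runcfg1 = value1
runcfg2 = value2
"""

parser = configparser.ConfigParser()
parser.read_string(TORCHXCONFIG)

# Each scheduler gets its own section, keyed by the scheduler name.
lsf_cfg = dict(parser["lsf"])
print(lsf_cfg)
```

All values come back as strings, so a scheduler that expects integers or booleans would need to convert them itself.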

kiukchung avatar Mar 30 '22 20:03 kiukchung

@ckddls1321 where are you running LSF? Are you trying to use this for work purposes?

d4l3k avatar Mar 31 '22 19:03 d4l3k

@ckddls1321 How are you packaging code for running on LSF? Are you using Podman/Singularity, or just a shared NFS mount like Slurm?

A friend from Oak Ridge National Laboratory pointed me to https://code.ornl.gov/olcf-analytics/summit/distributed-deep-learning-examples/-/tree/master/examples/pytorch/BERT which is an example of how to run BERT on the Summit supercomputer via LSF.

Summit supports Podman, so it maps well to our Docker usage: https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=containers-lsf-podman

d4l3k avatar Mar 31 '22 21:03 d4l3k

> @ckddls1321 where are you running LSF? Are you trying to use this for work purposes?

I am considering using TorchX both for work and for my personal research. Thanks for the suggestion; I will take a look at Podman. We use the same strategy as Summit does, but we use MPI to launch the distributed processes.

ckddls1321 avatar Apr 05 '22 06:04 ckddls1321

Hi @ckddls1321 @d4l3k @kiukchung, I created an LSF scheduler with Docker/Singularity and NFS support. Please check my PR #588.

takeshi-yoshimura avatar Aug 25 '22 13:08 takeshi-yoshimura

Landed as part of https://github.com/pytorch/torchx/commit/6360df39dc465a9e254045febaa0e9a04ae553de

d4l3k avatar Oct 10 '22 22:10 d4l3k