
add LSF scheduler

takeshi-yoshimura opened this issue 3 years ago · 10 comments

I prototyped an LSF scheduler for torchx. It currently supports native, Docker, and Singularity runtimes with a shared filesystem. I confirmed it works with Gloo and NCCL on small VPC V100 clusters.

Note: the torchx log command is available only when the torchx host shares a filesystem with the cluster nodes (e.g., via NFS).

In a nutshell, the LSF scheduler translates a torchx request into LSF job submissions (i.e., bsub). For distributed apps, it issues multiple bsub calls. I also added lsf to scripts/component_integration_tests.py. Here is the log output from my three-node LSF cluster; you can find the dryrun results in it.

component_integration_tests.lsf.txt
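
As a rough illustration of the translation, torchx's standard --dryrun flag prints the scheduler request without submitting anything, and each replica maps to its own bsub call. The sketch below is hypothetical (the -J/-o/-e names and paths are illustrative; the exact arguments are generated by the scheduler and shown in the attached log):

$ torchx run --dryrun -s lsf -cfg jobdir=/mnt/data/torchx,runtime=native utils.echo --msg hello_world --num_replicas 2
$ # roughly one submission per replica, e.g.:
$ bsub -J echo-0 -o /mnt/data/torchx/echo-0.out -e /mnt/data/torchx/echo-0.err ...
$ bsub -J echo-1 -o /mnt/data/torchx/echo-1.out -e /mnt/data/torchx/echo-1.err ...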

Regarding Singularity image compatibility, Singularity already automates converting Docker images to its own image format, so all we have to do is generate singularity exec arguments from torchx requests. Note that users still need to prefix image names with docker:// if they want to use Docker images.
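
For reference, this conversion is built into Singularity itself and works independently of this PR; pulling and running a Docker image looks like:

$ singularity exec docker://alpine:latest echo hello_world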

The following are example commands.

Example: native hello_world and CLI utils

$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=native utils.echo --msg hello_world --num_replicas 3
lsf://torchx/echo-pxc3gn5ct061k
$ torchx list -s lsf
$ torchx status lsf://torchx/echo-pxc3gn5ct061k
$ torchx cancel lsf://torchx/echo-pxc3gn5ct061k
$ torchx log --stream stdout lsf://torchx/echo-pxc3gn5ct061k/echo/0

Example: Docker hello_world

$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=docker utils.echo --image alpine:latest --msg hello_world --num_replicas 3

Example: Singularity hello_world

$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=singularity utils.echo --image docker://alpine:latest --msg hello_world --num_replicas 3

Example: Docker Distributed

$ cp scripts/dist_app.py /mnt/data/dist/
$ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=docker,host_network=True" dist.ddp -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"

Example: Singularity Distributed

$ cp scripts/dist_app.py /mnt/data/dist/
$ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=singularity,host_network=True" dist.ddp --image docker://ghcr.io/pytorch/torchx:0.3.0dev0 -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"

takeshi-yoshimura avatar Aug 25 '22 13:08 takeshi-yoshimura

@d4l3k Thank you for commenting on my code! As you point out, it seems much better to use the Python library and focus on Docker. I will try to fix the code and push it again here next week.

takeshi-yoshimura avatar Aug 26 '22 12:08 takeshi-yoshimura

@takeshi-yoshimura there's a balance here -- not sure how hard it is to install the lsf library. If it's painful maybe that's not the best option

d4l3k avatar Aug 26 '22 19:08 d4l3k

@takeshi-yoshimura what are your plans for updating this PR? We'd like to include it, but it still needs some polish

d4l3k avatar Sep 16 '22 16:09 d4l3k

@d4l3k The Python library for LSF does not seem to provide prebuilt pip binaries. As far as I have tested, we need to download and build its code on the running LSF nodes. That may make test cases and the build process difficult for torchx.

Regarding an LSF Docker image for local tests, I have found no official one so far...

Let me search more and share an update here soon. Sorry for my late response; I think I can concentrate on revising this PR this week.

takeshi-yoshimura avatar Sep 19 '22 13:09 takeshi-yoshimura

I updated lsf_scheduler.py according to your comments. Can you please take a look? @d4l3k

Honestly speaking, I don't recommend using lsf-python-api. Its critical weakness is that it has no support for job submissions with GPUs (https://github.com/IBMSpectrumComputing/lsf-python-api/issues/36). It also only offers low-level, complex Python interfaces, and I couldn't find good documentation on how to use them.

I am also concerned about tests, as you pointed out. As far as I have investigated, no public container images are currently available. The issue was also discussed in dask-jobqueue (https://github.com/dask/dask-jobqueue/issues/115). As discussed there, we can download the LSF Suite Community Edition to build a Docker image. You need an IBM account to download it, but it's distributed under a free license that provides enough capability for testing (a single GPU and a limited number of resources). Here are my test Dockerfile and other scripts (we probably cannot publish the resulting image). Maybe we can add this kind of code for testing.

Dockerfile (lsfsce10.2.0.12-x86_64.tar.gz is downloaded from here):

# Base on a CUDA devel image so GPU jobs can be tested
FROM nvidia/cuda:11.7.1-devel-ubuntu20.04
ARG LSFSCE10_2_0_12=lsfsce10.2.0.12-x86_64.tar.gz
COPY $LSFSCE10_2_0_12 /
COPY startserver.sh /
COPY myinstall.config /

ENV HOSTNAME lsf

RUN apt-get update && apt-get install -y python3 python3-pip swig git ed vim && rm -rf /var/cache/apt/* && \
    useradd -m lsfadmin && \
    # Unpack the LSF Suite Community Edition installer and run a silent install
    cd / && tar xzf lsfsce10.2.0.12-x86_64.tar.gz && cd lsfsce10.2.0.12-x86_64/lsf && tar xzf lsf10.1_lsfinstall_linux_x86_64.tar.Z && cd lsf10.1_lsfinstall && \
    ./lsfinstall -f /myinstall.config && rm -rf /myinstall.config /lsfsce10.2.0.12-x86_64* && echo "LSF_ROOT_USER=Y" >> /usr/share/lsf/conf/lsf.conf && \
    # Enable the extended bsub -gpu syntax for GPU submissions
    echo "LSB_GPU_NEW_SYNTAX=extend" >> /usr/share/lsf/conf/lsf.conf && \
    echo 'source /usr/share/lsf/conf/profile.lsf' >> /home/lsfadmin/.bashrc && echo 'source /usr/share/lsf/conf/profile.lsf' >> /root/.bashrc && \
    # Build and install lsf-python-api from source (no prebuilt pip wheels exist)
    cd / && git clone https://github.com/IBMSpectrumComputing/lsf-python-api.git && cd lsf-python-api && \
    . /usr/share/lsf/conf/profile.lsf && python3 setup.py build && python3 setup.py install && cd / && rm -rf /lsf-python-api

USER root

This Dockerfile also installs lsf-python-api. I found a pip package for it, but it has not been updated in years: https://pypi.org/project/platform-python-lsf-api/.

myinstall.config:

LSF_TOP=/usr/share/lsf
LSF_ADMINS=lsfadmin
LSF_CLUSTER_NAME=lsf
LSF_MASTER_LIST=lsf
SILENT_INSTALL=Y
LSF_SILENT_INSTALL_TARLIST=ALL
ACCEPT_LICENSE=Y

startserver.sh:

#!/bin/bash
# Load the LSF environment, then start the LSF daemons
source /usr/share/lsf/conf/profile.lsf
lsf_daemons start
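
To try this locally, the image can be built and the daemons brought up roughly as follows (hypothetical commands: the lsf-ce tag is mine, and --hostname lsf must match LSF_MASTER_LIST in myinstall.config above; LSF_ROOT_USER=Y in the Dockerfile allows root to submit jobs):

$ docker build -t lsf-ce .
$ docker run --rm -it --hostname lsf lsf-ce bash
$ /startserver.sh     # inside the container: start the LSF daemons
$ bsub -o /tmp/hello.out echo hello_world
$ bjobs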

takeshi-yoshimura avatar Sep 20 '22 11:09 takeshi-yoshimura

Public images for LSF were deleted in the past for security reasons. The official instructions for building an LSF image are at https://github.com/IBMSpectrumComputing/lsf-operator/blob/main/README-Building-the-images.md.

takeshi-yoshimura avatar Sep 22 '22 14:09 takeshi-yoshimura

Codecov Report

Merging #588 (f8e9a0b) into main (b70811e) will decrease coverage by 0.52%. The diff coverage is 89.23%.

@@            Coverage Diff             @@
##             main     #588      +/-   ##
==========================================
- Coverage   94.94%   94.42%   -0.53%     
==========================================
  Files          67       64       -3     
  Lines        4134     4429     +295     
==========================================
+ Hits         3925     4182     +257     
- Misses        209      247      +38     
Impacted Files                          Coverage Δ
torchx/schedulers/__init__.py           95.23% <ø> (ø)
torchx/schedulers/lsf_scheduler.py      89.23% <89.23%> (ø)
torchx/util/entrypoints.py              89.28% <0.00%> (-10.72%) ↓
torchx/specs/named_resources_aws.py     93.33% <0.00%> (-6.67%) ↓
torchx/runner/api.py                    94.87% <0.00%> (-2.03%) ↓
torchx/specs/__init__.py                94.28% <0.00%> (-2.02%) ↓
torchx/specs/api.py                     98.40% <0.00%> (ø)
torchx/util/types.py                    100.00% <0.00%> (ø)
torchx/cli/cmd_log.py                   95.74% <0.00%> (ø)
torchx/specs/finder.py                  96.98% <0.00%> (ø)
... and 9 more


codecov[bot] avatar Sep 23 '22 17:09 codecov[bot]

If lsf-python-api isn't in good shape, there's no strong need to use it. That was my concern when I looked at it before, so I'm glad you have the same opinion. The lack of GPU support is a big blocker, wow

For Slurm we just use the CLI and it's stable enough -- we can mock the inputs/outputs by patching subprocess, which works fairly well.

Re: testing -- having an LSF integration test isn't a blocker for landing this diff. We can mark this as a prototype and add integration testing in a follow-up diff.

If there's a way to programmatically fetch the LSF scheduler and install it using credentials, we can add some creds to GitHub secrets, which should keep them safe

Do we have any contacts at LSF? Wondering if this policy around Docker images is something we can get changed/exempted from. Is it an option to get a small managed LSF test cluster provided by IBM? We can chat more on Slack

d4l3k avatar Sep 23 '22 17:09 d4l3k

@takeshi-yoshimura I think the main thing blocking this particular diff is just adding some comprehensive unit tests (and fixing lint/pyre)

d4l3k avatar Sep 23 '22 21:09 d4l3k

@d4l3k I fixed lint and pyre and added unit tests; please check them. To unit-test without subprocess calls, I separated the parser logic from the LsfScheduler methods.

Do we have any contacts at LSF? Wondering if this policy around docker images is something that we can get changed/exempted from. Is it an option to get a small managed LSF test cluster provided by IBM? We can chat more on Slack

I'm afraid I cannot get approval for IBM-hosted machines just for this test. I also asked the LSF developers about the Docker images but got no good answers. To be honest, I have no idea how to solve this right now.

There is a forum page for LSF: https://community.ibm.com/community/user/businessanalytics/communities/community-home/digestviewer?communitykey=74d589b7-7276-4d70-acf5-0fc26430c6c0. I will keep asking within IBM, but we can also raise our issue on that open channel.

By the way, I requested a Slack invitation via https://pytorch.org/resources last week but have not received a reply yet. Is the system working?

If there's ways to programmatically fetch the LSF scheduler and install it using the credentials we can add some creds to GitHub secrets which should keep them safe

This is also difficult: the download page for the LSF Community Edition requires interactive authentication in a web browser.

takeshi-yoshimura avatar Oct 04 '22 14:10 takeshi-yoshimura

@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Oct 07 '22 17:10 facebook-github-bot

@d4l3k Thank you! I know this is only the first step for the LSF scheduler. I think I need to keep working on integration tests and documentation (plus Singularity, after the dependent changes). Do we have any other TODO items for the LSF scheduler?

takeshi-yoshimura avatar Oct 08 '22 04:10 takeshi-yoshimura

@takeshi-yoshimura The biggest gap right now is workspaces. Support for DockerWorkspace, to allow patching images before launching a job, would be very nice

If you want to add Singularity support, that'd be a big help -- it'd be pretty nice to add some abstraction for the container execution side here (i.e., the same interface for Docker vs. Singularity) that we could reuse across a couple of schedulers

Testing would also be a big help, as would circulating this to get some user feedback

d4l3k avatar Oct 10 '22 22:10 d4l3k