
Allow 0 worker in pytorch plugins & Add objectMeta to PyTorchJob

Open · ByronHsu opened this pull request 1 year ago · 11 comments

TL;DR

  1. Kubeflow PyTorch can be configured to use 0 workers when running distributed PyTorch jobs. In this case the training job runs on a single machine (the master node) without any additional worker nodes; see the sketch after this list.
  2. Pass objectMeta to PyTorchJob
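For context, a minimal flytekit-side sketch of what item 1 enables. This is hypothetical usage, not code from this PR: the `num_workers` argument follows the existing flytekitplugins-kfpytorch API, and the training body is a placeholder.

```python
from flytekit import task
from flytekitplugins.kfpytorch import PyTorch

# Hypothetical usage once the plugin accepts 0 workers: the PyTorchJob
# is created with only the master replica and no worker replicas.
@task(task_config=PyTorch(num_workers=0))
def train() -> float:
    # Placeholder for the actual single-machine training loop.
    return 0.0
```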

Type

  • [x] Bug Fix
  • [ ] Feature
  • [ ] Plugin

Are all requirements met?

  • [ ] Code completed
  • [ ] Smoke tested
  • [x] Unit tests added
  • [ ] Code documentation added
  • [ ] Any pending items have an associated Issue

ByronHsu · May 09 '23 04:05

Codecov Report

Merging #348 (dce4e46) into master (76a80ec) will increase coverage by 1.30%. The diff coverage is 100.00%.

:exclamation: Current head dce4e46 differs from pull request most recent head 2065a96. Consider uploading reports for the commit 2065a96 to get more accurate results.

@@            Coverage Diff             @@
##           master     #348      +/-   ##
==========================================
+ Coverage   62.76%   64.06%   +1.30%     
==========================================
  Files         148      148              
  Lines       12444    10080    -2364     
==========================================
- Hits         7810     6458    -1352     
+ Misses       4038     3026    -1012     
  Partials      596      596              
Flag Coverage Δ
unittests ?

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...o/tasks/plugins/k8s/kfoperators/pytorch/pytorch.go 79.80% <100.00%> (+0.28%) :arrow_up:

... and 130 files with indirect coverage changes

codecov[bot] · May 09 '23 05:05

Is there any benefit to using only one master node in the PyTorch CRD? People can just run PyTorch in a regular Python task for single-node training, right?

The only benefit I can think of is the env vars set by the operator, like world size, rank, ...? In the torch elastic task we opted to run single-worker trainings in a normal Python task/single pod so that users don't need the training operator.

fg91 · May 09 '23 07:05

In our case, we start with 0 workers since it's easier to debug, and then scale up to multiple workers.

Although a Python task can achieve the same thing, we shouldn't error out on 0 workers in the PyTorch plugin, because the PyTorch operator allows it.

ByronHsu · May 09 '23 14:05

> In our case, we start with 0 workers since it's easier to debug, and then scale up to multiple workers.
>
> Although a Python task can achieve the same thing, we shouldn't error out on 0 workers in the PyTorch plugin, because the PyTorch operator allows it.

Fair point. Users can still switch to a Python task if they prefer by removing the task_config=Pytorch(...) argument.
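For illustration, dropping the task config would leave a plain Python task like the sketch below (names and body are placeholders, not code from this PR):

```python
from flytekit import task

# Same training code as before, but run as a regular Python task/pod:
# no PyTorchJob CRD and no training operator involved.
@task
def train() -> float:
    return 0.0  # placeholder for the single-node training loop
```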

The only thing I wonder: in the new PyTorch elastic task we decided that with nnodes=1 (meaning only a single worker/pod) we will use a normal Python task/pod so that users can run it without the operator (see here). Should we then change this so that elastic and non-elastic PyTorch distributed training don't behave differently? @kumare3 argued that it would be nice to allow users to use torchrun without the operator.

fg91 · May 09 '23 16:05

I actually want to get rid of the required dependency on the PyTorch operator for simple single-node training, which suffices in many cases. This makes scaling really nice: you start with one node and simply scale to more nodes; to scale beyond one node you may need to deploy the operator.

This is why, when nnodes=1, we just change the task type itself. WDYT, @fg91 and @ByronHsu?

kumare3 · May 10 '23 00:05

Check this out: https://github.com/flyteorg/flytekit/blob/e44b8027bd0f59c0f90eb5f86bfa181aebad0d74/plugins/flytekit-kf-pytorch/flytekitplugins/kfpytorch/task.py#L112
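For readers who don't want to follow the link, the gist of that behaviour, reconstructed here as a rough sketch rather than the actual flytekit source, is roughly:

```python
from dataclasses import dataclass

@dataclass
class Elastic:
    # Illustrative stand-in for the real elastic task config.
    nnodes: int = 1

def resolve_task_type(task_config: Elastic) -> str:
    # Sketch of the idea discussed above (not the actual source):
    # a single node runs as a regular pod managed by FlytePropeller,
    # multiple nodes hand off to the training operator via a PyTorchJob CRD.
    if task_config.nnodes == 1:
        return "python-task"
    return "pytorch"
```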

kumare3 · May 10 '23 00:05

Could you elaborate on the drawbacks of using the PyTorch operator for single-node training?

ByronHsu · May 10 '23 00:05

@ByronHsu - FlytePropeller is far more efficient at allocating resources, retrying, and completing sooner. Also, for a single node this is faster, runs without needing the operator, and does not require a CRD to be created.

kumare3 · May 10 '23 04:05

@kumare3 Skipping the CRD can definitely be faster. Thanks, I will raise a corresponding PR in flytekit.

ByronHsu · May 10 '23 04:05

Will merge with https://github.com/flyteorg/flyteplugins/pull/345 and do an integration test.

ByronHsu · May 16 '23 05:05

Is this still supposed to be pushed over the finish line, or shall we close it?

fg91 · Aug 16 '23 09:08