
Allow 0 worker in pytorch plugins & Add objectMeta to PyTorchJob

Open · ByronHsu opened this pull request 1 year ago · 11 comments

TL;DR

  1. Kubeflow PyTorch can be configured to use 0 workers when running distributed PyTorch jobs. In this case the training job runs on a single machine (the master node) without any additional worker nodes; see the sketch after this list.
  2. Pass objectMeta to PyTorchJob
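For context, a minimal flytekit-side sketch of what item 1 enables. This is hypothetical usage, not code from this PR: the `num_workers` argument follows the existing flytekitplugins-kfpytorch API, and the training body is a placeholder.

```python
from flytekit import task
from flytekitplugins.kfpytorch import PyTorch

# Hypothetical usage once the plugin accepts 0 workers: the PyTorchJob
# is created with only the master replica and no worker replicas.
@task(task_config=PyTorch(num_workers=0))
def train() -> float:
    # Placeholder for the actual single-machine training loop.
    return 0.0
```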

Type

  • [x] Bug Fix
  • [ ] Feature
  • [ ] Plugin

Are all requirements met?

  • [ ] Code completed
  • [ ] Smoke tested
  • [x] Unit tests added
  • [ ] Code documentation added
  • [ ] Any pending items have an associated Issue

ByronHsu · May 09 '23 04:05

Codecov Report

Merging #348 (dce4e46) into master (76a80ec) will increase coverage by 1.30%. The diff coverage is 100.00%.

:exclamation: Current head dce4e46 differs from pull request most recent head 2065a96. Consider uploading reports for the commit 2065a96 to get more accurate results.

@@            Coverage Diff             @@
##           master     #348      +/-   ##
==========================================
+ Coverage   62.76%   64.06%   +1.30%     
==========================================
  Files         148      148              
  Lines       12444    10080    -2364     
==========================================
- Hits         7810     6458    -1352     
+ Misses       4038     3026    -1012     
  Partials      596      596              
Flag Coverage Δ
unittests ?

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...o/tasks/plugins/k8s/kfoperators/pytorch/pytorch.go 79.80% <100.00%> (+0.28%) :arrow_up:

... and 130 files with indirect coverage changes

codecov[bot] · May 09 '23 05:05

Is there any benefit to using only one master node in the PyTorch CRD? People can just run PyTorch in a regular Python task for single-node training, right?

The only benefit I can think of is the env vars set by the operator, like world size, rank, ...? In the torch elastic task we opted to run single-worker trainings in a normal Python task/single pod so that users don't need the training operator.

fg91 · May 09 '23 07:05

In our case, we start with 0 workers since it's easier to debug, and then scale up to multiple workers.

Although a Python task can achieve the same thing, we shouldn't error out on 0 workers in the PyTorch plugin, because the PyTorch operator allows it.

ByronHsu · May 09 '23 14:05

> In our case, we start with 0 workers since it's easier to debug, and then scale up to multiple workers.
>
> Although a Python task can achieve the same thing, we shouldn't error out on 0 workers in the PyTorch plugin, because the PyTorch operator allows it.

Fair point. Users can still switch to a Python task if they prefer by removing the task_config=Pytorch(...) argument.
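For illustration, dropping the task config would leave a plain Python task like the sketch below (names and body are placeholders, not code from this PR):

```python
from flytekit import task

# Same training code as before, but run as a regular Python task/pod:
# no PyTorchJob CRD and no training operator involved.
@task
def train() -> float:
    return 0.0  # placeholder for the single-node training loop
```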

The only thing I wonder: in the new PyTorch elastic task we decided that with nnodes=1 (meaning only a single worker/pod) we will use a normal Python task/pod so that users can run it without the operator (see here). Should we then change this so that elastic and non-elastic PyTorch distributed training don't behave differently? @kumare3 argued that it would be nice to allow users to use torchrun without the operator.

fg91 · May 09 '23 16:05

I actually want to get rid of the required dependency on the PyTorch operator for simple single-node training, which suffices in many cases. This makes scaling really nice: you start with one node and simply scale to more nodes; to scale beyond one node you may need to deploy the operator.

This is why, when nnodes=1, we just change the task type itself. WDYT, @fg91 and @ByronHsu?

kumare3 · May 10 '23 00:05

Check this out: https://github.com/flyteorg/flytekit/blob/e44b8027bd0f59c0f90eb5f86bfa181aebad0d74/plugins/flytekit-kf-pytorch/flytekitplugins/kfpytorch/task.py#L112
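For readers who don't want to follow the link, the gist of that behaviour, reconstructed here as a rough sketch rather than the actual flytekit source, is roughly:

```python
from dataclasses import dataclass

@dataclass
class Elastic:
    # Illustrative stand-in for the real elastic task config.
    nnodes: int = 1

def resolve_task_type(task_config: Elastic) -> str:
    # Sketch of the idea discussed above (not the actual source):
    # a single node runs as a regular pod managed by FlytePropeller,
    # multiple nodes hand off to the training operator via a PyTorchJob CRD.
    if task_config.nnodes == 1:
        return "python-task"
    return "pytorch"
```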

kumare3 · May 10 '23 00:05

Could you elaborate on the drawbacks of using the PyTorch operator for single-node training?

ByronHsu · May 10 '23 00:05

@ByronHsu - FlytePropeller is far more efficient at allocating resources, retrying, and completing sooner. Also, for a single node this is faster, runs without needing the operator, and does not require a CRD to be created.

kumare3 · May 10 '23 04:05

@kumare3 Skipping the CRD can definitely be faster. Thanks, I will raise a corresponding PR in flytekit.

ByronHsu · May 10 '23 04:05

Will merge with https://github.com/flyteorg/flyteplugins/pull/345 and do an integration test.

ByronHsu · May 16 '23 05:05

Is this still supposed to be pushed over the finish line, or shall we close it?

fg91 · Aug 16 '23 09:08