flyteplugins
Allow 0 workers in pytorch plugin & Add objectMeta to PyTorchJob
TL;DR
- Kubeflow PyTorch can be configured to use 0 workers when running distributed PyTorch jobs. In this case, the training job runs on a single machine (the master node) without any additional worker nodes (see the sketch after this list).
- Pass objectMeta to PyTorchJob
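For illustration, a minimal usage sketch of the 0-worker case from the flytekit side. The `PyTorch(num_workers=...)` task config comes from flytekitplugins-kf-pytorch; the exact signature and the `train_single_node` task are assumptions, not part of this PR.

```python
# Sketch (assumed API): a Flyte task configured with the Kubeflow PyTorch
# plugin and zero worker replicas, so training runs only on the master replica.
from flytekit import task
from flytekitplugins.kfpytorch import PyTorch


@task(task_config=PyTorch(num_workers=0))
def train_single_node() -> float:
    # Ordinary single-process training loop; the operator still creates the
    # PyTorchJob CRD, but only the master pod runs.
    return 0.0
```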
Type
- [x] Bug Fix
- [ ] Feature
- [ ] Plugin
Are all requirements met?
- [ ] Code completed
- [ ] Smoke tested
- [x] Unit tests added
- [ ] Code documentation added
- [ ] Any pending items have an associated Issue
Codecov Report
Merging #348 (dce4e46) into master (76a80ec) will increase coverage by 1.30%. The diff coverage is 100.00%.
:exclamation: Current head dce4e46 differs from pull request most recent head 2065a96. Consider uploading reports for the commit 2065a96 to get more accurate results.
@@ Coverage Diff @@
## master #348 +/- ##
==========================================
+ Coverage 62.76% 64.06% +1.30%
==========================================
Files 148 148
Lines 12444 10080 -2364
==========================================
- Hits 7810 6458 -1352
+ Misses 4038 3026 -1012
Partials 596 596
Flag | Coverage Δ |
---|---|
unittests | ? |

Flags with carried forward coverage won't be shown.
Impacted Files | Coverage Δ |
---|---|
...o/tasks/plugins/k8s/kfoperators/pytorch/pytorch.go | 79.80% <100.00%> (+0.28%) :arrow_up: |
Is there any benefit to using only one master node in the pytorch CRD? People can just run pytorch in a regular python task for single-node training, right?
I can only imagine the env vars set by the operator, like world size, rank, ...? In the torch elastic task we opted to run single-worker trainings in a normal python task/single pod so that users don't need the training operator.
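For context, the env vars in question are the standard rendezvous variables the training operator sets on every PyTorchJob replica (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK). A minimal sketch of consuming them, with `init_process_group_from_operator_env` being a hypothetical helper:

```python
import os

import torch.distributed as dist


def init_process_group_from_operator_env() -> None:
    # The Kubeflow training operator injects MASTER_ADDR, MASTER_PORT,
    # WORLD_SIZE and RANK into each replica; with 0 workers this becomes a
    # single-process group on the master pod.
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    rank = int(os.environ.get("RANK", "0"))
    dist.init_process_group(backend="gloo", world_size=world_size, rank=rank)
```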
In our case, we start with 0 workers since it's easier to debug, and then scale up to multiple workers.
Although a python task can achieve the same thing, we shouldn't error out on 0 workers in the PyTorch plugin, because that's what the PyTorch operator allows.
Fair point. Users can still switch to a python task if they prefer by removing the task_config=Pytorch(...).
The only thing I wonder: in the new pytorch elastic task we decided that with nnodes=1 (meaning only a single worker/pod) we use a normal python task/pod so that users can run it without the operator (see here). Should we maybe change this so that elastic and non-elastic pytorch distributed training don't behave differently? @kumare3 argued that it would be nice to allow users to use torchrun without the operator.
I actually want to get rid of the required dependency on the PyTorch Operator for simple single-node training, which suffices in many cases. This actually makes scaling really nice: you start with one node and simply scale to more nodes (to scale you may need to deploy the operator).
This is why when nnodes=1, we just change the task type itself. WDYT? @fg91 and @ByronHsu
check this out - https://github.com/flyteorg/flytekit/blob/e44b8027bd0f59c0f90eb5f86bfa181aebad0d74/plugins/flytekit-kf-pytorch/flytekitplugins/kfpytorch/task.py#L112
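In rough terms, the linked flytekit code switches the task type based on the node count. Here is a simplified sketch of that idea; the `task_type_for_elastic` helper and the task-type strings are illustrative, not the exact flytekit constants.

```python
def task_type_for_elastic(nnodes: int) -> str:
    # A single node needs no PyTorchJob CRD: run it as a plain python task so
    # FlytePropeller schedules an ordinary pod and torchrun can still be used
    # inside the container without the training operator.
    if nnodes == 1:
        return "python-task"
    # Multi-node elastic training still goes through the Kubeflow training operator.
    return "pytorch"
```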
Could you elaborate on the drawbacks of using the PyTorch Operator for the single-node training case?
@ByronHsu - FlytePropeller is far more efficient at allocating resources, retrying, and completing sooner. Also, for a single node this is faster: it runs without needing the operator and does not need a CRD to be created.
@kumare3 skipping the CRD part can definitely be faster. Thanks. I will raise a corresponding PR in flytekit.
Will merge with https://github.com/flyteorg/flyteplugins/pull/345 and do an integration test.
Is this still supposed to be pushed over the finish line, or shall we close it?