kubedl
feat: enable pytorch elastic training fashion based on torch elastic
Ⅰ. Describe what this PR does
This PR defines elastic training APIs, adds a torch-elastic controller, and implements the elastic training control flow in both the torch-elastic controller and the PyTorch controller. For now, the scaling algorithm relies on real-time per-batch training latency collected from the logs of running pods.
- [x] Elastic training APIs on the PyTorchJob spec.
- [x] Elastic training control flow implemented in the torch-elastic controller.
- [x] PyTorch elastic training job example.
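
To illustrate the latency-based scaling idea described above, here is a minimal Go sketch. It is not the code from this PR: the log format (`batch_latency=<seconds>`), the function names `parseBatchLatencies`, `mean`, and `nextReplicas`, and the "add one worker while latency stays within a tolerance, otherwise step back" rule are all assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// latencyPattern matches a HYPOTHETICAL log fragment such as
// "batch_latency=0.42". The real log format used by the controller
// is not stated in the PR description.
var latencyPattern = regexp.MustCompile(`batch_latency=([0-9]+(?:\.[0-9]+)?)`)

// parseBatchLatencies extracts per-batch latencies (in seconds) from
// pod log lines, skipping lines that carry no latency value.
func parseBatchLatencies(lines []string) []float64 {
	var out []float64
	for _, line := range lines {
		m := latencyPattern.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		v, err := strconv.ParseFloat(m[1], 64)
		if err != nil {
			continue
		}
		out = append(out, v)
	}
	return out
}

// mean returns the arithmetic mean, or 0 for an empty slice.
func mean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// nextReplicas applies an ASSUMED rule: with no baseline (prevLatency == 0)
// or with latency within the tolerance of the previous round, try one more
// worker; if latency regressed beyond the tolerance, step back by one.
// The result always stays within [minReplicas, maxReplicas].
func nextReplicas(current, minReplicas, maxReplicas int, prevLatency, curLatency, tolerance float64) int {
	if prevLatency == 0 || curLatency <= prevLatency*(1+tolerance) {
		if current < maxReplicas {
			return current + 1
		}
		return current
	}
	if current > minReplicas {
		return current - 1
	}
	return current
}

func main() {
	logs := []string{
		"epoch 0 step 10 batch_latency=0.5",
		"some unrelated line",
		"epoch 0 step 11 batch_latency=1.5",
	}
	cur := mean(parseBatchLatencies(logs))
	fmt.Printf("mean latency: %.2fs, next replicas: %d\n",
		cur, nextReplicas(2, 1, 4, 0, cur, 0.1))
}
```

A real controller would read logs through the Kubernetes API and would probably smooth the measurements over a window before deciding; the sketch only shows the shape of the decision.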
II. Does this pull request fix one issue?
https://github.com/kubedl-io/kubedl/issues/251
@wanziyu Hi wanziyu, thanks for your contribution! Before your PR can be merged into the master branch, you need to sign off your commits.
Codecov Report
Merging #267 (90832ba) into master (171c0d7) will increase coverage by 0.18%. The diff coverage is 2.98%.
@@ Coverage Diff @@
## master #267 +/- ##
==========================================
+ Coverage 28.93% 29.12% +0.18%
==========================================
Files 88 89 +1
Lines 5985 6260 +275
==========================================
+ Hits 1732 1823 +91
- Misses 4000 4174 +174
- Partials 253 263 +10
Flag | Coverage Δ | |
---|---|---|
unittests | 29.12% <2.98%> (+0.18%) | :arrow_up: |
Flags with carried forward coverage won't be shown.
Impacted Files | Coverage Δ | |
---|---|---|
apis/training/v1alpha1/pytorchjob_defaults.go | 17.85% <0.00%> (-1.38%) | :arrow_down: |
apis/training/v1alpha1/pytorchjob_types.go | 100.00% <ø> (ø) | |
apis/training/v1alpha1/zz_generated.deepcopy.go | 14.02% <0.00%> (-0.65%) | :arrow_down: |
controllers/pytorch/elastic_scale.go | 34.04% <ø> (ø) | |
controllers/pytorch/pytorchjob_controller.go | 0.52% <0.00%> (-0.09%) | :arrow_down: |
controllers/pytorch/util.go | 0.00% <0.00%> (ø) | |
pkg/job_controller/job.go | 24.92% <0.00%> (+0.23%) | :arrow_up: |
pkg/job_controller/service.go | 0.00% <0.00%> (ø) | |
pkg/job_controller/util.go | 20.83% <100.00%> (-1.39%) | :arrow_down: |
... and 11 more | | |