feat: enable pytorch elastic training fashion based on torch elastic

Open wanziyu opened this issue 2 years ago • 2 comments

I. Describe what this PR does

This PR defines elastic training APIs, adds a torch-elastic controller, and implements the elastic training control flow in both the torch-elastic controller and the PyTorch controller. Currently, the scaling algorithm is driven by the real-time per-batch training latency collected from the logs of running pods.

  • [x] Elastic training APIs in the PyTorchJob spec.
  • [x] Elastic training control flow implemented in the torch-elastic controller.
  • [x] PyTorch elastic training job example.
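
To illustrate the latency-driven scaling idea described above, here is a minimal Go sketch. It is not the code in this PR: the log line format (`latency=<seconds>s`), the function names `parseLatencies` and `nextReplicas`, and the "grow while latency improves by a minimum fraction, otherwise shrink" rule are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// latencyPattern matches a HYPOTHETICAL log line such as
// "epoch 1 batch 20 latency=0.412s". The real log format may differ.
var latencyPattern = regexp.MustCompile(`latency=([0-9]+(?:\.[0-9]+)?)s`)

// parseLatencies extracts per-batch latencies (in seconds) from pod log text.
// Lines that do not match the assumed format are skipped.
func parseLatencies(logs string) []float64 {
	var out []float64
	for _, line := range strings.Split(logs, "\n") {
		m := latencyPattern.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		v, err := strconv.ParseFloat(m[1], 64)
		if err != nil {
			continue
		}
		out = append(out, v)
	}
	return out
}

// mean returns the arithmetic mean, or 0 for an empty slice.
func mean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// nextReplicas is an ASSUMED decision rule, not the PR's algorithm:
// add one worker if latency improved by at least minGain (a fraction)
// since the previous observation, otherwise remove one worker.
// The result is clamped to [minReplicas, maxReplicas].
func nextReplicas(current, minReplicas, maxReplicas int, prevLatency, curLatency, minGain float64) int {
	next := current + 1 // no baseline yet: probe with one more worker
	if prevLatency > 0 {
		gain := (prevLatency - curLatency) / prevLatency
		if gain < minGain {
			next = current - 1
		}
	}
	if next > maxReplicas {
		next = maxReplicas
	}
	if next < minReplicas {
		next = minReplicas
	}
	return next
}

func main() {
	before := "batch 1 latency=0.52s\nbatch 2 latency=0.48s"
	after := "batch 3 latency=0.31s\nwarning: slow disk\nbatch 4 latency=0.29s"

	prev := mean(parseLatencies(before))
	cur := mean(parseLatencies(after))
	fmt.Printf("prev=%.3fs cur=%.3fs next replicas=%d\n",
		prev, cur, nextReplicas(2, 1, 4, prev, cur, 0.10))
}
```

A real controller would read the logs through the Kubernetes API and would need to guard against noisy measurements, for example by averaging over a window of batches before comparing.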

II. Does this pull request fix one issue?

https://github.com/kubedl-io/kubedl/issues/251

wanziyu • Aug 15 '22 05:08

@wanziyu Hi wanziyu, thanks for your contribution! Before we merge your PR into the master branch, please sign off your commits.

SimonCqk • Aug 16 '22 13:08

Codecov Report

Merging #267 (90832ba) into master (171c0d7) will increase coverage by 0.18%. The diff coverage is 2.98%.

@@            Coverage Diff             @@
##           master     #267      +/-   ##
==========================================
+ Coverage   28.93%   29.12%   +0.18%     
==========================================
  Files          88       89       +1     
  Lines        5985     6260     +275     
==========================================
+ Hits         1732     1823      +91     
- Misses       4000     4174     +174     
- Partials      253      263      +10     

| Flag | Coverage Δ |
|---|---|
| unittests | 29.12% <2.98%> (+0.18%) :arrow_up: |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|---|---|
| apis/training/v1alpha1/pytorchjob_defaults.go | 17.85% <0.00%> (-1.38%) :arrow_down: |
| apis/training/v1alpha1/pytorchjob_types.go | 100.00% <ø> (ø) |
| apis/training/v1alpha1/zz_generated.deepcopy.go | 14.02% <0.00%> (-0.65%) :arrow_down: |
| controllers/pytorch/elastic_scale.go | 34.04% <ø> (ø) |
| controllers/pytorch/pytorchjob_controller.go | 0.52% <0.00%> (-0.09%) :arrow_down: |
| controllers/pytorch/util.go | 0.00% <0.00%> (ø) |
| pkg/job_controller/job.go | 24.92% <0.00%> (+0.23%) :arrow_up: |
| pkg/job_controller/service.go | 0.00% <0.00%> (ø) |
| pkg/job_controller/util.go | 20.83% <100.00%> (-1.39%) :arrow_down: |

... and 11 more


codecov-commenter • Aug 26 '22 02:08