[Review #1051 First] Refactoring Ahead of Adding DeepSpeed Support
This PR includes #1051. Reviewing/merging it first is recommended.
This PR is the first stage of adding DeepSpeed support to Sockeye's `main` branch. It ports several changes from the `deepspeed` branch to improve code modularity and compatibility with PyTorch/DeepSpeed APIs. No DeepSpeed support is added yet.
CLI-level behavior is unchanged except for removing one unused option. Training/inference outputs are identical before and after the changes.
Changes include:
- Cleaning up GPU and CPU memory used during training initialization before starting the main training loop (sketched below).
- Moving the logic for flagging interleaved key-value parameters from `layers.py` to `model.py`.
- Refactoring the `LearningRateScheduler` API to be compatible with PyTorch/DeepSpeed scheduler APIs (sketched below).
- Refactoring the optimizer and learning rate scheduler creation code to improve modularity (sketched below).
- Introducing the `ModelWithLoss` object, which wraps a Sockeye model and its losses in a single callable module (sketched below).
- Refactoring primary and secondary worker logic to reduce redundant computation.
- Refactoring the code for saving and loading training states (sketched below).
- Adding utility code for managing model/training configurations.
- Removing the unused training option `--learning-rate-t-scale`.
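
To illustrate the memory-cleanup item above: the general technique is to drop setup-time Python references and then release PyTorch's cached GPU blocks before the training loop begins. This is a minimal sketch of the idea, not Sockeye's actual code; the helper name is hypothetical.

```python
import gc

import torch


def cleanup_after_initialization() -> None:
    """Hypothetical helper: release memory held only during training setup."""
    # Collect Python objects that were only needed during initialization.
    gc.collect()
    # Return cached, now-unused GPU blocks to the driver so the training
    # loop starts with as much free device memory as possible.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```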
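For the scheduler refactoring, a PyTorch-compatible scheduler keeps its state internally and rescales the optimizer's parameter-group learning rates on each `step()`. The sketch below shows the general shape using PyTorch's `_LRScheduler` base class; the class name and warmup formula are illustrative, not Sockeye's actual scheduler.

```python
import torch
from torch.optim.lr_scheduler import _LRScheduler


class InverseSquareRootScheduler(_LRScheduler):
    """Hypothetical scheduler following the PyTorch API: linear warmup
    followed by inverse-square-root decay."""

    def __init__(self, optimizer: torch.optim.Optimizer,
                 warmup_steps: int = 1000, last_epoch: int = -1):
        self.warmup_steps = warmup_steps
        super().__init__(optimizer, last_epoch)

    def get_lr(self):
        # `last_epoch` counts calls to `step()`; guard against step 0.
        step = max(1, self.last_epoch)
        scale = min(step / self.warmup_steps,
                    (self.warmup_steps / step) ** 0.5)
        # Schedulers return one learning rate per parameter group.
        return [base_lr * scale for base_lr in self.base_lrs]
```

Training code then calls `scheduler.step()` after each `optimizer.step()`, exactly as with any built-in PyTorch scheduler, which is also the calling convention DeepSpeed-style engines expect.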
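The creation code can then be concentrated in a single factory, as the modularity item above suggests. A sketch under assumed names (the function, its defaults, and the optimizer choice are illustrative):

```python
import torch
from torch.optim.lr_scheduler import LambdaLR


def create_optimizer_and_scheduler(model: torch.nn.Module,
                                   lr: float = 2e-4,
                                   warmup_steps: int = 1000):
    """Hypothetical factory: one place that wires up the optimizer and its
    learning rate scheduler, so callers never assemble them by hand."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    # Linear warmup then inverse-square-root decay, expressed as a LambdaLR
    # multiplier on the base learning rate.
    def inv_sqrt(step: int) -> float:
        step = max(1, step)
        return min(step / warmup_steps, (warmup_steps / step) ** 0.5)

    scheduler = LambdaLR(optimizer, lr_lambda=inv_sqrt)
    return optimizer, scheduler
```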
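The `ModelWithLoss` wrapper can be pictured as follows; the constructor and `forward` signatures here are assumptions for illustration, not the exact API introduced by this PR.

```python
from typing import Dict, List, Tuple

import torch


class ModelWithLoss(torch.nn.Module):
    """Sketch of the idea: bundle a model with its loss functions so a
    single forward call yields both outputs and loss values."""

    def __init__(self, model: torch.nn.Module,
                 losses: List[torch.nn.Module]):
        super().__init__()
        self.model = model
        self.losses = torch.nn.ModuleList(losses)

    def forward(self, inputs: Dict[str, torch.Tensor],
                labels: Dict[str, torch.Tensor]
                ) -> Tuple[Dict[str, torch.Tensor], List[torch.Tensor]]:
        outputs = self.model(**inputs)
        loss_values = [loss_fn(outputs, labels) for loss_fn in self.losses]
        return outputs, loss_values
```

Bundling loss computation into one `nn.Module` matters for the later DeepSpeed work: wrappers such as `DistributedDataParallel` (and DeepSpeed's engine) wrap a single module, so placing the losses inside `forward` keeps them covered by the wrapper.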
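Finally, the training-state refactoring centers on the standard `state_dict` round trip. A minimal sketch with assumed function names and checkpoint keys:

```python
import torch


def save_training_state(path: str, model, optimizer, scheduler,
                        step: int) -> None:
    """Hypothetical sketch: persist everything needed to resume training."""
    torch.save({
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
    }, path)


def load_training_state(path: str, model, optimizer, scheduler) -> int:
    """Restore model, optimizer, and scheduler state; return the step."""
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    scheduler.load_state_dict(state["scheduler"])
    return state["step"]
```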
Pull Request Checklist

- [x] Changes are complete (if posting work-in-progress code, prefix your pull request title with '[WIP]' until you can check this box).
- [x] Unit tests pass (`pytest`)
- [x] System tests pass (`pytest test/system`)
- [x] Passed code style checking (`./style-check.sh`)
- [x] You have considered writing a test
- [x] Updated major/minor version in `sockeye/__init__.py`. Major version bump if this is a backwards incompatible change.
- [x] Updated CHANGELOG.md
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
Thanks Michael! I approved #1051.
Thanks Tobi and Felix! I've made some improvements based on your feedback.