[Review #1051 First] Refactoring Ahead of Adding DeepSpeed Support
This PR includes #1051. Reviewing/merging it first is recommended.
This PR is the first stage of adding DeepSpeed support to Sockeye's `main` branch. It ports several changes from the `deepspeed` branch to improve code modularity and compatibility with PyTorch/DeepSpeed APIs. No DeepSpeed support is added yet.
CLI-level behavior is unchanged except for removing one unused option. Training/inference outputs are identical before and after the changes.
Changes include:
- Cleaning up GPU and CPU memory used during training initialization before starting the main training loop (sketched below).
- Moving the logic for flagging interleaved key-value parameters from `layers.py` to `model.py`.
- Refactoring the `LearningRateScheduler` API to be compatible with PyTorch/DeepSpeed scheduler APIs (sketched below).
- Refactoring the optimizer and learning rate scheduler creation code to improve modularity (sketched below).
- Introducing the `ModelWithLoss` object, which wraps a Sockeye model and its losses in a single callable module (sketched below).
- Refactoring primary and secondary worker logic to reduce redundant computation.
- Refactoring the code for saving and loading training states (sketched below).
- Adding utility code for managing model/training configurations.
- Removing the unused training option `--learning-rate-t-scale`.
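
To illustrate the memory-cleanup item above: the general technique is to drop setup-time Python references and then release PyTorch's cached GPU blocks before the training loop begins. This is a minimal sketch of the idea, not Sockeye's actual code; the helper name is hypothetical.

```python
import gc

import torch


def cleanup_after_initialization() -> None:
    """Hypothetical helper: release memory held only during training setup."""
    # Collect Python objects that were only needed during initialization.
    gc.collect()
    # Return cached, now-unused GPU blocks to the driver so the training
    # loop starts with as much free device memory as possible.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```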
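For the scheduler refactoring, a PyTorch-compatible scheduler keeps its state internally and rescales the optimizer's parameter-group learning rates on each `step()`. The sketch below shows the general shape using PyTorch's `_LRScheduler` base class; the class name and warmup formula are illustrative, not Sockeye's actual scheduler.

```python
import torch
from torch.optim.lr_scheduler import _LRScheduler


class InverseSquareRootScheduler(_LRScheduler):
    """Hypothetical scheduler following the PyTorch API: linear warmup
    followed by inverse-square-root decay."""

    def __init__(self, optimizer: torch.optim.Optimizer,
                 warmup_steps: int = 1000, last_epoch: int = -1):
        self.warmup_steps = warmup_steps
        super().__init__(optimizer, last_epoch)

    def get_lr(self):
        # `last_epoch` counts calls to `step()`; guard against step 0.
        step = max(1, self.last_epoch)
        scale = min(step / self.warmup_steps,
                    (self.warmup_steps / step) ** 0.5)
        # Schedulers return one learning rate per parameter group.
        return [base_lr * scale for base_lr in self.base_lrs]
```

Training code then calls `scheduler.step()` after each `optimizer.step()`, exactly as with any built-in PyTorch scheduler, which is also the calling convention DeepSpeed-style engines expect.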
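The creation code can then be concentrated in a single factory, as the modularity item above suggests. A sketch under assumed names (the function, its defaults, and the optimizer choice are illustrative):

```python
import torch
from torch.optim.lr_scheduler import LambdaLR


def create_optimizer_and_scheduler(model: torch.nn.Module,
                                   lr: float = 2e-4,
                                   warmup_steps: int = 1000):
    """Hypothetical factory: one place that wires up the optimizer and its
    learning rate scheduler, so callers never assemble them by hand."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    # Linear warmup then inverse-square-root decay, expressed as a LambdaLR
    # multiplier on the base learning rate.
    def inv_sqrt(step: int) -> float:
        step = max(1, step)
        return min(step / warmup_steps, (warmup_steps / step) ** 0.5)

    scheduler = LambdaLR(optimizer, lr_lambda=inv_sqrt)
    return optimizer, scheduler
```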
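The `ModelWithLoss` wrapper can be pictured as follows; the constructor and `forward` signatures here are assumptions for illustration, not the exact API introduced by this PR.

```python
from typing import Dict, List, Tuple

import torch


class ModelWithLoss(torch.nn.Module):
    """Sketch of the idea: bundle a model with its loss functions so a
    single forward call yields both outputs and loss values."""

    def __init__(self, model: torch.nn.Module,
                 losses: List[torch.nn.Module]):
        super().__init__()
        self.model = model
        self.losses = torch.nn.ModuleList(losses)

    def forward(self, inputs: Dict[str, torch.Tensor],
                labels: Dict[str, torch.Tensor]
                ) -> Tuple[Dict[str, torch.Tensor], List[torch.Tensor]]:
        outputs = self.model(**inputs)
        loss_values = [loss_fn(outputs, labels) for loss_fn in self.losses]
        return outputs, loss_values
```

Bundling loss computation into one `nn.Module` matters for the later DeepSpeed work: wrappers such as `DistributedDataParallel` (and DeepSpeed's engine) wrap a single module, so placing the losses inside `forward` keeps them covered by the wrapper.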
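Finally, the training-state refactoring centers on the standard `state_dict` round trip. A minimal sketch with assumed function names and checkpoint keys:

```python
import torch


def save_training_state(path: str, model, optimizer, scheduler,
                        step: int) -> None:
    """Hypothetical sketch: persist everything needed to resume training."""
    torch.save({
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
    }, path)


def load_training_state(path: str, model, optimizer, scheduler) -> int:
    """Restore model, optimizer, and scheduler state; return the step."""
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    scheduler.load_state_dict(state["scheduler"])
    return state["step"]
```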
Pull Request Checklist

- [x] Changes are complete (if posting work-in-progress code, prefix your pull request title with '[WIP]' until you can check this box).
- [x] Unit tests pass (`pytest`)
- [x] System tests pass (`pytest test/system`)
- [x] Passed code style checking (`./style-check.sh`)
- [x] You have considered writing a test
- [x] Updated major/minor version in `sockeye/__init__.py`. Major version bump if this is a backwards incompatible change.
- [x] Updated CHANGELOG.md
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
Thanks Michael! I approved #1051.
Thanks Tobi and Felix! I've made some improvements based on your feedback.