DeepSpeedExamples
Bump pytorch-lightning from 1.0.4 to 1.6.0 in /training/MoQ/huggingface-transformers/examples/research_projects/pplm
Bumps pytorch-lightning from 1.0.4 to 1.6.0.
Release notes
Sourced from pytorch-lightning's releases.
PyTorch Lightning 1.6: Support Intel's Habana Accelerator, New efficient DDP strategy (Bagua), Manual Fault-tolerance, Stability and Reliability.
The core team is excited to announce the PyTorch Lightning 1.6 release ⚡
Highlights
PyTorch Lightning 1.6 is the work of 99 contributors who have worked on features, bug-fixes, and documentation for a total of over 750 commits since 1.5. This is our most active release yet. Here are some highlights:
Introducing Intel's Habana Accelerator
Lightning 1.6 now supports the Habana® framework, which includes Gaudi® AI training processors. Their heterogeneous architecture includes a cluster of fully programmable Tensor Processing Cores (TPC), a configurable Matrix Math engine, and the associated development tools and libraries.
You can leverage the Habana hardware to accelerate your Deep Learning training workloads simply by passing:
trainer = pl.Trainer(accelerator="hpu")single Gaudi training
trainer = pl.Trainer(accelerator="hpu", devices=1)
distributed training with 8 Gaudi
trainer = pl.Trainer(accelerator="hpu", devices=8)
The Bagua Strategy
The Bagua Strategy is a deep learning acceleration framework that supports multiple, advanced distributed training algorithms with state-of-the-art system relaxation techniques. Enabling Bagua, which can be considerably faster than vanilla PyTorch DDP, is as simple as:
trainer = pl.Trainer(strategy="bagua")or to choose a custom algorithm
trainer = pl.Trainer(strategy=BaguaStrategy(algorithm="gradient_allreduce") # default
Towards stable Accelerator, Strategy, and Plugin APIs
The Accelerator, Strategy, and Plugin APIs are a core part of PyTorch Lightning. They're where all the distributed boilerplate lives, and we're constantly working to improve both them and the overall PyTorch Lightning platform experience. In this release, we've made some large changes to achieve that goal. Not to worry, though! The only users affected by these changes are those who use custom implementations of Accelerator and Strategy (TrainingTypePlugin), as well as certain Plugins. In particular, we want to highlight the following changes:
- All TrainingTypePlugins have been renamed to Strategy (#11120). Strategy is a more appropriate name because it encompasses more than simply training communication. This change is now aligned with the changes we implemented in 1.5, which introduced the new strategy and devices flags to the Trainer.
... (truncated)
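For context, here is a minimal sketch of the strategy and devices Trainer flags referenced above; the device count and DDP choice are illustrative assumptions, not part of the release notes.

import pytorch_lightning as pl

# pre-1.5 style: the distributed mode was passed through `accelerator`
# trainer = pl.Trainer(accelerator="ddp", gpus=4)

# 1.5+/1.6 style: hardware goes in `accelerator`/`devices`,
# the distributed mode goes in `strategy`
trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="ddp")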
Changelog
Sourced from pytorch-lightning's changelog.
[1.6.0] - 2022-03-29
Added
- Allow logging to an existing run ID in MLflow with MLFlowLogger (#12290)
- Enable gradient accumulation using Horovod's backward_passes_per_step (#11911)
- Add new DETAIL log level to provide useful logs for improving monitoring and debugging of batch jobs (#11008)
- Added a flag SLURMEnvironment(auto_requeue=True|False) to control whether Lightning handles the requeuing (#10601)
- Fault Tolerant Manual
  - Add _Stateful protocol to detect if classes are stateful (#10646)
  - Add _FaultTolerantMode enum used to track different supported fault tolerant modes (#10645)
  - Add a _rotate_worker_indices utility to reload the state according to the latest worker (#10647)
  - Add stateful workers (#10674)
  - Add a utility to collect the states across processes (#10639)
  - Add logic to reload the states across data loading components (#10699)
  - Cleanup some fault tolerant utilities (#10703)
  - Enable Fault Tolerant Manual Training (#10707)
  - Broadcast the _terminate_gracefully to all processes and add support for DDP (#10638)
- Added support for re-instantiation of custom (subclasses of) DataLoaders returned in the *_dataloader() methods, i.e., automatic replacement of samplers now works with custom types of DataLoader (#10680)
- Added a function to validate if fault tolerant training is supported (#10465)
- Added a private callback to manage the creation and deletion of fault-tolerance checkpoints (#11862)
- Show a better error message when a custom DataLoader implementation is not well implemented and we need to reconstruct it (#10719)
- Show a better error message when a frozen dataclass is used as a batch (#10927)
- Save the Loop's state by default in the checkpoint (#10784)
- Added Loop.replace to easily switch one loop for another (#10324)
- Added support for --lr_scheduler=ReduceLROnPlateau to the LightningCLI (#10860)
- Added LightningCLI.configure_optimizers to override the configure_optimizers return value (#10860)
- Added LightningCLI(auto_registry) flag to register all subclasses of the registerable components automatically (#12108)
- Added a warning that shows when max_epochs in the Trainer is not set (#10700)
- Added support for returning a single Callback from LightningModule.configure_callbacks without wrapping it into a list (#11060)
- Added console_kwargs for RichProgressBar to initialize the inner Console (#10875)
- Added support for shorthand notation to instantiate loggers with the LightningCLI (#11533)
- Added a LOGGER_REGISTRY instance to register custom loggers to the LightningCLI (#11533)
- Added an info message when the Trainer arguments limit_*_batches, overfit_batches, or val_check_interval are set to 1 or 1.0 (#11950)
- Added a PrecisionPlugin.teardown method (#10990)
- Added LightningModule.lr_scheduler_step (#10249)
- Added support for no pre-fetching to DataFetcher (#11606)
- Added support for optimizer step progress tracking with manual optimization (#11848)
- Return the output of optimizer.step. This can be useful for LightningLite users, manual optimization users, or users overriding LightningModule.optimizer_step (#11711)
- Teardown the active loop and strategy on exception (#11620)
- Added a MisconfigurationException if the opt_idx provided by the user in the scheduler config doesn't match the actual optimizer index of its respective optimizer (#11247)
- Added a loggers property to Trainer which returns a list of loggers provided by the user (#11683)
- Added a loggers property to LightningModule which retrieves the loggers property from Trainer (#11683)
- Added support for DDP when using a CombinedLoader for the training data (#11648)
- Added a warning when using DistributedSampler during validation/testing (#11479)
- Added support for the Bagua training strategy (#11146)
- Added support for manually returning a poptorch.DataLoader in a *_dataloader hook (#12116)
- Added a rank_zero module to centralize utilities (#11747)
- Added _Stateful support for LightningDataModule (#11637)
- Added _Stateful support for PrecisionPlugin (#11638)
... (truncated)
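As a rough illustration of two of the additions above, the sketch below attaches an MLFlowLogger to an existing MLflow run and reads back the new Trainer.loggers property. It assumes the mlflow package is installed, and the experiment name and run ID are hypothetical placeholders.

import pytorch_lightning as pl
from pytorch_lightning.loggers import MLFlowLogger

# Resume logging into an existing MLflow run via the new run_id argument (#12290).
# "my-experiment" and "abc123" are placeholder values.
mlf_logger = MLFlowLogger(experiment_name="my-experiment", run_id="abc123")

trainer = pl.Trainer(logger=mlf_logger, max_epochs=1)

# New in 1.6: Trainer.loggers returns the list of loggers provided by the user (#11683).
print(trainer.loggers)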
Commits
- 44e3edb Cleanup CHANGELOG (#12507)
- e3893b9 Merge pull request #12509 from RobertLaurella/patch-1
- 041da41 Remove TPU Availability check from parse devices (#12326)
- 4fe0076 Prepare for the 1.6.0 release
- 17215ed Fix titles capitalization in docs
- a775804 Update Plugins doc (#12440)
- 71e25f3 Update CI in README.md (#12495)
- c6cb634 Add usage of Jupyter magic command for loggers (#12333)
- 42169a2 Add typing to LightningModule.trainer (#12345)
- 2de6a9b Fix warning message formatting in save_hyperparameters (#12498)
- Additional commits viewable in compare view
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.
Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:
- @dependabot rebase will rebase this PR
- @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
- @dependabot merge will merge this PR after your CI passes on it
- @dependabot squash and merge will squash and merge this PR after your CI passes on it
- @dependabot cancel merge will cancel a previously requested merge and block automerging
- @dependabot reopen will reopen this PR if it is closed
- @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

You can disable automated security fix PRs for this repo from the Security Alerts page.