BrainMaGe
Bump pytorch-lightning from 0.8.1 to 1.6.0
Bumps pytorch-lightning from 0.8.1 to 1.6.0.
Release notes
Sourced from pytorch-lightning's releases.
PyTorch Lightning 1.6: Support Intel's Habana Accelerator, New efficient DDP strategy (Bagua), Manual Fault-tolerance, Stability and Reliability.
The core team is excited to announce the PyTorch Lightning 1.6 release ⚡
Highlights
PyTorch Lightning 1.6 is the work of 99 contributors who have worked on features, bug-fixes, and documentation for a total of over 750 commits since 1.5. This is our most active release yet. Here are some highlights:
Introducing Intel's Habana Accelerator
Lightning 1.6 now supports the Habana® framework, which includes Gaudi® AI training processors. Their heterogeneous architecture includes a cluster of fully programmable Tensor Processing Cores (TPC) along with its associated development tools and libraries, and a configurable Matrix Math engine.
You can leverage the Habana hardware to accelerate your Deep Learning training workloads simply by passing:
trainer = pl.Trainer(accelerator="hpu")

# single Gaudi training
trainer = pl.Trainer(accelerator="hpu", devices=1)

# distributed training with 8 Gaudi
trainer = pl.Trainer(accelerator="hpu", devices=8)
The Bagua Strategy
The Bagua Strategy is a deep learning acceleration framework that supports multiple, advanced distributed training algorithms with state-of-the-art system relaxation techniques. Enabling Bagua, which can be considerably faster than vanilla PyTorch DDP, is as simple as:
from pytorch_lightning.strategies import BaguaStrategy

trainer = pl.Trainer(strategy="bagua")

# or to choose a custom algorithm
trainer = pl.Trainer(strategy=BaguaStrategy(algorithm="gradient_allreduce"))  # default
Towards stable Accelerator, Strategy, and Plugin APIs
The Accelerator, Strategy, and Plugin APIs are a core part of PyTorch Lightning. They're where all the distributed boilerplate lives, and we're constantly working to improve both them and the overall PyTorch Lightning platform experience. In this release, we've made some large changes to achieve that goal. Not to worry, though! The only users affected by these changes are those who use custom implementations of Accelerator and Strategy (TrainingTypePlugin) as well as certain Plugins. In particular, we want to highlight the following changes:
- All TrainingTypePlugins have been renamed to Strategy (#11120). Strategy is a more appropriate name because it encompasses more than simply training communication. This change is now aligned with the changes we implemented in 1.5, which introduced the new strategy and devices flags to the Trainer (see the sketch after these notes).
... (truncated)
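To make the renamed-Strategy change above concrete, here is a minimal sketch of the strategy and devices Trainer flags it aligns with; the choice of two GPUs and plain DDP is an illustrative assumption, not something stated in the release notes.

import pytorch_lightning as pl

# Illustrative only: select the DDP strategy across 2 GPUs using the flags
# introduced in 1.5 ("ddp" resolves to DDPStrategy, one of the renamed
# Strategy classes in 1.6).
trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp")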
Changelog
Sourced from pytorch-lightning's changelog.
[1.6.0] - 2022-03-29
Added
- Allow logging to an existing run ID in MLflow with MLFlowLogger (#12290)
- Enable gradient accumulation using Horovod's backward_passes_per_step (#11911)
- Add new DETAIL log level to provide useful logs for improving monitoring and debugging of batch jobs (#11008)
- Added a flag SLURMEnvironment(auto_requeue=True|False) to control whether Lightning handles the requeuing (#10601)
- Fault Tolerant Manual
  - Add _Stateful protocol to detect if classes are stateful (#10646)
  - Add _FaultTolerantMode enum used to track different supported fault tolerant modes (#10645)
  - Add a _rotate_worker_indices utility to reload the state according to the latest worker (#10647)
  - Add stateful workers (#10674)
  - Add a utility to collect the states across processes (#10639)
  - Add logic to reload the states across data loading components (#10699)
  - Cleanup some fault tolerant utilities (#10703)
  - Enable Fault Tolerant Manual Training (#10707)
  - Broadcast the _terminate_gracefully to all processes and add support for DDP (#10638)
- Added support for re-instantiation of custom (subclasses of) DataLoaders returned in the *_dataloader() methods, i.e., automatic replacement of samplers now works with custom types of DataLoader (#10680)
- Added a function to validate if fault tolerant training is supported (#10465)
- Added a private callback to manage the creation and deletion of fault-tolerance checkpoints (#11862)
- Show a better error message when a custom DataLoader implementation is not well implemented and we need to reconstruct it (#10719)
- Show a better error message when a frozen dataclass is used as a batch (#10927)
- Save the Loop's state by default in the checkpoint (#10784)
- Added Loop.replace to easily switch one loop for another (#10324)
- Added support for --lr_scheduler=ReduceLROnPlateau to the LightningCLI (#10860)
- Added LightningCLI.configure_optimizers to override the configure_optimizers return value (#10860)
- Added LightningCLI(auto_registry) flag to register all subclasses of the registerable components automatically (#12108)
- Added a warning that shows when max_epochs in the Trainer is not set (#10700)
- Added support for returning a single Callback from LightningModule.configure_callbacks without wrapping it into a list (#11060)
- Added console_kwargs for RichProgressBar to initialize the inner Console (#10875)
- Added support for shorthand notation to instantiate loggers with the LightningCLI (#11533)
- Added a LOGGER_REGISTRY instance to register custom loggers to the LightningCLI (#11533)
- Added an info message when the Trainer arguments limit_*_batches, overfit_batches, or val_check_interval are set to 1 or 1.0 (#11950)
- Added a PrecisionPlugin.teardown method (#10990)
- Added LightningModule.lr_scheduler_step (#10249)
- Added support for no pre-fetching to DataFetcher (#11606)
- Added support for optimizer step progress tracking with manual optimization (#11848)
- Return the output of optimizer.step. This can be useful for LightningLite users, manual optimization users, or users overriding LightningModule.optimizer_step (#11711)
- Teardown the active loop and strategy on exception (#11620)
- Added a MisconfigurationException if a user-provided opt_idx in the scheduler config doesn't match the actual optimizer index of its respective optimizer (#11247)
- Added a loggers property to Trainer which returns a list of loggers provided by the user (#11683) (see the sketch after this list)
- Added a loggers property to LightningModule which retrieves the loggers property from Trainer (#11683)
- Added support for DDP when using a CombinedLoader for the training data (#11648)
- Added a warning when using DistributedSampler during validation/testing (#11479)
- Added support for the Bagua training strategy (#11146)
- Added support for manually returning a poptorch.DataLoader in a *_dataloader hook (#12116)
- Added rank_zero module to centralize utilities (#11747)
- Added _Stateful support for LightningDataModule (#11637)
- Added _Stateful support for PrecisionPlugin (#11638)
... (truncated)
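As a concrete illustration of one changelog entry above (the loggers property on Trainer, #11683), the sketch below passes two loggers to the Trainer and reads them back as a list; the specific logger classes and save directories are assumptions made for the example.

import pytorch_lightning as pl
from pytorch_lightning.loggers import CSVLogger, TensorBoardLogger

# Configure two loggers; per #11683, Trainer.loggers returns all of them as a list.
trainer = pl.Trainer(
    logger=[TensorBoardLogger(save_dir="logs/tb"), CSVLogger(save_dir="logs/csv")],
    max_epochs=1,
)
print(trainer.loggers)  # [TensorBoardLogger(...), CSVLogger(...)]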
Commits
- 44e3edb Cleanup CHANGELOG (#12507)
- e3893b9 Merge pull request #12509 from RobertLaurella/patch-1
- 041da41 Remove TPU Availability check from parse devices (#12326)
- 4fe0076 Prepare for the 1.6.0 release
- 17215ed Fix titles capitalization in docs
- a775804 Update Plugins doc (#12440)
- 71e25f3 Update CI in README.md (#12495)
- c6cb634 Add usage of Jupyter magic command for loggers (#12333)
- 42169a2 Add typing to LightningModule.trainer (#12345)
- 2de6a9b Fix warning message formatting in save_hyperparameters (#12498)
- Additional commits viewable in compare view
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.
Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:
- @dependabot rebase will rebase this PR
- @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
- @dependabot merge will merge this PR after your CI passes on it
- @dependabot squash and merge will squash and merge this PR after your CI passes on it
- @dependabot cancel merge will cancel a previously requested merge and block automerging
- @dependabot reopen will reopen this PR if it is closed
- @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
- @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
- @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
- @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
- @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language
You can disable automated security fix PRs for this repo from the Security Alerts page.
Hi Sarthak, since PyTorch Lightning has changed massively from 0.8.1 to 1.6.0, there are a lot of code changes to be made in order to use this newer version. The only way I see to do this is to rework the repository entirely. I just tested this with the latest Lightning version and the original code errors out.
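To give a sense of the scale of the change, here is a hedged before/after sketch of a single Trainer call; the arguments shown are assumptions about typical 0.8-era usage, not a survey of what BrainMaGe actually calls.

import pytorch_lightning as pl

# Roughly how a Trainer was configured against the 0.8.x API (illustrative; this
# would not run on 1.6 because distributed_backend has since been removed):
# trainer = pl.Trainer(gpus=1, distributed_backend="ddp", max_epochs=50)

# An equivalent call against the 1.6 API (illustrative):
trainer = pl.Trainer(accelerator="gpu", devices=1, strategy="ddp", max_epochs=50)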
We should discuss with @sbakas how to proceed, since this repo still shows up as active and no alternative has been provided to users for skull stripping (e.g., via GaNDLF).
Yep. Sounds like this situation needs to be handled delicately. :)
Okay. So, this is squarely between you and SB.