BrainMaGe
Bump pytorch-lightning from 0.8.1 to 1.6.0
Bumps pytorch-lightning from 0.8.1 to 1.6.0.
Release notes
Sourced from pytorch-lightning's releases.
PyTorch Lightning 1.6: Support Intel's Habana Accelerator, New efficient DDP strategy (Bagua), Manual Fault-tolerance, Stability and Reliability.
The core team is excited to announce the PyTorch Lightning 1.6 release ⚡
Highlights
PyTorch Lightning 1.6 is the work of 99 contributors who have worked on features, bug-fixes, and documentation for a total of over 750 commits since 1.5. This is our most active release yet. Here are some highlights:
Introducing Intel's Habana Accelerator
Lightning 1.6 now supports the Habana® framework, which includes Gaudi® AI training processors. Their heterogeneous architecture includes a cluster of fully programmable Tensor Processing Cores (TPC) along with its associated development tools and libraries, and a configurable Matrix Math engine.
You can leverage the Habana hardware to accelerate your Deep Learning training workloads simply by passing:
trainer = pl.Trainer(accelerator="hpu")

# single Gaudi training
trainer = pl.Trainer(accelerator="hpu", devices=1)

# distributed training with 8 Gaudi
trainer = pl.Trainer(accelerator="hpu", devices=8)
The Bagua Strategy
The Bagua Strategy is a deep learning acceleration framework that supports multiple, advanced distributed training algorithms with state-of-the-art system relaxation techniques. Enabling Bagua, which can be considerably faster than vanilla PyTorch DDP, is as simple as:
from pytorch_lightning.strategies import BaguaStrategy

trainer = pl.Trainer(strategy="bagua")

# or to choose a custom algorithm
trainer = pl.Trainer(strategy=BaguaStrategy(algorithm="gradient_allreduce"))  # default
Towards stable Accelerator, Strategy, and Plugin APIs
The Accelerator, Strategy, and Plugin APIs are a core part of PyTorch Lightning. They're where all the distributed boilerplate lives, and we're constantly working to improve both them and the overall PyTorch Lightning platform experience. In this release, we've made some large changes to achieve that goal. Not to worry, though! The only users affected by these changes are those who use custom implementations of Accelerator and Strategy (TrainingTypePlugin) as well as certain Plugins. In particular, we want to highlight the following changes:
- All TrainingTypePlugins have been renamed to Strategy (#11120). Strategy is a more appropriate name because it encompasses more than simply training communication. This change is now aligned with the changes we implemented in 1.5, which introduced the new strategy and devices flags to the Trainer (see the sketch after these notes).
... (truncated)
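To make the renamed-Strategy change above concrete, here is a minimal sketch of the strategy and devices Trainer flags it aligns with; the choice of two GPUs and plain DDP is an illustrative assumption, not something stated in the release notes.

import pytorch_lightning as pl

# Illustrative only: select the DDP strategy across 2 GPUs using the flags
# introduced in 1.5 ("ddp" resolves to DDPStrategy, one of the renamed
# Strategy classes in 1.6).
trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp")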
Changelog
Sourced from pytorch-lightning's changelog.
[1.6.0] - 2022-03-29
Added
- Allow logging to an existing run ID in MLflow with MLFlowLogger (#12290)
- Enable gradient accumulation using Horovod's backward_passes_per_step (#11911)
- Add new DETAIL log level to provide useful logs for improving monitoring and debugging of batch jobs (#11008)
- Added a flag SLURMEnvironment(auto_requeue=True|False) to control whether Lightning handles the requeuing (#10601)
- Fault Tolerant Manual
  - Add _Stateful protocol to detect if classes are stateful (#10646)
  - Add _FaultTolerantMode enum used to track different supported fault tolerant modes (#10645)
  - Add a _rotate_worker_indices utility to reload the state according to the latest worker (#10647)
  - Add stateful workers (#10674)
  - Add a utility to collect the states across processes (#10639)
  - Add logic to reload the states across data loading components (#10699)
  - Cleanup some fault tolerant utilities (#10703)
  - Enable Fault Tolerant Manual Training (#10707)
  - Broadcast the _terminate_gracefully to all processes and add support for DDP (#10638)
- Added support for re-instantiation of custom (subclasses of) DataLoaders returned in the *_dataloader() methods, i.e., automatic replacement of samplers now works with custom types of DataLoader (#10680)
- Added a function to validate if fault tolerant training is supported (#10465)
- Added a private callback to manage the creation and deletion of fault-tolerance checkpoints (#11862)
- Show a better error message when a custom DataLoader implementation is not well implemented and we need to reconstruct it (#10719)
- Show a better error message when a frozen dataclass is used as a batch (#10927)
- Save the Loop's state by default in the checkpoint (#10784)
- Added Loop.replace to easily switch one loop for another (#10324)
- Added support for --lr_scheduler=ReduceLROnPlateau to the LightningCLI (#10860)
- Added LightningCLI.configure_optimizers to override the configure_optimizers return value (#10860)
- Added LightningCLI(auto_registry) flag to register all subclasses of the registerable components automatically (#12108)
- Added a warning that shows when max_epochs in the Trainer is not set (#10700)
- Added support for returning a single Callback from LightningModule.configure_callbacks without wrapping it into a list (#11060)
- Added console_kwargs for RichProgressBar to initialize the inner Console (#10875)
- Added support for shorthand notation to instantiate loggers with the LightningCLI (#11533)
- Added a LOGGER_REGISTRY instance to register custom loggers to the LightningCLI (#11533)
- Added an info message when the Trainer arguments limit_*_batches, overfit_batches, or val_check_interval are set to 1 or 1.0 (#11950)
- Added a PrecisionPlugin.teardown method (#10990)
- Added LightningModule.lr_scheduler_step (#10249)
- Added support for no pre-fetching to DataFetcher (#11606)
- Added support for optimizer step progress tracking with manual optimization (#11848)
- Return the output of optimizer.step. This can be useful for LightningLite users, manual optimization users, or users overriding LightningModule.optimizer_step (#11711)
- Teardown the active loop and strategy on exception (#11620)
- Added a MisconfigurationException if a user-provided opt_idx in the scheduler config doesn't match the actual optimizer index of its respective optimizer (#11247)
- Added a loggers property to Trainer which returns a list of loggers provided by the user (#11683) (see the sketch after this list)
- Added a loggers property to LightningModule which retrieves the loggers property from Trainer (#11683)
- Added support for DDP when using a CombinedLoader for the training data (#11648)
- Added a warning when using DistributedSampler during validation/testing (#11479)
- Added support for the Bagua training strategy (#11146)
- Added support for manually returning a poptorch.DataLoader in a *_dataloader hook (#12116)
- Added rank_zero module to centralize utilities (#11747)
- Added _Stateful support for LightningDataModule (#11637)
- Added _Stateful support for PrecisionPlugin (#11638)
... (truncated)
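As a concrete illustration of one changelog entry above (the loggers property on Trainer, #11683), the sketch below passes two loggers to the Trainer and reads them back as a list; the specific logger classes and save directories are assumptions made for the example.

import pytorch_lightning as pl
from pytorch_lightning.loggers import CSVLogger, TensorBoardLogger

# Configure two loggers; per #11683, Trainer.loggers returns all of them as a list.
trainer = pl.Trainer(
    logger=[TensorBoardLogger(save_dir="logs/tb"), CSVLogger(save_dir="logs/csv")],
    max_epochs=1,
)
print(trainer.loggers)  # [TensorBoardLogger(...), CSVLogger(...)]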
Commits
- 44e3edb Cleanup CHANGELOG (#12507)
- e3893b9 Merge pull request #12509 from RobertLaurella/patch-1
- 041da41 Remove TPU Availability check from parse devices (#12326)
- 4fe0076 Prepare for the 1.6.0 release
- 17215ed Fix titles capitalization in docs
- a775804 Update Plugins doc (#12440)
- 71e25f3 Update CI in README.md (#12495)
- c6cb634 Add usage of Jupyter magic command for loggers (#12333)
- 42169a2 Add typing to LightningModule.trainer (#12345)
- 2de6a9b Fix warning message formatting in save_hyperparameters (#12498)
- Additional commits viewable in compare view
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.
Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:
- @dependabot rebase will rebase this PR
- @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
- @dependabot merge will merge this PR after your CI passes on it
- @dependabot squash and merge will squash and merge this PR after your CI passes on it
- @dependabot cancel merge will cancel a previously requested merge and block automerging
- @dependabot reopen will reopen this PR if it is closed
- @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
- @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
- @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
- @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
- @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language
You can disable automated security fix PRs for this repo from the Security Alerts page.
Hi Sarthak, since PyTorch Lightning has changed massively from 0.8.1 to 1.6.0, there are a lot of code changes to be made in order to use this newer version. The only way I see to do this is to rework the repository entirely. I just tested this with the latest Lightning version and the original code errors out.
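To give a sense of the scale of the change, here is a hedged before/after sketch of a single Trainer call; the arguments shown are assumptions about typical 0.8-era usage, not a survey of what BrainMaGe actually calls.

import pytorch_lightning as pl

# Roughly how a Trainer was configured against the 0.8.x API (illustrative; this
# would not run on 1.6 because distributed_backend has since been removed):
# trainer = pl.Trainer(gpus=1, distributed_backend="ddp", max_epochs=50)

# An equivalent call against the 1.6 API (illustrative):
trainer = pl.Trainer(accelerator="gpu", devices=1, strategy="ddp", max_epochs=50)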
We should discuss with @sbakas how to proceed, since this repo still shows up as active and no alternative has been provided to users for skull stripping (e.g., via GaNDLF).
Yep. Sounds like this situation needs to be handled delicately. :)
Okay. So, this is squarely between you and SB.