pytorch-distributed
Bump horovod from 0.18.2 to 0.24.0
Bumps horovod from 0.18.2 to 0.24.0.
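After merging a bump like this, it can be useful to confirm that the environment actually picked up the new release. The following is a minimal sketch using only the standard library; the helper names `parse_release`, `meets_minimum`, and `installed_horovod_version` are hypothetical, and the parser assumes plain dotted numeric versions (a suffix such as `rc1` or `.post1` is ignored rather than ordered).

```python
from importlib import metadata


def parse_release(version: str) -> tuple:
    """Turn '0.24.0' into (0, 24, 0); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "0.24.0") -> bool:
    """Compare release tuples component by component."""
    return parse_release(installed) >= parse_release(minimum)


def installed_horovod_version():
    """Return the installed horovod version string, or None if it is absent."""
    try:
        return metadata.version("horovod")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_horovod_version()
    if found is None:
        print("horovod is not installed in this environment")
    else:
        print(f"horovod {found}: meets 0.24.0 -> {meets_minimum(found)}")
```

Reading the version from package metadata avoids importing horovod itself, so the check also works on machines where the native extensions are not built.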
Release notes
Sourced from horovod's releases.
Elastic mode improvements, MXNet async dependency engine, fixes for latest PyTorch and TensorFlow versions
Added
- Ray: Added elastic keyword parameters to RayExecutor API: This API supports both static (non-elastic) and elastic Horovod jobs. (#3190)
- TensorFlow: Added in-place broadcasting of variables. (#3128)
- Elastic: Added support for resurrecting blacklisted hosts. (#3319)
- MXNet: Added support for MXNet async dependency engine. (#3242, #2963)
- Spark/Lightning: Added history to lightning estimator. (#3214)
Changed
- Moved to CMake version 3.13 with first-class CUDA language support and re-enabled parallelized builds. Uses a temporary installation of CMake if CMake 3.13 is not found. (#3261, #3371)
- Moved released Docker image horovod and horovod-cpu to Ubuntu 20.04 and Python 3.8. (#3393)
- Spark Estimator: Don't shuffle row groups if training data requires non-shuffle (#3369)
- Spark/Lightning: Reduced memory footprint of async dataloader. (#3239)
- Elastic: Improved handling NCCL errors under elastic scenario. (#3112)
- Spark/Lightning: Do not overwrite model with checkpoint by default. (#3201)
- Make checkpoint name optional so that user can save to h5 format. (#3411)
Deprecated
- Deprecated ElasticRayExecutor APIs in favor of the new RayExecutor API. (#3190)
Removed
- Spark: Removed h5py<3 constraint as this is not needed anymore for Tensorflow >2.5.0. (#3301)
Fixed
- Elastic Spark: Fixed indices in initial task-to-task registration. (#3410)
- PyTorch: Fixed GIL-related deadlock with PyTorch 1.10.1. (#3352)
- PyTorch: Fixed finalization of ProcessSetTable. (#3351)
- Fixed remote trainers to point to the correct shared lib path. (#3258)
- Fixed imports from tensorflow.python.keras with tensorflow 2.6.0+. (#3403)
- Fixed Adasum communicator init logic. (#3379)
- Lightning: Fixed resume logger. (#3375)
- Fixed the checkpoint directory structure for pytorch and pytorch lightning. (#3362)
- Fixed possible integer overflow in multiplication. (#3368)
- Fixed the pytorch_lightning_mnist.py example. (#3245, #3290)
- Fixed barrier segmentation fault. (#3313)
- Fixed hvd.barrier() tensor queue management. (#3300)
- Fixed PyArrow "list index out of range" IndexError. (#3274)
- Elastic: Fixed all workers sometimes failing on elastic Horovod failure. (#3264)
- Spark/Lightning: Fixed setting limit_train_batches and limit_val_batches. (#3237)
- Elastic: Fixed ElasticSampler and hvd.elastic.state losing some indices of processed samples when nodes dropped. (#3143)
- Spark/Lightning: Fixed history metrics for estimator serialization. (#3216)
- Ray: Fixed RayExecutor to fail when num_workers=0 and num_hosts=None. (#3210)
- Spark/Lightning: Fixed checkpoint callback dirpath typo. (#3204)
Process sets, XLA support, improved GPU backend
... (truncated)
Changelog
Sourced from horovod's changelog.
[v0.24.0] - 2022-03-01
Added
- Ray: Added elastic keyword parameters to RayExecutor API: This API supports both static (non-elastic) and elastic Horovod jobs. (#3190)
- TensorFlow: Added in-place broadcasting of variables. (#3128)
- Elastic: Added support for resurrecting blacklisted hosts. (#3319)
- MXNet: Added support for MXNet async dependency engine. (#3242, #2963)
- Spark/Lightning: Added history to lightning estimator. (#3214)
Changed
- Moved to CMake version 3.13 with first-class CUDA language support and re-enabled parallelized builds. Uses a temporary installation of CMake if CMake 3.13 is not found. (#3261, #3371)
- Moved released Docker image horovod and horovod-cpu to Ubuntu 20.04 and Python 3.8. (#3393)
- Spark Estimator: Don't shuffle row groups if training data requires non-shuffle (#3369)
- Spark/Lightning: Reduced memory footprint of async dataloader. (#3239)
- Elastic: Improved handling NCCL errors under elastic scenario. (#3112)
- Spark/Lightning: Do not overwrite model with checkpoint by default. (#3201)
- Make checkpoint name optional so that user can save to h5 format. (#3411)
Deprecated
- Deprecated ElasticRayExecutor APIs in favor of the new RayExecutor API. (#3190)
Removed
- Spark: Removed h5py<3 constraint as this is not needed anymore for Tensorflow >2.5.0. (#3301)
Fixed
- Elastic Spark: Fixed indices in initial task-to-task registration. (#3410)
- PyTorch: Fixed GIL-related deadlock with PyTorch 1.10.1. (#3352)
- PyTorch: Fixed finalization of ProcessSetTable. (#3351)
- Fixed remote trainers to point to the correct shared lib path. (#3258)
- Fixed imports from tensorflow.python.keras with tensorflow 2.6.0+. (#3403)
- Fixed Adasum communicator init logic. (#3379)
- Lightning: Fixed resume logger. (#3375)
- Fixed the checkpoint directory structure for pytorch and pytorch lightning. (#3362)
- Fixed possible integer overflow in multiplication. (#3368)
- Fixed the pytorch_lightning_mnist.py example. (#3245, #3290)
- Fixed barrier segmentation fault. (#3313)
- Fixed hvd.barrier() tensor queue management. (#3300)
- Fixed PyArrow "list index out of range" IndexError. (#3274)
- Elastic: Fixed all workers sometimes failing on elastic Horovod failure. (#3264)
- Spark/Lightning: Fixed setting limit_train_batches and limit_val_batches. (#3237)
- Elastic: Fixed ElasticSampler and hvd.elastic.state losing some indices of processed samples when nodes dropped. (#3143)
- Spark/Lightning: Fixed history metrics for estimator serialization. (#3216)
- Ray: Fixed RayExecutor to fail when num_workers=0 and num_hosts=None. (#3210)
- Spark/Lightning: Fixed checkpoint callback dirpath typo. (#3204)
... (truncated)
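The ElasticSampler entry (#3143) concerns processed sample indices going missing when nodes drop. As a rough illustration of the invariant involved, the toy below tracks processed indices and re-shards only the remainder when the worker count changes. It is a sketch under stated assumptions, not Horovod's implementation: `ToyElasticSampler` and `simulate_drop` are hypothetical names, and sharing the processed set between workers is simulated with a plain Python set instead of real communication.

```python
class ToyElasticSampler:
    """Toy illustration only; this is not Horovod's ElasticSampler."""

    def __init__(self, dataset_size, num_replicas, rank):
        self.dataset_size = dataset_size
        self.processed = set()
        self.reset(num_replicas, rank)

    def record(self, indices):
        """Remember indices that some worker has already consumed."""
        self.processed.update(indices)

    def reset(self, num_replicas, rank):
        """Re-shard only the samples that have not been processed yet."""
        remaining = [i for i in range(self.dataset_size)
                     if i not in self.processed]
        self.indices = remaining[rank::num_replicas]

    def __iter__(self):
        return iter(self.indices)


def simulate_drop(dataset_size=12):
    """Three workers each consume two samples, then the third one drops."""
    workers = [ToyElasticSampler(dataset_size, 3, r) for r in range(3)]

    # Union of everything consumed so far (stands in for a real sync step).
    done = set()
    for w in workers:
        done.update(list(w)[:2])

    # The two survivors record the shared set and re-shard with new ranks.
    survivors = workers[:2]
    for new_rank, w in enumerate(survivors):
        w.record(done)
        w.reset(num_replicas=2, rank=new_rank)

    return done, [list(w) for w in survivors]


if __name__ == "__main__":
    done, shards = simulate_drop()
    print("processed before drop:", sorted(done))
    print("shards after drop:", shards)
```

The property to check is that the processed set and the new shards together cover every index exactly once, so nothing is skipped or repeated within the epoch.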
Commits
- b089df6 Bump version to 0.24.0 (#3433)
- db19aa4 Move apt-get into non-interactive mode (#3441)
- 2632c05 Build Horovod with temporarily installed CMake if necessary (#3371)
- 7bf9b04 Make checkpoint name optional so that user can save to h5 format. (#3411)
- b553974 Fix flaky ray tests (#3430)
- 7b5346e Fix indices in initial task-to-task registration (#3410)
- 71e10b4 Fixing GPU and CPU TF head CI failures (#3431)
- 79ded4b Fix FindNVTX.cmake (#3421)
- 642a6b3 [TF - Fix] Fix imports from tensorflow.python.keras with tf.version >= 2....
- 046c071 Allow stderr of executed cmake python code appear in logs (#3398)
- Additional commits viewable in compare view
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.
Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:
- @dependabot rebase will rebase this PR
- @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
- @dependabot merge will merge this PR after your CI passes on it
- @dependabot squash and merge will squash and merge this PR after your CI passes on it
- @dependabot cancel merge will cancel a previously requested merge and block automerging
- @dependabot reopen will reopen this PR if it is closed
- @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
- @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
- @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
- @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
- @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language
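The ignore commands above distinguish between major and minor versions. As a small aid for deciding which one is relevant to a given bump, the sketch below classifies a version change by the first component that differs. It assumes plain MAJOR.MINOR.PATCH numbering; `bump_kind` is a hypothetical helper, and how Dependabot itself classifies versions is not stated in this PR.

```python
def bump_kind(old: str, new: str) -> str:
    """Name the first differing component between two dotted versions."""
    old_parts = [int(p) for p in old.split(".")]
    new_parts = [int(p) for p in new.split(".")]

    # Pad to at least three components so '1.0' and '1.0.0' compare equal.
    width = max(len(old_parts), len(new_parts), 3)
    old_parts += [0] * (width - len(old_parts))
    new_parts += [0] * (width - len(new_parts))

    if new_parts[0] != old_parts[0]:
        return "major"
    if new_parts[1] != old_parts[1]:
        return "minor"
    if new_parts != old_parts:
        return "patch"
    return "none"


if __name__ == "__main__":
    print(bump_kind("0.18.2", "0.24.0"))
```

Under that assumption, the change from 0.18.2 to 0.24.0 differs first in the second component.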
You can disable automated security fix PRs for this repo from the Security Alerts page.