Support for TorchSnapshot for efficient checkpoint saving and loading
🚀 Feature Request
Add an integration to use https://github.com/pytorch/torchsnapshot
Motivation
TorchSnapshot is a performant, memory-efficient checkpointing library for PyTorch applications, designed with large, complex distributed workloads in mind. Compared with torch.save/torch.load, it includes many optimizations that limit memory usage and speed up checkpoint writing for DDP-style workloads. For more information, please check out the readme: https://github.com/pytorch/torchsnapshot#why-torchsnapshot
Given MosaicML/composer is also a performance-minded library, this integration seems like it'd be a nice addition to the project!
cc @yifuwang
[Optional] Implementation
Additional context
Thank you for filing this, @ananthsub. We will evaluate this integration and follow up once we've made a decision.