Support for TorchSnapshot for efficient checkpoint saving and loading

Open · ananthsub opened this issue 2 years ago · 1 comment

🚀 Feature Request

Add an integration to use https://github.com/pytorch/torchsnapshot

Motivation

TorchSnapshot is a performant, memory-efficient checkpointing library for PyTorch applications, designed with large, complex distributed workloads in mind. Compared with torch.save/torch.load, it includes many optimizations that control memory usage and optimize checkpoint writing for DDP-style workloads. For more information, please check out the README: https://github.com/pytorch/torchsnapshot#why-torchsnapshot

Given that MosaicML/composer is also a performance-minded library, this integration seems like it'd be a nice addition to the project!

cc @yifuwang

[Optional] Implementation

Additional context

ananthsub · Oct 24 '22 18:10

Thank you for filing this @ananthsub, we will evaluate this integration and follow up when we've made a decision.

bandish-shah · Oct 25 '22 17:10