torchsnapshot
A performant, memory-efficient checkpointing library for PyTorch applications, designed with large, complex distributed workloads in mind.
### 🐛 Describe the bug Hello, I am working on training a pretrained Hugging Face model "t5-small". Using the torchsnapshot examples provided in the documentation, I am able to save/load...
### 🚀 The feature We'd like to be able to load tensors that are saved on disk but do not yet populate the destination module. ### Motivation, pitch Say we...
### 🚀 The feature Leverage local disk for async snapshot. ### Motivation, pitch TorchSnapshot supports async snapshot, which allows training to resume before the storage I/O of a snapshot completes....
### 🚀 The feature Use [`fsspec`](https://github.com/fsspec/filesystem_spec) as TorchSnapshot's backend. ### Motivation, pitch FSSpec is the de facto filesystem abstraction standard for Python. It supports many backends like `s3`, `gcs`, `webdav`...
Summary: Allows loading `state_dict`s from disk when the saved copy contains more elements than the pre-populated copy. Test plan: TODO Fixes #{issue number} Closes #101
### 🚀 The feature Hi! Is Python 3.12 support on the roadmap? ### Motivation, pitch As it will soon be the de facto version to use, I guess...
Summary: The code calling `sock.getsockname` currently expects the address to always be IPv6. For an IPv4 socket, the call returns two values instead of four. This PR fixes that. Test plan: Not sure how to test...
https://github.com/pytorch/torchsnapshot/blob/e8a1fc097b4138493f05e5cac986730d116d0063/torchsnapshot/dist_store.py#L69
```
master_addr, master_port, _, _ = sock.getsockname()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: not enough values to unpack (expected 4, got 2)
```
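The unpacking error above comes from the shape of the address tuple: Python's `socket.getsockname()` returns `(host, port)` for `AF_INET` sockets and `(host, port, flowinfo, scope_id)` for `AF_INET6` sockets. A minimal sketch of one way to handle both shapes follows. The helper name `host_and_port` is hypothetical and is not taken from the torchsnapshot code base, and this is not necessarily how the actual fix was written.

```python
import socket
from typing import Tuple


def host_and_port(sockname: tuple) -> Tuple[str, int]:
    """Return (host, port) from a getsockname() result.

    Hypothetical helper: works for both the 2-tuple returned for
    AF_INET sockets and the 4-tuple returned for AF_INET6 sockets,
    because host and port are always the first two elements.
    """
    host, port = sockname[:2]
    return host, port


# Demonstration on an IPv4 socket bound to an OS-chosen free port.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    sock.bind(("127.0.0.1", 0))
    master_addr, master_port = host_and_port(sock.getsockname())
```

Slicing the first two elements avoids branching on the address family, at the cost of silently discarding `flowinfo` and `scope_id` when they are present.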
### 📚 Question From what I see, this library's s3 storage plugin uses the [aiobotocore](https://github.com/aio-libs/aiobotocore) library's `put_object` function, which sends a single PutObject request, rather than using multipart uploads for...
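To make the difference between a single PutObject request and a multipart upload concrete, here is a minimal sketch of how an object could be split into parts before uploading. It only plans byte ranges and makes no network calls. The function name `plan_parts` and the 8 MiB default part size are assumptions chosen for illustration and are not part of torchsnapshot or aiobotocore. The two limits it encodes (at least 5 MiB per part except the last, at most 10,000 parts per upload) are S3's documented multipart constraints.

```python
from typing import List, Tuple

MIN_PART_SIZE = 5 * 1024 * 1024  # S3 minimum for every part except the last
MAX_PARTS = 10_000               # S3 maximum number of parts per upload


def plan_parts(
    total_size: int, part_size: int = 8 * 1024 * 1024
) -> List[Tuple[int, int, int]]:
    """Return (part_number, start, end) ranges, with end exclusive.

    Hypothetical helper: part numbers start at 1, matching S3's
    numbering. The part size is increased when needed so the plan
    never exceeds MAX_PARTS.
    """
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the S3 minimum")

    # Ceiling division: smallest part size that fits within MAX_PARTS.
    min_needed = -(-total_size // MAX_PARTS)
    part_size = max(part_size, min_needed)

    parts = []
    start = 0
    number = 1
    while start < total_size:
        end = min(start + part_size, total_size)
        parts.append((number, start, end))
        start = end
        number += 1
    return parts
```

Each planned range would correspond to one part upload request, followed by a final request that completes the upload. Unlike a single PutObject call, a failed part can be retried on its own.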