torchsnapshot
A performant, memory-efficient checkpointing library for PyTorch applications, designed with large, complex distributed workloads in mind.
Summary: This is to mitigate confusion about what went wrong with the snapshot. Reviewed By: JKSenthil Differential Revision: D54705863
### 🐛 Describe the bug As part of our CI, we run tests against torchsnapshot nightlies. Some signatures require Python > 3.9, as shown [here](https://github.com/pytorch/rl/actions/runs/8178865549/job/22363713784). The error reads ``` File "/pytorch/rl/env/lib/python3.9/site-packages/torchsnapshot/io_preparers/sharded_tensor.py", line...
Reviewed By: connernilsen Differential Revision: D54436435
### 🐛 Describe the bug I'm running the example in `examples/torchrec/main.py` to produce a checkpoint on a multi-GPU node and then load it. I'm running on 1 node with...
Summary: see if this fixes the error ``` tests/test_uvm_tensor.py::test_uvm_tensor - RuntimeError: CUDA error: invalid device ordinal ``` https://github.com/pytorch/torchsnapshot/actions/runs/5422322107/jobs/9858818161 Differential Revision: D47158196
Differential Revision: D49399892
Summary: Attempt to fix torchsnapshot CI: https://github.com/pytorch/torchsnapshot/actions/runs/5766115388/job/15694536972 ``` tests/test_uvm_tensor.py::test_uvm_tensor FAILED [100%] =================================== FAILURES =================================== _______________________________ test_uvm_tensor ________________________________ pytest.mark.cpu_and_gpu def test_uvm_tensor() -> None: if torch.cuda.is_available() and _UVM_TENSOR_AVAILABLE: uvm_tensor = torch.rand( (64,...
As part of the ShardedTensor deprecation, we are starting to clean up its use in torchsnapshot. This is the first PR in a series, and we want to get feedback...
### 🐛 Describe the bug When loading a snapshot from S3, we are seeing a NoCredentials issue that occurs at random intervals. The issue is very similar to this from...