torchsnapshot issues

Improve error message in _read_snapshot_metadata

Summary: This is to mitigate confusion about what went wrong with the snapshot. Reviewed By: JKSenthil Differential Revision: D54705863

schwarzmx

CLA Signed

fb-exported

Compatibility with python < 3.10

2

Closes #169

vmoens

CLA Signed

torchsnapshot nightlies / torch nightlies are broken on python < 3.10

### 🐛 Describe the bug As part of our CI we incorporate torchsnapshot nightlies tests. Some signatures require python>3.9 as it appears [here](https://github.com/pytorch/rl/actions/runs/8178865549/job/22363713784) The error reads ``` File "/pytorch/rl/env/lib/python3.9/site-packages/torchsnapshot/io_preparers/sharded_tensor.py", line...

vmoens

Pyre Configurationless migration for] [batch:82/112] [shard:6/N]

2

Reviewed By: connernilsen Differential Revision: D54436435

connernilsen

CLA Signed

fb-exported

Unable to read ShardedTensor in torchrec example

3

### 🐛 Describe the bug I'm running the example in `examples/torchrec/main.py` to produce a checkpoint on a multi-gpu node and to subsequently load it. I'm running on 1 node with...

arashd

Check failing UVM test

3

Summary: see if this fixes the error ``` tests/test_uvm_tensor.py::test_uvm_tensor - RuntimeError: CUDA error: invalid device ordinal ``` https://github.com/pytorch/torchsnapshot/actions/runs/5422322107/jobs/9858818161 Differential Revision: D47158196

ananthsub

CLA Signed

fb-exported

Update docstring of handle_sharded_tensor_elasticity

2

Differential Revision: D49399892

daniellepintz

CLA Signed

fb-exported

Not ready for review

7

Summary: Attempt to fix torchsnapshot CI: https://github.com/pytorch/torchsnapshot/actions/runs/5766115388/job/15694536972 ``` tests/test_uvm_tensor.py::test_uvm_tensor FAILED [100%] =================================== FAILURES =================================== _______________________________ test_uvm_tensor ________________________________ pytest.mark.cpu_and_gpu def test_uvm_tensor() -> None: if torch.cuda.is_available() and _UVM_TENSOR_AVAILABLE: uvm_tensor = torch.rand( (64,...

daniellepintz

CLA Signed

fb-exported

Remove ST from torchsnapshot

2

As part of the ShardedTensor deprecation, we are starting to clean up its use in torchsnapshot. This is the first PR in a series, and we want to get feedback...

fduwjj

CLA Signed

[S3 storage_plugin] Seeing No credential issue at random intervals when saving / restoring snapshot from S3.

3

### 🐛 Describe the bug When loading a snapshot from S3, we see a Nocredentials issue at random intervals. The issue is very similar to this from...

hbikki
