torchtune

[feature request] support input/output to fsspec path

Open leoleoasd opened this issue 11 months ago • 2 comments

Support loading models from S3 and saving checkpoints to an fsspec path.

leoleoasd avatar Dec 31 '24 07:12 leoleoasd

Hi @leoleoasd, thanks for creating the issue. @joecummings is working on updating some of our checkpointing abstractions, so he may have some thoughts on this.

ebsmothers avatar Jan 02 '25 17:01 ebsmothers

I did a little research and found that safetensors' `safe_open` uses Rust's `File::open` and mmap:

https://github.com/huggingface/safetensors/blob/e61e87240d0eabc9749a67ccebe38dca620d48b4/bindings/python/src/lib.rs#L396-L399

So this may not be possible with safetensors, since mmap needs a local file?

leoleoasd avatar Jan 02 '25 17:01 leoleoasd
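For context on the mmap question above: the safetensors file layout itself is an 8-byte little-endian header length, a JSON header, then the raw tensor bytes, so it can in principle be read from any seekable binary stream without mmap. The sketch below uses only the standard library and an in-memory `io.BytesIO` as a stand-in for a remote file object (for example, one returned by `fsspec.open(..., "rb")`); `build_blob` and `read_tensor_bytes` are hypothetical helper names, not part of safetensors or torchtune. In practice, `safetensors.torch.load`, which accepts a `bytes` object, is likely the simpler route.

```python
import io
import json
import struct


def build_blob(name: str, raw: bytes, dtype: str, shape: list) -> bytes:
    """Build a minimal safetensors-layout blob holding one tensor (hypothetical helper)."""
    header = {name: {"dtype": dtype, "shape": shape, "data_offsets": [0, len(raw)]}}
    header_bytes = json.dumps(header).encode("utf-8")
    # Layout: u64 little-endian header length, JSON header, raw tensor bytes.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + raw


def read_tensor_bytes(stream, name: str) -> tuple:
    """Read one tensor's header entry and raw bytes from a seekable binary stream."""
    (header_len,) = struct.unpack("<Q", stream.read(8))
    header = json.loads(stream.read(header_len))
    entry = header[name]
    start, end = entry["data_offsets"]
    # Offsets are relative to the start of the data section, after the header.
    stream.seek(8 + header_len + start)
    return entry, stream.read(end - start)


# io.BytesIO stands in for a remote file-like object here (an assumption).
blob = build_blob("w", struct.pack("<2f", 1.0, 2.0), "F32", [2])
entry, data = read_tensor_bytes(io.BytesIO(blob), "w")
values = struct.unpack("<2f", data)
```

This only shows that the format does not require mmap; it does not change how `safe_open` itself behaves.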

I'll have to dig a little deeper on safetensors specifically, but I think the PyTorch Checkpointing team has worked on enabling S3 read/write through DCP StorageReaders and StorageWriters: https://pytorch.org/docs/main/distributed.checkpoint.html#torch.distributed.checkpoint.StorageReader

@saumishr / @ankitageorge Is there a good resource for how to create your own adapter and what might be needed for S3?

joecummings avatar Jan 10 '25 19:01 joecummings

I have a question about Custom Checkpointing interfaces.

I want to write a custom checkpointing class that simply uploads the data to S3 (at a path I choose) once the current interface finishes writing it to disk. As long as the current interface returns a path to the written checkpoint along with some metadata about what was written (loss/epoch/step), I can figure out the rest. Is the current interface capable of this?

valayDave avatar Feb 01 '25 00:02 valayDave
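The pattern described above can be sketched as a thin wrapper around any save function that returns the local path it wrote. Everything here is hypothetical and standard-library only: `UploadingCheckpointer`, `CheckpointInfo`, `make_local_saver`, and `make_copy_uploader` are illustrative names, not torchtune APIs, and whether torchtune's checkpointers return a path and metadata in this form is not confirmed in this thread. The "upload" is a local copy standing in for an S3 client call.

```python
import shutil
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable


@dataclass
class CheckpointInfo:
    """What the wrapper reports back: where the file is and what it describes."""
    path: Path
    metadata: dict = field(default_factory=dict)


class UploadingCheckpointer:
    """Hypothetical wrapper: save locally first, then hand the path to an uploader."""

    def __init__(self, save_fn: Callable[[bytes], Path],
                 upload_fn: Callable[[Path, dict], None]):
        self._save_fn = save_fn
        self._upload_fn = upload_fn

    def save(self, state: bytes, **metadata) -> CheckpointInfo:
        local_path = self._save_fn(state)      # existing save logic (assumed to return a path)
        self._upload_fn(local_path, metadata)  # runs only after the local write finished
        return CheckpointInfo(local_path, metadata)


def make_local_saver(out_dir: Path) -> Callable[[bytes], Path]:
    """Stand-in for a real checkpointer's save method."""
    def save(state: bytes) -> Path:
        path = out_dir / "checkpoint.bin"
        path.write_bytes(state)
        return path
    return save


def make_copy_uploader(remote_dir: Path) -> Callable[[Path, dict], None]:
    """Stand-in for an S3 upload: copies the file into another directory."""
    def upload(path: Path, metadata: dict) -> None:
        shutil.copy(path, remote_dir / f"step{metadata.get('step', 0)}_{path.name}")
    return upload
```

A real uploader would replace `make_copy_uploader` with a call into an S3 client or an fsspec filesystem; the wrapper itself would not need to change.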